I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (1GB+) XML file in PHP for parsing?
Answers
In PHP, you can read extremely large XML files with XMLReader:
$xmlfile = 'path/to/large.xml';

$reader = new XMLReader();
$reader->open($xmlfile);
Extremely large XML files should be stored compressed on disk; that makes sense because XML compresses very well. For example, gzipped as large.xml.gz.
PHP supports that quite well with XMLReader via the compression stream wrappers:
$xmlfile = 'compress.zlib://path/to/large.xml.gz';
$reader = new XMLReader();
$reader->open($xmlfile);
XMLReader lets you operate only on the current element. That means it's forward-only; if you need to keep parser state, you have to build and maintain it yourself.
I often find it helpful to wrap the basic cursor movements into a set of iterators that know how to operate on XMLReader, for example iterating over all elements or only over child elements. This is outlined in Parse XML with PHP and XMLReader.
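As a rough sketch of that idea (assuming PHP 5.5+ for generators; the elements() helper is a made-up name, not part of XMLReader):

<?php
// Sketch only: yield the reader each time it is positioned on an element
// with the given local name, so calling code can use a plain foreach loop.
function elements(XMLReader $reader, $localName)
{
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->localName === $localName
        ) {
            yield $reader;
        }
    }
}

$reader = new XMLReader();
$reader->open('compress.zlib://path/to/large.xml.gz');

foreach (elements($reader, 'Topic') as $r) {
    // Only the current element is touched; expand() returns it as a DOMNode
    // for further inspection without loading the rest of the document.
    $node = $r->expand();
    // ... process $node ...
}

$reader->close();

The same pattern extends to "iterate over children of the current element" by tracking the reader's depth before yielding.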
See also:
- PHP open gzipped XML
Use XMLReader instead of the XML DOM. The DOM loads the whole file into memory, which is useless for files this size:
http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.aspx
When ç shows up as "รç", your encoding is Windows-1252 (or maybe ISO-8859-1), but not UTF-8.
First, have you tried ElementTree (either the built-in pure-Python or C versions, or, better, the lxml version)? I'm pretty sure none of them actually reads the whole file into memory.
The problem, of course, is that, whether or not it reads the whole file into memory, the resulting parsed tree ends up in memory.
ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse(xmlfile, events=('end',)):
    ...
The key here is that you can modify the tree as it's built up (by replacing the contents with a summary containing only what the parent node will need). By throwing out all the stuff you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.
The linked page gives more details, including some examples for modifying XML-RPC and plist as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)
This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.
The standard solution is SAX, which is a callback-based API that lets you operate on the tree a node at a time. You don't need to worry about truncating nodes as you do with iterparse, because the nodes don't exist after you've parsed them.
Most of the best SAX examples out there are for Java or JavaScript, but they're not too hard to figure out. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).
There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.
There are only two PHP APIs that are really suited for processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read a continuous stream rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
For an example, you might want to look at this partial parser of the DMOZ catalog:
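What follows is only a rough sketch of that approach, not the original parser: it assumes the usual DMOZ content.rdf.u8 layout (ExternalPage elements with an about attribute and d:Title/d:Description children) and a made-up dmoz_pages table, so adjust names to your actual file and schema.

<?php
// Sketch only: stream the DMOZ content file with XMLReader and insert rows
// with PDO. Element, table and column names are assumptions.
$pdo  = new PDO('mysql:host=localhost;dbname=dmoz;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO dmoz_pages (url, title, description) VALUES (?, ?, ?)');

$reader = new XMLReader();
$reader->open('compress.zlib://content.rdf.u8.gz'); // or the uncompressed file

$url = null;
$title = $description = '';

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        switch ($reader->localName) {
            case 'ExternalPage':
                $url   = $reader->getAttribute('about');
                $title = $description = '';
                break;
            case 'Title':
                $title = $reader->readString();
                break;
            case 'Description':
                $description = $reader->readString();
                break;
        }
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT
        && $reader->localName === 'ExternalPage'
        && $url !== null
    ) {
        // One row per ExternalPage; only this record is ever held in memory.
        $stmt->execute([$url, $title, $description]);
        $url = null;
    }
}

$reader->close();

Wrapping the inserts in a transaction (beginTransaction()/commit() every few thousand rows) speeds up the MySQL side considerably.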