How can one parse HTML/XML and extract information from it?
php
xml
parsing
xml-parsing
html-parsing
Answers
3
I would say SimpleXML takes the cake. First, it is an extension written in C, so it is very fast. Second, the parsed document takes the form of a PHP object, so you can "query" it like $root->myElement.
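As a rough illustration of that object-style access, here is a minimal sketch. The sample document and its element names are made up for the example; only simplexml_load_string() and the property/array access syntax come from SimpleXML itself.

```php
<?php
// Hypothetical sample document; the element and attribute names are illustrative only.
$xml = <<<XML
<root>
    <myElement id="a">first</myElement>
    <myElement id="b">second</myElement>
</root>
XML;

$root = simplexml_load_string($xml);

// Child elements are exposed as object properties.
$first  = (string) $root->myElement;           // first matching child
$second = (string) $root->myElement[1];        // index access for repeated children
$id     = (string) $root->myElement[1]['id'];  // attributes use array syntax

echo $first, ' ', $second, ' ', $id, "\n";
```

The string casts matter: without them you get SimpleXMLElement objects rather than plain strings.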
Friday, October 7, 2022
5
$data = explode("\n", $data);
$last_line = end($data);
$parts = explode("\t", $last_line);
Thursday, November 17, 2022
1
You're on the right track with XMLReader. Rather conveniently, it includes the method expand(), which will return a copy of the current node as a DOMNode. This will let you handle each individual Tree in memory with the DOM API.
As for handling nodes - evaluate and descend recursively.
Example:
$data = [
    'var1' => 1.05,
    'var2' => 0.76
];
$dom = new DOMDocument();
$xpath = new DOMXPath($dom);
$reader = new XMLReader();
$reader->open('forest.xml');
// Read until reaching the first Tree.
while ($reader->read() && $reader->localName !== 'Tree');
while ($reader->localName === 'Tree') {
    // Copy the current Tree into the DOM document so XPath can query it.
    $tree = $dom->importNode($reader->expand(), true);
    echo evaluateTree($data, $tree, $xpath), "\n";
    // Move on to the next.
    $reader->next('Tree');
}
$reader->close();

function evaluateTree(array $data, DOMElement $tree, DOMXPath $xpath)
{
    foreach ($xpath->query('./Node', $tree) as $node) {
        $field = $xpath->evaluate('string(./SimplePredicate/@field)', $node);
        $operator = $xpath->evaluate('string(./SimplePredicate/@operator)', $node);
        $value = $xpath->evaluate('string(./SimplePredicate/@value)', $node);
        if (evaluatePredicate($data[$field], $operator, $value)) {
            // Descend recursively.
            return evaluateTree($data, $node, $xpath);
        }
    }
    // Reached the end of the line.
    return $tree->getAttribute('id');
}

function evaluatePredicate($left, $operator, $right)
{
    switch ($operator) {
        case "lessOrEqual":
            return $left <= $right;
        case "greaterThan":
            return $left > $right;
        default:
            return false;
    }
}
Output:
4
Thursday, August 11, 2022
3
Try taking a look at what PHP is outputting from json_decode():
$data = $_GET['data'];
$obj = json_decode($data);
var_dump($obj);
Your code itself works fine: http://ideone.com/0jsjgT
But your query string is missing the data= before the actual JSON. This:
http://mywebsite.com?action=somefunction&{%22id%22:1,%22Name%22:%22Mike%22}
should be this:
http://mywebsite.com?action=somefunction&data={%22id%22:1,%22Name%22:%22Mike%22}
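The difference between the two URLs can be checked without a web server. The following is only a sketch: it feeds the two query strings to parse_str(), which decodes them in much the same way PHP populates $_GET. The variable names are illustrative.

```php
<?php
// Query strings taken from the two URLs above.
$withoutKey = 'action=somefunction&{%22id%22:1,%22Name%22:%22Mike%22}';
$withKey    = 'action=somefunction&data={%22id%22:1,%22Name%22:%22Mike%22}';

parse_str($withoutKey, $bad);
parse_str($withKey, $good);

// Without "data=", the JSON ends up as a parameter name, so there is no 'data' entry.
$hasData = isset($bad['data']);

// With "data=", the decoded value is valid JSON.
$obj = json_decode($good['data']);

var_dump($hasData);
var_dump($obj);
```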
Thursday, October 6, 2022
Native XML Extensions
I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.
DOM
DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.
A basic usage example can be found in Grabbing the href attribute of an A element, and a general conceptual overview can be found at DOMDocument in php.
How to use the DOM extension has been covered extensively elsewhere, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching for existing answers.
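To make the DOM approach concrete, here is a minimal sketch that loads deliberately sloppy HTML with DOMDocument::loadHTML() and pulls out href values with an XPath query. The markup is invented for the example.

```php
<?php
// Deliberately sloppy markup: the <li> and <p> elements are never closed.
$html = '<ul><li><a href="/one">One</a><li><a href="/two">Two</a></ul>'
      . '<p>unclosed paragraph';

$dom = new DOMDocument();
libxml_use_internal_errors(true);  // collect parse warnings instead of emitting them
$dom->loadHTML($html);             // uses libxml's HTML parser, which tolerates tag soup
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

Calling libxml_use_internal_errors(true) is a common way to keep the warnings about malformed markup out of the output; whether you log or discard them is up to you.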
XMLReader
XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.
A basic usage example can be found at getting all values from h1 tags using php
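For orientation, a pull-parsing loop with XMLReader might look like the sketch below. It assumes well-formed input (the sample string is made up) and collects the text of every h1 element.

```php
<?php
// Hypothetical well-formed input; XMLReader expects XML rather than tag soup.
$xml = '<html><body><h1>First</h1><p>text</p><h1>Second</h1></body></html>';

$reader = new XMLReader();
$reader->XML($xml);

$headings = [];
// Pull one node at a time; only the current node is held in memory.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'h1') {
        $headings[] = $reader->readString();
    }
}
$reader->close();

print_r($headings);
```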
XML Parser
The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
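The push style means you register callbacks and the parser calls them as it encounters tags and text. A minimal sketch, with a made-up catalog document and illustrative variable names, could look like this:

```php
<?php
// Hypothetical input document.
$xml = '<catalog><book>Alpha</book><book>Beta</book></catalog>';

$titles = [];
$inBook = false;

$parser = xml_parser_create();
// Keep element names as written; by default the parser upper-cases them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, string $name, array $attrs) use (&$inBook, &$titles) {
        if ($name === 'book') {
            $inBook = true;
            $titles[] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, string $name) use (&$inBook) {
        if ($name === 'book') {
            $inBook = false;
        }
    }
);

xml_set_character_data_handler(
    $parser,
    // Text may arrive in several chunks, so append rather than assign.
    function ($parser, string $data) use (&$inBook, &$titles) {
        if ($inBook) {
            $titles[count($titles) - 1] .= $data;
        }
    }
);

if (xml_parse($parser, $xml, true) !== 1) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```

Having to track state such as $inBook by hand is what makes this style harder to work with than a pull parser.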
SimpleXML
SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.
A basic usage example can be found at A simple program to CRUD node and node values of xml file, and there are lots of additional examples in the PHP Manual.
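The "choking" is easy to reproduce. In this sketch (sample strings invented for the example), the well-formed input parses, while the input with an unclosed tag makes simplexml_load_string() return false:

```php
<?php
libxml_use_internal_errors(true);  // collect errors instead of emitting warnings

$valid  = '<div><p>ok</p></div>';
$broken = '<div><p>not closed<br></div>';  // fine for browsers, not well-formed XML

$good = simplexml_load_string($valid);
$bad  = simplexml_load_string($broken);    // returns false on a parse failure

$errorCount = count(libxml_get_errors());
libxml_clear_errors();

var_dump($good !== false);  // parsed
var_dump($bad === false);   // rejected
var_dump($errorCount > 0);  // libxml recorded at least one error
```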
3rd Party Libraries (libxml based)
If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.
FluentDom - Repo
HtmlPageDom
phpQuery (not updated for years)
Also see: https://github.com/electrolinux/phpquery
Zend_Dom
QueryPath
fDOMDocument
sabre/xml
FluidXML
3rd-Party (not libxml-based)
The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.
PHP Simple HTML DOM Parser
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.
PHP Html Parser
Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.
Ganon
Never used it. Can't tell if it's any good.
HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like
html5lib
We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled How-To for html 5 parsing that is worth checking out.
WebServices
If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.
ScraperWiki.
Regular Expressions
Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.
HTML parsers already know the syntactical rules of HTML. With regular expressions, you have to teach those rules again for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use case.
You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job of this.
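The brittleness is easy to demonstrate. The sketch below uses a typical naive pattern (made up for this example, as is the hrefViaDom() helper) and shows it breaking on an equivalent link with an extra attribute and single quotes, while a DOM-based lookup handles both variants:

```php
<?php
// A typical brittle pattern: assumes href is the only attribute and uses double quotes.
$pattern = '/<a href="([^"]*)">/';

$original = '<a href="/page">Page</a>';
$changed  = "<a class='nav' href='/page'>Page</a>";  // same link, slightly different markup

$regexOriginal = preg_match($pattern, $original, $m1) === 1 ? $m1[1] : null;
$regexChanged  = preg_match($pattern, $changed, $m2) === 1 ? $m2[1] : null;

// Hypothetical helper: read the first link's href through the DOM extension instead.
function hrefViaDom(string $html): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $a = $dom->getElementsByTagName('a')->item(0);
    return $a instanceof DOMElement ? $a->getAttribute('href') : null;
}

$domOriginal = hrefViaDom($original);
$domChanged  = hrefViaDom($changed);

var_dump($regexOriginal, $regexChanged, $domOriginal, $domChanged);
```

The pattern could of course be extended to cover these cases, but each extension is another rule the parser already knows.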
Also see Parsing Html The Cthulhu Way
Books
If you want to spend some money, have a look at
I am not affiliated with PHP Architect or the authors.