
I'm new to DOM parsing in PHP.
I have an HTML file that I'm trying to parse. It has a bunch of DIVs like this:

<div id="interestingbox"> 
   <div id="interestingdetails" class="txtnormal">
        <div>Content1</div>
        <div>Content2</div>
   </div>
</div>

<div id="interestingbox"> 
......

I'm trying to get the contents of each of these div boxes using PHP. How can I use the DOM parser to do this?

Thanks!

 Answers

1

First, I have to point out that you can't use the same id on two different divs; that is what classes are for. Every id must be unique within the document.

Code to get the contents of the div with id="interestingbox":

$html = '
<html>
<head></head>
<body>
<div id="interestingbox"> 
   <div id="interestingdetails" class="txtnormal">
        <div>Content1</div>
        <div>Content2</div>
   </div>
</div>

<div id="interestingbox2"><a href="#">a link</a></div>
</body>
</html>';


$dom_document = new DOMDocument();

$dom_document->loadHTML($html);

//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);

// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("*/div[@id='interestingbox']");

// query() returns false for an invalid expression, never null
if ($elements !== false) {

  foreach ($elements as $element) {
    echo "\n[". $element->nodeName. "]";

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeValue. "\n";
    }

  }
}

//OUTPUT
[div]  {
        Content1
        Content2
}

Example with classes:

$html = '
<html>
<head></head>
<body>
<div class="interestingbox"> 
   <div id="interestingdetails" class="txtnormal">
        <div>Content1</div>
        <div>Content2</div>
   </div>
</div>

<div class="interestingbox"><a href="#">a link</a></div>
</body>
</html>';

//the same as before.. just change the xpath

[...]

$elements = $dom_xpath->query("*/div[@class='interestingbox']");

[...]

//OUTPUT
[div]  {
        Content1
        Content2
}

[div]  {
a link
}

Refer to the DOMXPath page for more details.
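
One caveat with the class example: [@class='interestingbox'] compares the whole attribute string, so it would miss an element written as class="interestingbox wide". A minimal sketch of a token-aware match follows. The helper name findByClass and the sample markup are my own, not part of the answer above, and the helper assumes the class name contains no quote characters.

```php
<?php
// Hypothetical helper: return all <div> elements whose class attribute
// contains $class as one of its space-separated tokens.
// Assumes $class contains no quote characters.
function findByClass(DOMDocument $doc, string $class)
{
    $xpath = new DOMXPath($doc);
    // Pad the normalized class list with spaces so that
    // "interestingbox" cannot match "interestingboxes".
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    return $xpath->query($query);
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body>
<div class="interestingbox wide">first</div>
<div class="interestingboxes">not a match</div>
<div class="txtnormal interestingbox">second</div>
</body></html>');

$matches = findByClass($doc, 'interestingbox');
foreach ($matches as $div) {
    echo trim($div->nodeValue), "\n";
}
```

With this sample, only the first and third divs should be selected.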

Sunday, September 4, 2022
2

Check whether each td has an anchor child. Select the anchor tags inside the cell with getElementsByTagName() and use the length property to check whether the selection is non-empty. If the td contains an anchor, use getAttribute() to get its href attribute.

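$dom = new DOMDocument;
$dom->loadHTML($html);
$array_data = []; // initialise the result list before appending to it
foreach ($dom->getElementsByTagName('td') as $node) {
    $nodeAnchor = $node->getElementsByTagName("a");
    // If the cell contains a link, store its href before the cell text
    if ($nodeAnchor->length) {
        $array_data[] = $nodeAnchor->item(0)->getAttribute("href");
    }
    $array_data[] = $node->nodeValue;
}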
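
To see what the loop collects, here is a small self-contained run. The table markup is a made-up sample, and the trim() call is my addition to drop surrounding whitespace.

```php
<?php
// Made-up sample input: one cell with a link, one without.
$html = '<table><tr>
<td><a href="/page1">First</a></td>
<td>Plain text</td>
</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$array_data = [];
foreach ($dom->getElementsByTagName('td') as $cell) {
    $anchors = $cell->getElementsByTagName('a');
    if ($anchors->length > 0) {
        // Store the link target before the cell text.
        $array_data[] = $anchors->item(0)->getAttribute('href');
    }
    $array_data[] = trim($cell->nodeValue);
}

print_r($array_data);
```

For this input the array should hold the href followed by the two cell texts: "/page1", "First", "Plain text".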


Sunday, September 11, 2022
 
1

This should work for the particular instance you have highlighted. It only handles the $_POST case, but that should be easy to expand.

The main part is to look at the AST for the code and identify the base of the $_POST access. This turns out to be a Node\Expr\ArrayDimFetch; inside it, check whether the variable it uses is _POST.

Once you have identified this, you can replace that node with a new one that is just a string: new Node\Scalar\String_("Hello World!").

// Assumes "use PhpParser\Node;" and "use PhpParser\NodeVisitorAbstract;",
// and that $traverser and $ast were created earlier.
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function leaveNode(Node $node) {
        if ($node instanceof Node\Expr\ArrayDimFetch
            && $node->var instanceof Node\Expr\Variable
            && $node->var->name === '_POST'
        ) {
            // Replace the whole $_POST['firstname'] access with a string literal
            return new Node\Scalar\String_("Hello World!");
        }
    }
});

$prettyPrinter = new PhpParser\PrettyPrinter\Standard;
echo $prettyPrinter->prettyPrintFile($traverser->traverse($ast));

From your original code of

<?php
$firstname=  $_POST['firstname'];

this code outputs:

<?php

$firstname = 'Hello World!';
Sunday, December 25, 2022
 
bruce_p
 
4

This is possible with a simple autoloader, and it is not hard to write:

// __autoload() was removed in PHP 8.0, so the loader is registered
// with spl_autoload_register() instead.
spl_autoload_register(function ($className) {
    $className = ltrim($className, '\\');
    $fileName  = '';
    $namespace = '';
    // Split a namespaced name into its namespace and its short class name
    if ($lastNsPos = strripos($className, '\\')) {
        $namespace = substr($className, 0, $lastNsPos);
        $className = substr($className, $lastNsPos + 1);
        $fileName  = str_replace('\\', DIRECTORY_SEPARATOR, $namespace) . DIRECTORY_SEPARATOR;
    }
    $fileName .= str_replace('_', DIRECTORY_SEPARATOR, $className) . '.php';
    // $fileName .= $className . '.php'; //sometimes you need a custom structure
    //require_once "library/class.php"; //or include a class manually
    require $fileName;
});

But sometimes you have to adjust $fileName so that it works with all libraries. It depends on the autoloading standard and on how the libraries name their classes. Sometimes you have to split the class name on _, use the first element as the directory name, and also include it in the file name. For example, I had a second library with a class like Library_Parser, but the file structure was Library/library-parser.php.
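
The adjustment described above could look roughly like this. It is only a sketch for the Library_Parser case: the function name classToPath, and the rule "first segment is the directory, the whole name lower-cased with hyphens is the file", are my reading of that one example, so other libraries will likely need different rules.

```php
<?php
// Hypothetical mapping for names like Library_Parser -> Library/library-parser.php.
function classToPath(string $className): string
{
    $parts = explode('_', $className);
    // The first segment names the directory.
    $directory = $parts[0];
    // The file name is the full class name, lower-cased, with "_" turned into "-".
    $file = strtolower(str_replace('_', '-', $className)) . '.php';
    return $directory . DIRECTORY_SEPARATOR . $file;
}

spl_autoload_register(function ($className) {
    $path = __DIR__ . DIRECTORY_SEPARATOR . classToPath($className);
    // Only load the file if it exists, so other autoloaders can still run.
    if (is_file($path)) {
        require $path;
    }
});

echo classToPath('Library_Parser'), "\n";
```

Checking is_file() before require lets this loader sit next to the generic one without raising a fatal error for classes it does not know about.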

The first library worked directly with the above code and all classes were automatically loaded.

The code was taken from http://www.sitepoint.com/autoloading-and-the-psr-0-standard/ but I had to correct some code parts (additional underscores and backslashes). I have used the PSR-0 Standard solution.

PSR-4 version by https://.com/users/1740659/thibault:

function loadPackage($dir)
{
    $composer = json_decode(file_get_contents("$dir/composer.json"), true);
    $namespaces = $composer['autoload']['psr-4'];

    // Foreach namespace specified in the composer, load the given classes
    foreach ($namespaces as $namespace => $classpaths) {
        if (!is_array($classpaths)) {
            $classpaths = array($classpaths);
        }
        spl_autoload_register(function ($classname) use ($namespace, $classpaths, $dir) {
            // Check if the namespace matches the class we are looking for
            if (preg_match("#^".preg_quote($namespace)."#", $classname)) {
                // Remove the namespace from the file path since it's psr4
                $classname = str_replace($namespace, "", $classname);
                // Turn namespace separators into directory separators
                $filename = str_replace("\\", "/", $classname).".php";
                foreach ($classpaths as $classpath) {
                    $fullpath = $dir."/".$classpath."/$filename";
                    if (file_exists($fullpath)) {
                        include_once $fullpath;
                    }
                }
            }
        });
    }
}

loadPackage(__DIR__."/vendor/project");

new CompanyName\PackageName\Test();
Friday, September 9, 2022
 
qrystal
 
2

You can define a custom function DOMinnerHTML() (described here) to retrieve an element's inner HTML, rather than its text content. It works by temporarily creating a new document for each child node:

<?php 
function DOMinnerHTML($element) 
{ 
    $innerHTML = ""; 
    $children = $element->childNodes; 
    foreach ($children as $child) 
    { 
        $tmp_dom = new DOMDocument(); 
        $tmp_dom->appendChild($tmp_dom->importNode($child, true)); 
        $innerHTML.=trim($tmp_dom->saveHTML()); 
    } 
    return $innerHTML; 
} 
?> 

Example usage:

$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('div');
foreach($divs as $div) {
    if ($div->getAttribute('class') === 'text_container') {
        $innerHtml = DOMinnerHTML($div);
        echo '<div>' . $innerHtml . '</div>';
    }
}
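
As an alternative sketch, DOMDocument::saveHTML() accepts a node argument (PHP 5.3.6 and later), so the temporary document can be avoided. The function name innerHtml and the sample markup here are illustrative, not taken from the answer above.

```php
<?php
// Serialize each child with the owning document instead of a temporary one.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="text_container"><p>Hello <b>world</b></p></div>');

foreach ($doc->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') === 'text_container') {
        echo innerHtml($div), "\n";
    }
}
```

Because the children are serialized in place, nothing has to be imported into a second document.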
Sunday, September 4, 2022