I am using the PHP Simple HTML DOM Parser to get the title, description, and images from a website. The problem is that the result includes HTML I don't want, and I don't know how to exclude those tags. The details are below.

Here is a sample of the HTML structure being parsed:

<div id="product_description">
<p> Some text</p>
<ul>
<li>value 1</li>
<li>value 2</li>
<li>value 3</li>
</ul>

<!-- the div I don't want -->
<div id="comments">
<h1> Some Text </h1>
</div>

</div>

I am using the PHP script below to parse it:

foreach($html->find('div#product_description') as $description)
{
    echo $description->outertext ;
    echo "<br>";
}

The code above outputs everything inside the div with id "product_description". I want to exclude the div with id "comments". I tried converting the result to a string and using substr to drop the last character, but that is not working and I don't know why. How can I do this? Any approach that lets me exclude that div from the parsed HTML will work. Thanks.

 Answers

2

You can remove the elements you don't want by setting their outertext = '':

$src =<<<src
<div id="product_description">
    <p> Some text</p>
    <ul>
        <li>value 1</li>
        <li>value 2</li>
        <li>value 3</li>
    </ul>

    <!-- the div I don't want -->
    <div id="comments">
        <h1> Some Text </h1>
    </div>

</div>
src;

$html = str_get_html($src);

foreach($html->find('#product_description') as $description)
{
    // find() with an index returns null when nothing matches, so check first
    $comments = $description->find('#comments', 0);
    if ($comments) {
        $comments->outertext = '';
    }
    print $description->outertext;
}
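For comparison, the same removal can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM. This is a swapped-in technique, not what the answer above uses. The helper name removeById and the shortened sample markup are made up for illustration, and the ids are assumed to contain no quote characters.

```php
<?php
// Sketch with PHP's built-in DOM extension, not Simple HTML DOM.
// removeById() is a hypothetical helper name.
function removeById(string $src, string $keepId, string $dropId): string
{
    $doc = new DOMDocument();
    // Silence libxml notices about loosely formed markup
    $doc->loadHTML($src, LIBXML_NOERROR | LIBXML_NOWARNING);
    $xpath = new DOMXPath($doc);

    // Locate the container we want to keep
    $container = $xpath->query("//*[@id='$keepId']")->item(0);
    if ($container === null) {
        return '';
    }

    // Copy the matches to an array, then detach each one from its parent
    $unwanted = iterator_to_array($xpath->query(".//*[@id='$dropId']", $container));
    foreach ($unwanted as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the container, without the html/body wrapper
    return $doc->saveHTML($container);
}

$src = '<div id="product_description"><p>Some text</p>'
     . '<div id="comments"><h1>Some Text</h1></div></div>';
$clean = removeById($src, 'product_description', 'comments');
echo $clean, "\n";
```

The DOM extension ships with PHP, so this variant needs no extra library, at the cost of writing XPath instead of CSS-style selectors.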
Tuesday, December 20, 2022
 
manjula
 
1

Did you read the documentation on reading and modifying attributes? According to it:

// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $e->href;

// Set an attribute
$e->href = 'ursitename'.$value;
Friday, October 7, 2022
 
kaedys
 
3

Proceed step by step...

Start by getting all the wanted info from one page (the first, for example). The idea is to:

  • Get all phone blocks: $phones = $html->find('a[data-id]');
  • In a loop, get the wanted info (name, price) from each block
  • Insert this info into the db (I can't help with the db part since I haven't used one for a while, but you can do this on your own; it's not that hard)

Now that you have the code working for one page, let's try to make it work for all pages knowing that:

  • All pages have the same structure, so we can extract data with the same method/code above
  • The link of the next page to scrape is included in the Next button, so we'll stop when this link cannot be found

So here's the code summarizing everything said above:

$url = "https://www.varle.lt/mobilieji-telefonai/";

// Start from the main page
$nextLink = $url;

// Loop on each next link as long as it exists
while ($nextLink) {
    echo "<hr>nextLink: $nextLink<br>";
    //Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a url
    $html->load_file($nextLink);

    /////////////////////////////////////////////////////////////
    /// Get phone blocks and extract info (also insert to db) ///
    /////////////////////////////////////////////////////////////
    $phones = $html->find('a[data-id]');

    foreach($phones as $phone) {
        // Get the link
        $linkas = $phone->href;

        // Get the name
        $pavadinimas = $phone->find('span[class=inner]', 0)->plaintext;

        // Get the price and extract the useful part using regex
        $kaina = $phone->find('span[class=price]', 0)->plaintext;
        // This captures the integer part of decimal numbers: in "123,45" it captures "123". Use @([\d,]+),?@ to capture the decimal part too
        preg_match('@(\d+),?@', $kaina, $matches);
        $kaina = $matches[1];

        echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

        // INSERT INTO DB HERE
        // CODE
        // ...
    }
    /////////////////////////////////////////////////////////////
    /////////////////////////////////////////////////////////////

    // Extract the next link, if not found return NULL
    $nextLink = ( ($temp = $html->find('div.pagination a[class="next"]', 0)) ? "https://www.varle.lt".$temp->href : NULL );

    // Clear DOM object
    $html->clear();
    unset($html);
}
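
The price-parsing step in the loop can be tried on its own. Below is a minimal sketch that wraps the same regex in a helper and returns null when nothing matches, a case the loop above does not handle. The name extractIntegerPrice and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: returns the integer part of a price such as "123,45",
// or null when the text contains no digits.
function extractIntegerPrice(string $text): ?string
{
    // (\d+) captures the first run of digits; the optional comma is matched
    // but not captured, so the decimals are left out
    if (preg_match('@(\d+),?@', $text, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

var_dump(extractIntegerPrice('123,45 Lt')); // string(3) "123"
var_dump(extractIntegerPrice('n/a'));       // NULL
```

Checking the return value of preg_match avoids reading an undefined $matches[1] when a block has no price text.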

Output

nextLink: https://www.varle.lt/mobilieji-telefonai/
Samsung Phone I9300 Galaxy SIII Juodas #----# 1099 #----# https://www.varle.lt/mobilieji-telefonai/samsung-phone-i9300-galaxy-siii-juodas.html
Samsung Galaxy S2 Plus I9105 Pilkai mėlynas #----# 739 #----# https://www.varle.lt/mobilieji-telefonai/samsung-galaxy-s2-plus-i9105-pilkai-melynas.html
Samsung Phone S7562 Galaxy S Duos baltas #----# 555 #----# https://www.varle.lt/mobilieji-telefonai/samsung-phone-s7562-galaxy-s-duos-baltas--457135.html
...

nextLink: https://www.varle.lt/mobilieji-telefonai/?p=2
LG T375 Mobile Phone Black #----# 218 #----# https://www.varle.lt/mobilieji-telefonai/lg-t375-mobile-phone-black.html
Samsung S6802 Galaxy Ace Duos black #----# 579 #----# https://www.varle.lt/mobilieji-telefonai/samsung-s6802-galaxy-ace-duos-black.html
Mobilus telefonas Samsung Galaxy Ace Onyx Black | S5830 #----# 559 #----# https://www.varle.lt/mobilieji-telefonai/mobilus-telefonas-samsung-galaxy-ace-onyx-black.html
...

...
...

Note that parsing all the pages may take a while, so PHP may stop with the error "Fatal error: Maximum execution time of 30 seconds exceeded ...". If so, simply extend the maximum execution time like this:

ini_set('max_execution_time', 300); //300 seconds = 5 minutes
Tuesday, November 1, 2022
4

Two problems, it looks like. First off, I believe what you want is:

$b = $html->find('object', 0);

Per the docs, this is how to find the first instance of the <object> tag.

Your second problem, though, is that the $html does not return any code with <object> tags - the code block you're searching for is not there.

If what you're looking for is the http://95.211.193.83:8777/k6ceg4duoo4pcnokaktshz6a5e6vcg4hy7lzvwxjtd7ddxdeooarqi7uci/v.mp4 value, it's embedded in the <script> tags in the header, so try:

$b = $html->find('script');

Then loop through the array that $b returns until you get what you're looking for.
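
The loop described above might look like the following sketch. It assumes the script bodies have already been collected as plain strings (for example from each element's innertext). The helper name findMp4Url, the regex, and the sample data are illustrative assumptions, not taken from the page in question.

```php
<?php
// Hypothetical helper: scan script bodies and return the first .mp4 URL found.
function findMp4Url(array $scriptBodies): ?string
{
    foreach ($scriptBodies as $body) {
        // Match http(s) URLs ending in .mp4, stopping at quotes, spaces, or angle brackets
        if (preg_match('~https?://[^\s"\'<>]+\.mp4~', $body, $m) === 1) {
            return $m[0];
        }
    }
    return null; // no script contained a matching URL
}

// Made-up sample data standing in for the contents of the <script> tags
$scripts = [
    'var ready = true;',
    'player.setup({ file: "http://example.com/media/v.mp4" });',
];
echo findMp4Url($scripts), "\n"; // http://example.com/media/v.mp4
```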

Sunday, October 16, 2022
5

You should probably use simplehtmldom for the reason you mentioned, and because strip_tags may also leave you with non-text content such as JavaScript or CSS from inside script/style blocks.

You would also be able to filter out text from elements that aren't displayed (inline style=display:none).

That said, if the html is simple enough, then strip_tags may be faster and will accomplish the same task
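
To illustrate the point about script/style blocks, here is a small sketch. strip_tags removes the tags but keeps the text between them, so the JavaScript and CSS source survive. The helper visibleText is a made-up name, and its regex is only a rough approximation that assumes simple, well-formed markup; it does not handle elements hidden with display:none.

```php
<?php
// Hypothetical helper: drop <script>/<style> blocks, then strip the remaining tags.
function visibleText(string $html): string
{
    // Remove each script or style element together with its contents
    $withoutCode = preg_replace('~<(script|style)\b[^>]*>.*?</\1\s*>~is', '', $html);
    return trim(strip_tags($withoutCode));
}

$html = '<p>Hello</p><script>var x = 1;</script><style>p { color: red; }</style>';

echo strip_tags($html), "\n";  // still contains the script and style text
echo visibleText($html), "\n"; // Hello
```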

Wednesday, September 28, 2022
 