How can I parse the background images and other images of a web page with PHP (Simple HTML DOM, etc.)?

case 1: inline css

<div id="id100" style="background:url(/mycar1.jpg)"></div>

case 2: css inside html page

<div id="id100"></div>

<style type="text/css">
#id100{
background:url(/mycar1.jpg);
}
</style>

case 3: separate css file

<div id="id100"></div>

external.css

#id100{
background:url(/mycar1.jpg);
}

case 4: image inside img tag

Solution to case 4, as it appears in the PHP Simple HTML DOM Parser documentation:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

Please help me parse cases 1, 2 and 3.

If there are more cases, please list them, with a solution if you can.

Thanks

 Answers

5

For Case 1:

// Create DOM from URL or file 
$html = file_get_html('http://www.google.com/');

// Get the style attribute for the item
$style = $html->getElementById("id100")->getAttribute('style');

// $style = background:url(/mycar1.jpg)
// You would now need to put it into a css parser or do some regular expression magic to get the values you need.
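
As a hedged sketch of the "regular expression magic" mentioned above: the helper below, extract_css_urls() (a hypothetical name, not part of Simple HTML DOM), pulls every url(...) value out of a CSS string using only PHP's built-in preg_match_all(). It assumes simple url() values without escaped quotes or parentheses; a real CSS parser is more robust.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM):
// returns every url(...) target found in a CSS string.
function extract_css_urls(string $css): array
{
    // Matches url(x), url('x') and url("x").
    // Group 1 is the optional quote character, group 2 is the URL itself.
    preg_match_all('/url\(\s*([\'"]?)(.*?)\1\s*\)/i', $css, $matches);
    return $matches[2];
}

// The style attribute value from case 1
print_r(extract_css_urls('background:url(/mycar1.jpg)'));
```

The same helper can be applied to the text of a style block or to a downloaded stylesheet, since it only looks at the CSS text.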

For Case 2/3:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Get the Style element
$style = $html->find('head',0)->find('style');

// $style now contains an array of the <style> elements within the head.
// Case 2: pass the innertext of each <style> element to a CSS parser.
// Case 3: an external stylesheet is not a <style> element. Use an attribute selector to find the <link rel="stylesheet"> elements, download the file named in each href attribute, and parse it with a CSS parser.
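
A minimal sketch of collecting the CSS sources for cases 2 and 3. Note that it uses PHP's built-in DOMDocument rather than Simple HTML DOM, only so that the example runs on its own; collect_css_sources() is a hypothetical helper name, and the sample markup (including the link to external.css) is an assumption based on the question.

```php
<?php
// Hypothetical helper: returns the text of every <style> block ("inline")
// and the href of every <link rel="stylesheet"> ("external").
function collect_css_sources(string $html): array
{
    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = ['inline' => [], 'external' => []];

    // Case 2: CSS written directly inside the page
    foreach ($doc->getElementsByTagName('style') as $style) {
        $sources['inline'][] = trim($style->textContent);
    }

    // Case 3: CSS in separate files, referenced by <link rel="stylesheet" href="...">
    foreach ($doc->getElementsByTagName('link') as $link) {
        $rel = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        if (in_array('stylesheet', $rel, true) && $link->getAttribute('href') !== '') {
            $sources['external'][] = $link->getAttribute('href');
        }
    }

    return $sources;
}

// Sample page (assumed markup)
$page = '<html><head><link rel="stylesheet" href="external.css">'
      . '<style type="text/css">#id100{background:url(/mycar1.jpg);}</style>'
      . '</head><body><div id="id100"></div></body></html>';

print_r(collect_css_sources($page));
```

Each external href still has to be resolved against the page URL and downloaded before its contents can be searched for url(...) values.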
Wednesday, December 7, 2022
5
<?php

ini_set('display_errors', 1);
include_once('simple_html_dom.php');

$html = file_get_html('http://www.dsebd.org/dseX_share.php');
$table = $html->find('table', 5);

$rowData = array();

foreach ($table->find('tr') as $row)
{
    // Collect the cells of this row; reset once per row, not once per cell
    $flight = array();
    foreach ($row->find('td') as $cell)
    {
        // Strip any leftover closing tags, then split the text on runs of &nbsp;
        $text = preg_replace('/<\/\w+>/', '', $cell->plaintext);
        $flight[] = array_map('trim', preg_split('@(&nbsp;)+@', $text));
    }
    $rowData = array_merge($rowData, $flight);
}
print '<pre>';
print_r($rowData);
print '</pre>';
?>
Friday, November 25, 2022
2

You can remove the elements you don't want by setting their outertext = '':

$src =<<<src
<div id="product_description">
    <p> Some text</p>
    <ul>
        <li>value 1</li>
        <li>value 2</li>
        <li>value 3</li>
    </ul>

    <!-- the div I don't want -->
    <div id="comments">
        <h1> Some Text </h1>
    </div>

</div>
src;

$html = str_get_html($src);

foreach($html->find('#product_description') as $description)
{
    $comments = $description->find('#comments', 0); 
    $comments->outertext = ''; 
    print $description->outertext ;
}
Tuesday, December 20, 2022
 
manjula
 
4

Did you try it? Try this example (Sample: adding data tags).

include 'simple_html_dom.php';

$html_string = '
<style>.myelems{color:green}</style>
<div>
    <p class="myelems">text inside 1</p>
    <p class="myelems">text inside 2</p>
    <p class="myelems">text inside 3</p>
    <p>simple text 1</p>
    <p>simple text 2</p>
</div>
';

$html = str_get_html($html_string);
foreach($html->find('div p[class="myelems"]') as $key => $p_tags) {
    $p_tags->{'data-index'} = $key;
}

echo htmlentities($html);

Output:

<style>.myelems{color:green}</style> 
<div> 
    <p class="myelems" data-index="0">text inside 1</p> 
    <p class="myelems" data-index="1">text inside 2</p> 
    <p class="myelems" data-index="2">text inside 3</p> 
    <p>simple text 1</p> 
    <p>simple text 2</p> 
</div>
Saturday, November 12, 2022
3

Proceed step by step...

Start by getting all the information you want from one page (the first, for example). The idea is to:

  • Get all phone blocks: $phones = $html->find('a[data-id]');
  • In a loop, get the information you want (name, price) from each block
  • Insert this information into the database (I can't help with the database part since I haven't used one in a while, but it's not that hard to do on your own)

Now that you have the code working for one page, let's try to make it work for all pages knowing that:

  • All pages have the same structure, so we can extract data with the same method/code above
  • The link of the next page to scrape is included in the Next button, so we'll stop when this link cannot be found

Here is code that puts together everything described above:

$url = "https://www.varle.lt/mobilieji-telefonai/";

// Start from the main page
$nextLink = $url;

// Loop over each next link as long as it exists
while ($nextLink) {
    echo "<hr>nextLink: $nextLink<br>";
    //Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a url
    $html->load_file($nextLink);

    /////////////////////////////////////////////////////////////
    /// Get phone blocks and extract info (also insert to db) ///
    /////////////////////////////////////////////////////////////
    $phones = $html->find('a[data-id]');

    foreach($phones as $phone) {
        // Get the link
        $linkas = $phone->href;

        // Get the name
        $pavadinimas = $phone->find('span[class=inner]', 0)->plaintext;

        // Get the price and extract the useful part using a regex
        $kaina = $phone->find('span[class=price]', 0)->plaintext;
        // This captures the integer part of a decimal number: in "123,45" it captures "123". Use @([\d,]+),?@ to capture the decimal part too
        preg_match('@(\d+),?@', $kaina, $matches);
        $kaina = $matches[1];

        echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

        // INSERT INTO DB HERE
        // CODE
        // ...
    }
    /////////////////////////////////////////////////////////////
    /////////////////////////////////////////////////////////////

    // Extract the next link, if not found return NULL
    $nextLink = ( ($temp = $html->find('div.pagination a[class="next"]', 0)) ? "https://www.varle.lt".$temp->href : NULL );

    // Clear DOM object
    $html->clear();
    unset($html);
}
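
The "INSERT INTO DB HERE" placeholder in the loop above could be filled in roughly as follows. This is only a sketch under stated assumptions: it uses PDO with an in-memory SQLite database so that it runs on its own, and the table name phones, its columns, and the helper save_phone() are all hypothetical, not from the original answer. Swap the DSN for your own database.

```php
<?php
// In-memory SQLite keeps the sketch self-contained (requires the pdo_sqlite extension).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table layout for the scraped data
$db->exec('CREATE TABLE phones (name TEXT, price INTEGER, link TEXT)');

// Hypothetical helper: stores one scraped phone using a prepared statement,
// so that scraped text is never concatenated into the SQL string.
function save_phone(PDO $db, string $name, int $price, string $link): void
{
    $stmt = $db->prepare('INSERT INTO phones (name, price, link) VALUES (?, ?, ?)');
    $stmt->execute([$name, $price, $link]);
}

// Inside the foreach loop this would be:
//     save_phone($db, $pavadinimas, (int) $kaina, $linkas);
save_phone(
    $db,
    'Samsung Phone I9300 Galaxy SIII Juodas',
    1099,
    'https://www.varle.lt/mobilieji-telefonai/samsung-phone-i9300-galaxy-siii-juodas.html'
);

echo $db->query('SELECT COUNT(*) FROM phones')->fetchColumn(), " row(s) stored\n";
```

Create the connection once before the while loop and reuse it for every page, rather than reconnecting for each phone.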

Output

nextLink: https://www.varle.lt/mobilieji-telefonai/
Samsung Phone I9300 Galaxy SIII Juodas #----# 1099 #----# https://www.varle.lt/mobilieji-telefonai/samsung-phone-i9300-galaxy-siii-juodas.html
Samsung Galaxy S2 Plus I9105 Pilkai mėlynas #----# 739 #----# https://www.varle.lt/mobilieji-telefonai/samsung-galaxy-s2-plus-i9105-pilkai-melynas.html
Samsung Phone S7562 Galaxy S Duos baltas #----# 555 #----# https://www.varle.lt/mobilieji-telefonai/samsung-phone-s7562-galaxy-s-duos-baltas--457135.html
...

nextLink: https://www.varle.lt/mobilieji-telefonai/?p=2
LG T375 Mobile Phone Black #----# 218 #----# https://www.varle.lt/mobilieji-telefonai/lg-t375-mobile-phone-black.html
Samsung S6802 Galaxy Ace Duos black #----# 579 #----# https://www.varle.lt/mobilieji-telefonai/samsung-s6802-galaxy-ace-duos-black.html
Mobilus telefonas Samsung Galaxy Ace Onyx Black | S5830 #----# 559 #----# https://www.varle.lt/mobilieji-telefonai/mobilus-telefonas-samsung-galaxy-ace-onyx-black.html
...

...
...

Note that the code may take a while to parse all the pages, so PHP may stop with the error "Fatal error: Maximum execution time of 30 seconds exceeded ...". If that happens, extend the maximum execution time like this:

ini_set('max_execution_time', 300); //300 seconds = 5 minutes
Tuesday, November 1, 2022