
I am new to scraping and have scraped two websites before. The problem appeared when I tried to scrape dynamically loaded websites: when a site's content is rendered with JavaScript, I am unable to scrape it.

Is there any way I can scrape the contents of such a website using PHP cURL or any other PHP-based client?

This is what I have done so far:

$link = "https://www.glassdoor.com/Job/jobs.htm?suggestCount=0&suggestChosen=false&clickSource=searchBtn&typedKeyword=android+developer&sc.keyword=android+developer&locT=N&locId=192&jobType=";

$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch,CURLOPT_URL,$link);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13");
$data = curl_exec($ch);


$document = new DOMDocument();
libxml_use_internal_errors(true);
$document->loadHTML($data);
$elements = $document->getElementsByTagName("div");

foreach ($elements as $element) {
    echo $element->nodeValue . "<br>";
}

 Answers

5

You need a headless browser for this. You can use the PHP wrapper for PhantomJS; here is the link: http://jonnnnyw.github.io/php-phantomjs/. This will solve your problem. It has the following features:

  • Load webpages through the PhantomJS headless browser
  • View detailed response data including page content, headers, status code etc.
  • Handle redirects
  • View JavaScript console errors
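
A minimal sketch of fetching the rendered HTML with that wrapper, assuming it was installed through Composer (package jonnyw/php-phantomjs) and that Composer's autoloader is required elsewhere. The helper name fetchRenderedHtml and the status check are illustrative, not part of the library.

```php
<?php
use JonnyW\PhantomJs\Client;

/**
 * Hypothetical helper: returns the page HTML after PhantomJS has run the
 * page's JavaScript, or null if the response status is not 200.
 * Assumes vendor/autoload.php has already been required.
 */
function fetchRenderedHtml(string $url): ?string
{
    $client = Client::getInstance();

    $request  = $client->getMessageFactory()->createRequest($url, 'GET');
    $response = $client->getMessageFactory()->createResponse();

    // Runs the request through the PhantomJS binary and fills $response.
    $client->send($request, $response);

    return $response->getStatus() === 200 ? $response->getContent() : null;
}
```

The returned string can then be passed to DOMDocument::loadHTML() in place of the curl_exec() result from the question.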

Hope this helps.

Saturday, September 24, 2022
1

What version of PHP are you using? PHP 5.5 introduced the cURL option CURLOPT_SAFE_UPLOAD, which started defaulting to true in PHP 5.6.0. When it is true, file uploads using @/path/to/file are disabled. So on PHP 5.6 you have to set it to false to allow that kind of upload (PHP 7.0 and later no longer allow disabling it):

curl_setopt($request, CURLOPT_SAFE_UPLOAD, false);

But the @/path/to/file format for uploads is deprecated as of PHP 5.5.0; you should use the CURLFile class instead:

$request = curl_init();
$file_path = $path.$name;
curl_setopt($request, CURLOPT_URL, 'http://localhost/pushUploadedFile.php');
curl_setopt($request, CURLOPT_POST, true);
// An array value makes cURL send multipart/form-data; CURLFile marks the entry as a file upload.
curl_setopt(
    $request,
    CURLOPT_POSTFIELDS,
    array(
        'file' => new CURLFile($file_path),
        'test' => 'rahul'
    )
);
curl_setopt($request, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($request);
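
CURLFile also accepts an optional MIME type and the filename reported to the server. A small sketch, using a temporary file as a stand-in for $path.$name; the file contents and the name hello.txt are made up for illustration.

```php
<?php
// Create a throwaway file so the sketch runs on its own.
$file_path = tempnam(sys_get_temp_dir(), 'upl');
file_put_contents($file_path, "hello");

// Arguments: path on disk, MIME type, filename reported to the server.
$file = new CURLFile($file_path, 'text/plain', 'hello.txt');

// This array is what would be passed to CURLOPT_POSTFIELDS.
$fields = array(
    'file' => $file,
    'test' => 'rahul',
);

echo $file->getPostFilename(), " (", $file->getMimeType(), ")\n";

unlink($file_path);
```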
Wednesday, October 26, 2022
 
4

This is what you want: remove CURLOPT_POSTFIELDS altogether, replace CURLOPT_CUSTOMREQUEST=>'PUT' with CURLOPT_UPLOAD=>1, replace 'r' with 'rb', and use CURLOPT_INFILE (you're supposed to use INFILE instead of POSTFIELDS here):

$fp = fopen($_FILES['file']['tmp_name'], "rb");
curl_setopt_array($ch, array(
    CURLOPT_UPLOAD     => 1,
    CURLOPT_INFILE     => $fp,
    CURLOPT_INFILESIZE => $_FILES['file']['size']
));
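
For context, the same options can be wrapped in a small function. This is only a sketch: the helper name preparePutUpload is hypothetical, and it takes a plain file path instead of $_FILES so that it can run outside a web request.

```php
<?php
/**
 * Hypothetical helper: prepares a cURL handle that streams $path as the
 * body of a PUT request. Returns the cURL handle and the file handle so
 * the caller can close both after curl_exec().
 */
function preparePutUpload(string $url, string $path): array
{
    $fp = fopen($path, 'rb');             // binary mode
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_UPLOAD         => true,   // for HTTP this sends a PUT
        CURLOPT_INFILE         => $fp,    // stream the body is read from
        CURLOPT_INFILESIZE     => filesize($path),
        CURLOPT_RETURNTRANSFER => true,
    ));

    return array($ch, $fp);
}
```

The caller would then run curl_exec($ch), followed by curl_close($ch) and fclose($fp).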

"This works when I had $file = fopen($temp_name, 'r');"

Never use the r mode; always use the rb mode (the b stands for "binary mode"). Weird things happen if you ever use the r mode on Windows, where r is short for "text mode". If you actually want text mode, use rt (and unless you really know what you're doing, you don't want text mode, ever; it's unfortunate that it's the default).

"but the file uploaded was a weird file. (...) This is what the file looks like when the person at the other end of this API tries to open it."

Well, you gave CURLOPT_POSTFIELDS a resource. CURLOPT_POSTFIELDS accepts two kinds of arguments: an array (for multipart/form-data requests) or a string (for specifying the raw POST body). It does not accept resources.

If the PHP cURL API were well designed, you would get an InvalidArgumentException or a TypeError when giving CURLOPT_POSTFIELDS a resource. But it's not well designed. Instead, curl_setopt implicitly cast your resource to a string, hence "Resource id #X". It's the same as doing:

curl_setopt($request, CURLOPT_POSTFIELDS, (string) fopen(...));
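
The cast can be observed in isolation. A minimal sketch, using an in-memory stream so that no real file is needed:

```php
<?php
$fp = fopen('php://memory', 'rb');

// Casting a stream resource to string yields its id, not the stream's contents.
$asString = (string) $fp;
echo $asString, "\n";   // "Resource id #N", where N varies from run to run

fclose($fp);
```

To send the file's contents as the raw body through CURLOPT_POSTFIELDS, the data would have to be read into a string first, for example with file_get_contents().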
Thursday, August 25, 2022
 
5

Build an array structure from the XML using DOM + XPath and serialize it to JSON. XPath expressions allow you to fetch node lists and scalar values from the DOM using location paths and conditions.

For example:

  • any Account
    /eExact/Accounts/Account
  • that has the status C
    /eExact/Accounts/Account[@status="C"]
  • just the last node
    (/eExact/Accounts/Account[@status="C"])[last()]

The / at the start of the location path anchors it to the document; otherwise the expression is evaluated against the current context node (the second argument of DOMXPath::evaluate()).

For example:

  • any Email child element
    Email
  • cast first found Email to string
    string(Email)

Demo:

$document = new DOMDocument();
$document->loadXML(getXML());
$xpath = new DOMXPath($document);

$json = [];
foreach ($xpath->evaluate('(/eExact/Accounts/Account[@status="C"])[last()]') as $account) {
    $json['email'] = $xpath->evaluate('string(Email)', $account);
    $json['username'] = $xpath->evaluate('string(Name)', $account);
    $json['first_name'] = $xpath->evaluate('string(Contact[@default="1"]/FirstName)', $account);
    $json['last_name'] = $xpath->evaluate('string(Contact[@default="1"]/LastName)', $account);
    $json['billing'] = [
      'first_name' => $xpath->evaluate('string(Contact[@default="1"]/FirstName)', $account),
      'last_name' => $xpath->evaluate('string(Contact[@default="1"]/LastName)', $account),
      'postcode'=> $xpath->evaluate('string(Address[@default="1"]/PostalCode)', $account),
    ];
    // ...
}
echo json_encode($json, JSON_PRETTY_PRINT);

Output:

{
    "email": "example@example.com",
    "username": "BOB",
    "first_name": "BOB",
    "last_name": "BOB",
    "billing": {
        "first_name": "BOB",
        "last_name": "BOB",
        "postcode": "1000000"
    }
}
Wednesday, August 17, 2022
 
4

Some notes:

  • Running php -i is good. It shows you the php.ini used, so that you know which file to edit.
  • Running curl -v is not needed, because that's the standalone curl for usage on the CLI and unrelated to the PHP Extension curl.
  • You checked for php5-curl, that's the needed package. Ok.

What's missing? You need to make sure the extension is also loaded by PHP!

Edit your /etc/php/5.6/cli/php.ini, search for extension, look for the curl entry and enable it. On Linux the line is typically extension=curl.so (php_curl.dll is the Windows name).

Then run php -m on the CLI to see the list of loaded modules and ensure that curl is loaded.
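
The same check can be done from a script, which helps when the CLI and the web server load different php.ini files. A small sketch using only built-in functions; run it with the same PHP binary that runs Composer.

```php
<?php
// Which php.ini did this PHP binary load? (false if none was loaded)
$ini = php_ini_loaded_file();
echo "Loaded php.ini: ", ($ini === false ? "(none)" : $ini), "\n";

// Is the curl extension active for this binary?
$hasCurl = extension_loaded('curl');
echo "curl extension: ", ($hasCurl ? "loaded" : "missing"), "\n";
```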

Finally, re-run your composer install.

Friday, November 25, 2022
 