I'm teaching myself some basic scraping and I've found that sometimes the URLs that I feed into my code return 404, which gums up all the rest of my code.

So I need a test at the top of the code to check if the URL returns 404 or not.

This would seem like a pretty straightforward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.

One blog recommended I use this:

$valid = @fsockopen($url, 80, $errno, $errstr, 30);

and then test to see if $valid is empty or not.

But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.

I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.

Suggestions? And what's this about curl?

 Answers

4

If you are using PHP's curl bindings, you can check the HTTP status code using curl_getinfo, like this:

$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);

/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
    /* Handle 404 here. */
}

curl_close($handle);

/* Handle $response here. */
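
Since the question mentions a redirect, here is a minimal sketch of the same check that asks cURL to follow redirects first, so the status code belongs to the final page rather than to the 301/302 in front of it. The function names final_status_code and classify_status, the redirect limit of 5, and the 30-second timeout are assumptions made for illustration; they are not part of the answer above.

```php
<?php
/**
 * Hypothetical helper: returns the HTTP status code of the final response
 * for $url after following redirects, or 0 if the request itself failed
 * (DNS error, timeout, refused connection, too many redirects).
 */
function final_status_code(string $url, int $timeout = 30): int
{
    $handle = curl_init($url);
    if ($handle === false) {
        return 0;
    }
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // do not echo the body
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // follow 3xx redirects
    curl_setopt($handle, CURLOPT_MAXREDIRS, 5);         // assumed limit
    curl_setopt($handle, CURLOPT_TIMEOUT, $timeout);

    $response = curl_exec($handle);
    if ($response === false) {
        return 0;
    }
    return (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
}

/** Hypothetical helper: turns a status code into a coarse label. */
function classify_status(int $code): string
{
    if ($code === 0) {
        return 'unreachable';
    }
    if ($code === 404) {
        return 'not found';
    }
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    return 'other';
}
```

A scraper could then call classify_status(final_status_code($url)) at the top of its loop and skip any URL whose label is not 'ok'.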
Wednesday, November 2, 2022
3

Your problem has nothing to do with what you include: It's that your page is too small.

In my experience, Chrome's built-in "Oops" page is displayed, like the one in Internet Explorer, when the page sends only the 404 header and fewer than a certain number of bytes of content (I think it's 512 bytes in IE, don't know the limit in Chrome).

I tend to pad my 404 pages with a few hundred bytes of meaningless content wrapped in HTML comments to make sure the custom 404 page is displayed.

Or of course, use the opportunity for some cool ASCII Art!

   <!--                            oooo   ooo
                                   $   $  $   $
                                   "o  $ $  o""
                                     o  "   "ooooo
                                 oo ""           o$
                   o            o            oo  "
                  $$             $o$""$o  ooo$
                  $"$          o $    "$  $
        o$o       $ "$         $ $     $ $
         $$$o     $$ "$       o$ $     o $o
         "$ ""o   "$   "o     $$ "o     o"
     $o   $$   "o  $     "oo  $"  $   o$"
      "$   $o    "o$$       "o$    $o$" oo$
        "o "$o                     "$o $"$$
          "        oo$$$$$$$oo        $oo$$""      o" o
 """""""""      o$$$$$$$$$$$"$o             o"""$o$$  o$
       ooo$$$"o$$$$$$$$$$$$$$ "$o    o   o$$$o   $ $ o$
    o$$$$$$$$$$$$$$$$$$$$$$$$    "oo  o      ""o  "$ $
   $$$$$$$$$$$$$$$$$$$$$$$$$$      "$o   o$$"""$     " oo""o
o""""$$$$$$$$$$$$$$$$$$$$$$"         ""$o"$o          "   o$
     "$$$$$$$$""""$""$$$""              "$oo$""$o     o$"""
      $$$$$$$"                           $""""$"  o""""
       $"""""$ooooo        ooooo$$$$$$$     o$" o"
        $     """" oooo$$$$$$$$$$$$$$"     $"  o"
      oo$   oooo$$$$$$$$$$$"""""$$$$"    o$" o$"
    "$ $o$$$$$$$$$$$$$""$     o$$$"oooo$"  o"
      "o$ "$$$$$$$$$$$$         $$o$"$$$   $"
        "$  ""$$$$$$$$$        o$"$$$ "$$o$$
          "o   ""$$$$$$o     o$$$$ ""$o """$
            "$o    ""$$$$$$o"  o$$$$oo o$$$$
               ""$oo     $$" "$$$"" ooooooo$
                    """"$"  o$"   oo$$$$$""$$
                       $ oo$"  o$$$$$""  ooo$
                       $o$"  o$$$$"  oo$$$$$$$o
                        $$ o$$$"  o$$$$$$"""""$o
                         "o$$"  o$$$$""  o$$$$$$$o
                           "$oo$$$$"  o$$$$$""" o$o
                             "$$$" oo$$$"" oo$$$$$$$
                          ooooo$oo$$$"" oo$$$$"""$$""
                         $"oooo $$$" o$$$$""      $
                       o$"o$   $$"oo$$""       " o$
                       $ o$$o  $$o$$"          oo$$
                       $ $$$$  $$$$$$$$$$$$$$$$$$$$
                       $ $$$$  $$$$$$$$$$$$$$$$$$$$"
                       $ $$$$  $$$$$$$$$$$$$$$$$$$$
                       $ ""    ""$$$$$$$$$"""$""""
                       $o         $"$"    " $"
                        $o       $$  $o    o$
                         "$o   o$$    ""$$$"
                           """"""  -->
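
For the less artistic, the padding can also be generated. This is only a sketch: the function name pad_error_page is made up, and the 512-byte default is the IE figure guessed at above, not a documented limit for Chrome.

```php
<?php
/**
 * Hypothetical helper: appends an HTML comment to $html so that the result
 * is at least $minBytes long. Pages that are already long enough are
 * returned unchanged.
 */
function pad_error_page(string $html, int $minBytes = 512): string
{
    $missing = $minBytes - strlen($html);
    if ($missing <= 0) {
        return $html;
    }
    // The comment delimiters add a few extra bytes, so the result ends up
    // slightly above the threshold, which is harmless here.
    return $html . '<!-- ' . str_repeat('x', $missing) . ' -->';
}
```

Echoing pad_error_page($body) instead of $body in the custom 404 script should be enough to get past the size check, if the threshold assumption holds.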
Thursday, December 22, 2022
 
keysl
 
3

There is a library that has been proposed for inclusion in Boost and lets you parse HTTP URIs easily. It uses Boost.Spirit and is also released under the Boost Software License. The library is cpp-netlib; its documentation is at http://cpp-netlib.github.com/ and you can download the latest release from http://github.com/cpp-netlib/cpp-netlib/downloads .

The relevant type you'll want to use is boost::network::http::uri and is documented here.

Wednesday, October 5, 2022
 
1

Instead of cracking my head over a regex (URLs are very complicated), I just use filter_var(), and then attempt to ping the URL using cURL:

if (filter_var($url, FILTER_VALIDATE_URL) !== false)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_exec($ch);
    $status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status_code >= 200 && $status_code < 400)
    {
        echo 'URL is valid!';
    }
}
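
If the cURL extension is not available, a similar "head request" can be sketched with PHP's built-in get_headers() and a stream context. The function names status_from_headers and head_status are hypothetical, and passing a context as the third argument assumes PHP 7.1 or later. Because get_headers() follows redirects by default and returns the header lines of every hop, the sketch keeps the last status line it sees.

```php
<?php
/**
 * Hypothetical helper: extracts the status code of the last "HTTP/..."
 * status line in a list of header lines, or 0 if there is none.
 */
function status_from_headers(array $headers): int
{
    $code = 0;
    foreach ($headers as $line) {
        // Matches e.g. "HTTP/1.1 404 Not Found" and "HTTP/2 200".
        if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

/**
 * Hypothetical helper: sends a HEAD request to $url and returns the final
 * status code, or 0 if no response could be obtained.
 */
function head_status(string $url, int $timeout = 30): int
{
    $context = stream_context_create([
        'http' => ['method' => 'HEAD', 'timeout' => $timeout],
    ]);
    $headers = @get_headers($url, false, $context);
    return $headers === false ? 0 : status_from_headers($headers);
}
```

Some servers answer HEAD differently from GET, so a status that looks wrong may be worth rechecking with a normal request.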
Tuesday, August 9, 2022
 
myx
 
2

The header is not what tells Apache to display its 404 page. Rather, when Apache displays its 404 page, it sends a 404 header along with it. The header is meant to have meaning to the browser, not the server. Apache displays a 404 when it can't find the proper file to display. Since you're in a PHP script, Apache has already found a file it can display, and thus won't show its own 404 page.
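
In practice this means the PHP script has to send both the status and the page body itself. A minimal sketch, assuming PHP 5.4 or later for http_response_code(); the function names not_found_body and send_not_found and the markup are invented for illustration:

```php
<?php
/**
 * Hypothetical helper: builds a small HTML body for a missing page.
 * The path is escaped because it comes from the request.
 */
function not_found_body(string $requestedPath): string
{
    $safe = htmlspecialchars($requestedPath, ENT_QUOTES, 'UTF-8');
    return "<!DOCTYPE html>\n"
         . "<html><head><title>404 Not Found</title></head>"
         . "<body><h1>Not Found</h1>"
         . "<p>Nothing was found at {$safe}.</p></body></html>\n";
}

/**
 * Hypothetical helper: sends the 404 status and the body above.
 * It must run before any other output, or the status can no longer be set.
 */
function send_not_found(string $requestedPath): void
{
    http_response_code(404); // the header is for the browser, not for Apache
    echo not_found_body($requestedPath);
}
```

A script would call send_not_found($_SERVER['REQUEST_URI']) and then exit at the point where it decides the requested resource does not exist.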

Tuesday, August 23, 2022
 