Viewed   336 times

I'm trying to find a reliable solution to extract a url from a string of characters. I have a site where users answer questions and in the source box, where they enter their source of information, I allow them to enter a url. I want to extract that url and make it a hyperlink. Similar to how Yahoo Answers does it.

Does anyone know a reliable solution that can do this?

All the solutions I have found work for some URL's but not for others.

Thanks

 Answers

5

John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Using preg_replace() as mentioned in the other answers, using the following regex should be one of the most accurate, if not the most accurate, method for detecting a link:

(?i)b((?:[a-z][w-]+:(?:/{1,3}|[a-z0-9%])|wwwd{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^s()<>]+|(([^s()<>]+|(([^s()<>]+)))*))+(?:(([^s()<>]+|(([^s()<>]+)))*)|[^s`!()[]{};:'".,<>?«»“”‘’]))

If you only wanted to match HTTP/HTTPS:

(?i)b((?:https?://|wwwd{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)(?:[^s()<>]+|(([^s()<>]+|(([^s()<>]+)))*))+(?:(([^s()<>]+|(([^s()<>]+)))*)|[^s`!()[]{};:'".,<>?«»“”‘’]))
Monday, December 26, 2022
1

I have created a very basic set of Regular Expressions for this. Don't expect them to be 100% reliable, and you may need to tweak them as you go.

$pattern = array(
  '/((?:[wd]+://)?(?:[w-d]+.)+[w-d]+(?:/[w-d]+)*(?:/|.[w-d]+)?(?:?[w-d]+=[w-d]+&?)?(?:#[w-d]*)?)/' , # URL
  '/([w-d]+@[w-d]+.[w-d]+)/' , # Email
  '/[@([^]]*)]/' , # Reply
  '/[([^]]*)]/' , # Bold
  '/{([^}]*)}/' , # Italics 
  '/_([^_]*)_/' , # Underline
  '/s{2}/' , # Linebreak
);
$replace = array(
  '<a href="$1">$1</a>' ,
  '<a href="mailto:$1">$1</a>' ,
  '<b class="reply">@$1</b>' ,
  '<b>$1</b>' ,
  '<i>$1</i>' ,
  '<u>$1</u>' ,
  '<br />'
);
$msg = preg_replace( $pattern , $replace , $msg );
return stripslashes( utf8_encode( $msg ) );
Monday, August 8, 2022
 
zisoft
 
2

Let's look at the requirements. You have some user-supplied plain text, which you want to display with hyperlinked URLs.

  1. The "http://" protocol prefix should be optional.
  2. Both domains and IP addresses should be accepted.
  3. Any valid top-level domain should be accepted, e.g. .aero and .xn--jxalpdlp.
  4. Port numbers should be allowed.
  5. URLs must be allowed in normal sentence contexts. For instance, in "Visit .com.", the final period is not part of the URL.
  6. You probably want to allow "https://" URLs as well, and perhaps others as well.
  7. As always when displaying user supplied text in HTML, you want to prevent cross-site scripting (XSS). Also, you'll want ampersands in URLs to be correctly escaped as &amp;.
  8. You probably don't need support for IPv6 addresses.
  9. Edit: As noted in the comments, support for email-adresses is definitely a plus.
  10. Edit: Only plain text input is to be supported – HTML tags in the input should not be honoured. (The Bitbucket version supports HTML input.)

Edit: Check out GitHub for the latest version, with support for email addresses, authenticated URLs, URLs in quotes and parentheses, HTML input, as well as an updated TLD list.

Here's my take:

<?php
$text = <<<EOD
Here are some URLs:
.com/questions/1188129/pregreplace-to-detect-html-php
Here's the answer: http://www.google.com/search?rls=en&q=42&ie=utf-8&oe=utf-8&hl=en. What was the question?
A quick look at http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax is helpful.
There is no place like 127.0.0.1! Except maybe http://news.bbc.co.uk/1/hi/england/surrey/8168892.stm?
Ports: 192.168.0.1:8080, https://example.net:1234/.
Beware of Greeks bringing internationalized top-level domains: xn--hxajbheg2az3al.xn--jxalpdlp.
And remember.Nobody is perfect.

<script>alert('Remember kids: Say no to XSS-attacks! Always HTML escape untrusted input!');</script>
EOD;

$rexProtocol = '(https?://)?';
$rexDomain   = '((?:[-a-zA-Z0-9]{1,63}.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}.){3}[0-9]{1,3})';
$rexPort     = '(:[0-9]{1,5})?';
$rexPath     = '(/[!$-/0-9:;=@_':;!a-zA-Zx7f-xff]*?)?';
$rexQuery    = '(?[!$-/0-9:;=@_':;!a-zA-Zx7f-xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_':;!a-zA-Zx7f-xff]+?)?';

// Solution 1:

function callback($match)
{
    // Prepend http:// if no protocol specified
    $completeUrl = $match[1] ? $match[0] : "http://{$match[0]}";

    return '<a href="' . $completeUrl . '">'
        . $match[2] . $match[3] . $match[4] . '</a>';
}

print "<pre>";
print preg_replace_callback("&\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:"]?(s|$))&",
    'callback', htmlspecialchars($text));
print "</pre>";
  • To properly escape < and & characters, I throw the whole text through htmlspecialchars before processing. This is not ideal, as the html escaping can cause misdetection of URL boundaries.
  • As demonstrated by the "And remember.Nobody is perfect." line (in which remember.Nobody is treated as an URL, because of the missing space), further checking on valid top-level domains might be in order.

Edit: The following code fixes the above two problems, but is quite a bit more verbose since I'm more or less re-implementing preg_replace_callback using preg_match.

// Solution 2:

$validTlds = array_fill_keys(explode(" ", ".aero .asia .biz .cat .com .coop .edu .gov .info .int .jobs .mil .mobi .museum .name .net .org .pro .tel .travel .ac .ad .ae .af .ag .ai .al .am .an .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bv .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gb .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sj .sk .sl .sm .sn .so .sr .st .su .sv .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tp .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .yu .za .zm .zw .xn--0zwm56d .xn--11b5bs3a9aj6g .xn--80akhbyknj4f .xn--9t4b11yi5a .xn--deba0ad .xn--g6w251d .xn--hgbk6aj7f53bba .xn--hlcj6aya9esc7a .xn--jxalpdlp .xn--kgbechtv .xn--zckzah .arpa"), true);

$position = 0;
while (preg_match("{\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:"]?(s|$))}", $text, &$match, PREG_OFFSET_CAPTURE, $position))
{
    list($url, $urlPosition) = $match[0];

    // Print the text leading up to the URL.
    print(htmlspecialchars(substr($text, $position, $urlPosition - $position)));

    $domain = $match[2][0];
    $port   = $match[3][0];
    $path   = $match[4][0];

    // Check if the TLD is valid - or that $domain is an IP address.
    $tld = strtolower(strrchr($domain, '.'));
    if (preg_match('{.[0-9]{1,3}}', $tld) || isset($validTlds[$tld]))
    {
        // Prepend http:// if no protocol specified
        $completeUrl = $match[1][0] ? $url : "http://$url";

        // Print the hyperlink.
        printf('<a href="%s">%s</a>', htmlspecialchars($completeUrl), htmlspecialchars("$domain$port$path"));
    }
    else
    {
        // Not a valid URL.
        print(htmlspecialchars($url));
    }

    // Continue text parsing from after the URL.
    $position = $urlPosition + strlen($url);
}

// Print the remainder of the text.
print(htmlspecialchars(substr($text, $position)));
Thursday, October 20, 2022
3

I just use URI.js -- makes it easy.

var source = "Hello www.example.com,n"
    + "http://google.com is a search engine, like http://www.bing.comn"
    + "http://exämple.org/foo.html?baz=la#bumm is an IDN URL,n"
    + "http://123.123.123.123/foo.html is IPv4 and "
    + "http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html is IPv6.n"
    + "links can also be in parens (http://example.org) "
    + "or quotes »http://example.org«.";

var result = URI.withinString(source, function(url) {
    return "<a>" + url + "</a>";
});

/* result is:
Hello <a>www.example.com</a>,
<a>http://google.com</a> is a search engine, like <a>http://www.bing.com</a>
<a>http://exämple.org/foo.html?baz=la#bumm</a> is an IDN URL,
<a>http://123.123.123.123/foo.html</a> is IPv4 and <a>http://fe80:0000:0000:0000:0204:61ff:fe9d:f156/foobar.html</a> is IPv6.
links can also be in parens (<a>http://example.org</a>) or quotes »<a>http://example.org</a>«.
*/
  • https://github.com/medialize/URI.js
  • http://medialize.github.io/URI.js/
Thursday, October 27, 2022
 
kremerd
 
1

Here is a quick and dirty solution using the String API instead of regular expressions which is faster.

Working principle:

  1. all is the whole text you want to search for

  2. s is the start pattern to look for, in this case it will be the first <img ... tag. If you have multiple img, consider iteration or extending the string to possible id="" or class="" tags

  3. ix is the position of the URL in all

  4. the last line gets the String from all starting at ix to the next " it finds

    String all = "<img src="http://www.01net.com/images/article/mea/150.100.790233.jpg""; // shortened it 
    String s = "<img src="";
    int ix = all.indexOf(s)+s.length();
    System.out.println(all.substring(ix, all.indexOf(""", ix+1)));
    

EDIT: A bit more details for advanced readers. As stated in other answers and comments you should not use the String API to parse HTML, as there are many specifics that are hard to catch. Note that regex won't help you either as it is a type-3 Chomsky language (regular) and therefore a subset of HTML which is type-2 (context sensitive, see Wiki). In production use a DOM-parser like jsoup. For quick hacking or known style a String API solution will probably work just fine and add less overhead.

Thursday, December 1, 2022
 
azuken
 
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :