I need help building a regular expression that can properly match a URL inside free text. It should handle the following parts:

  • scheme
    • One of the following: ftp, http, https (is ftps a protocol?)
  • optional user (and optional pass)
  • host (with support for IDNs)
    • support for www and sub-domain(s) (with support for IDNs)
    • basic filtering of TLDs ([a-zA-Z]{2,6} is enough I think)
  • optional port number
  • path (optional, with support for Unicode chars)
  • query (optional, with support for Unicode chars)
  • fragment (optional, with support for Unicode chars)

Here is what I could find out about sub-domains:

A "subdomain" expresses relative dependence, not absolute dependence: for example, comprises a subdomain of the org domain, and comprises a subdomain of the domain. In theory, this subdivision can go down to 127 levels deep, and each DNS label can contain up to 63 characters, as long as the whole domain name does not exceed a total length of 255 characters.
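As far as I can tell, these length limits are easier to check outside the regex. A rough sketch, assuming the host is already plain ASCII (an IDN would first have to be converted to its Punycode form, since the limits apply to that form); hostWithinDnsLimits is just a name I made up:

```php
<?php
// Hypothetical helper: applies the three limits quoted above
// (255 characters overall, 127 levels, 63 characters per label)
// to an ASCII host name.
function hostWithinDnsLimits(string $host): bool
{
    if ($host === '' || strlen($host) > 255) {
        return false; // empty, or whole name longer than 255 characters
    }

    $labels = explode('.', $host);
    if (count($labels) > 127) {
        return false; // more than 127 levels
    }

    foreach ($labels as $label) {
        $length = strlen($label);
        if ($length < 1 || $length > 63) {
            return false; // empty label, or label longer than 63 characters
        }
    }

    return true;
}

var_dump(hostWithinDnsLimits('www.php.net'));                // bool(true)
var_dump(hostWithinDnsLimits(str_repeat('a', 64) . '.org')); // bool(false)
```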

Regarding the domain name itself, I couldn't find any reliable source, but I think the regular expression for non-IDNs (I'm not sure how to write an IDN-compatible version) is something like:


Can someone help me out with this regular expression or point me in the right direction?



John Gruber, of Daring Fireball fame, had a post recently that detailed his quest for a good URL-recognizing regex string. What he came up with was this:


It apparently copes with URLs containing Unicode as well. You'd need to modify it slightly to get the rest of what you're looking for: the scheme, username, password, etc. Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex is so write-once-have-no-clue-how-to-read-ever-again!).
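To give an idea of what covering the scheme, user, password and the other parts could look like, here is a rough sketch. It is not Gruber's pattern, just a simplified one I put together for the requirements in the question; findUrls and the sample URLs are made up, and the TLD check is only "two or more letters". It does not trim trailing punctuation after a path, query or fragment, and it does not enforce the DNS length limits.

```php
<?php
// Hypothetical helper; the pattern is a simplified sketch, not Gruber's regex.
function findUrls(string $text): array
{
    $pattern = '~
        \b(?:https?|ftp)://                                  # scheme
        (?:[^\s:@/?#]+(?::[^\s@/?#]*)?@)?                    # optional user[:pass]@
        (?:[\p{L}\p{N}](?:[\p{L}\p{N}-]*[\p{L}\p{N}])?\.)+   # www and sub-domain labels, Unicode allowed
        \p{L}{2,}                                            # TLD: letters only, at least two
        (?::\d{1,5})?                                        # optional port
        (?:/[^\s?#]*)?                                       # optional path
        (?:\?[^\s#]*)?                                       # optional query
        (?:\#\S*)?                                           # optional fragment
    ~xiu';

    // x = extended syntax (whitespace and comments), i = case-insensitive,
    // u = treat pattern and subject as UTF-8.
    preg_match_all($pattern, $text, $matches);

    return $matches[0] ?? [];
}

print_r(findUrls('See http://user:pass@www.example.com:8080/index.html?x=1#top and ftp://example.org. Also http://bücher.example/straße here.'));
```

The user/password part is tried first and simply dropped if no @ follows, so plain hosts still match.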

Saturday, December 3, 2022

There's no need to use a regex for this. PHP has a built-in function that does just this: parse_url().

$domain = parse_url($url, PHP_URL_HOST);
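The same function can return every component at once. A small sketch with a made-up URL; note that parse_url() expects a string that already is a URL, so it splits one up but will not find URLs inside free text:

```php
<?php
// Made-up URL, for illustration only.
$url = 'https://user:secret@www.example.com:8080/docs/index.html?lang=en#intro';

// Without a second argument, parse_url() returns an associative array
// containing only the components that are present in the URL.
$parts = parse_url($url);

if ($parts === false) {
    // parse_url() returns false for seriously malformed URLs.
    throw new InvalidArgumentException('Could not parse URL');
}

echo $parts['scheme'], "\n";   // https
echo $parts['user'], "\n";     // user
echo $parts['pass'], "\n";     // secret
echo $parts['host'], "\n";     // www.example.com
echo $parts['port'], "\n";     // 8080 (returned as an integer)
echo $parts['path'], "\n";     // /docs/index.html
echo $parts['query'], "\n";    // lang=en
echo $parts['fragment'], "\n"; // intro
```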
Tuesday, December 27, 2022

This piece of code:

$domain = 'http://www.php.net/index.html';
$url    = parse_url($domain);
$tokens = explode('.', $url['host']);


Will give you this data:

    [0] => www
    [1] => php
    [2] => net

I believe there is no need for regexes here, since it's very hard to parse a URL properly with them. From the resulting $tokens array you can easily extract any part of the host name.
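For example, pulling the pieces out of the tokens could look roughly like this. It is only a sketch: the variable names are mine, and it assumes a single-label TLD, so hosts under multi-part suffixes such as co.uk would need extra handling.

```php
<?php
// Same tokens as in the listing above.
$tokens = explode('.', 'www.php.net');

$tld        = array_pop($tokens); // last label: "net"
$name       = array_pop($tokens); // label before it: "php"
$subdomains = $tokens;            // whatever is left: ["www"]

echo $name . '.' . $tld, "\n";         // php.net
echo implode('.', $subdomains), "\n";  // www
```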



The $url array contains all the necessary details:

    [scheme] => http
    [host] => www.php.net
    [path] => /index.html
Friday, September 30, 2022

