
I have a text field where users can write anything.

For example:

Lorem Ipsum is simply dummy text. of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Now I would like to parse it and find all YouTube video URLs and their ids.

Any idea how that works?



A YouTube video URL may be encountered in a variety of formats:

  • latest short format:
  • iframe:
  • iframe (secure):
  • object param:
  • object embed:
  • watch:
  • users:
  • ytscreeningroom:
  • any/thing/goes!:
  • any/subdomain/too:
  • more params:
  • query may have dot:
  • nocookie domain:

Here is a PHP function with a commented regex that matches each of these URL forms and converts them to links (if they are not links already):

// Linkify youtube URLs which are not already links.
function linkifyYouTubeURLs($text) {
    $text = preg_replace('~(?#!js YouTubeId Rev:20160125_1800)
        # Match non-linked youtube URL in the wild. (Rev:20130823)
        https?://          # Required scheme. Either http or https.
        (?:[0-9A-Z-]+\.)?  # Optional subdomain.
        (?:                # Group host alternatives.
          youtu\.be/       # Either youtu.be,
        | youtube          # or youtube.com or
          (?:-nocookie)?   # youtube-nocookie.com
          \.com            # followed by
          \S*?             # Allow anything up to VIDEO_ID,
          [^\w\s-]         # but char before ID is non-ID char.
        )                  # End host alternatives.
        ([\w-]{11})        # $1: VIDEO_ID is exactly 11 chars.
        (?=[^\w-]|$)       # Assert next char is non-ID or EOS.
        (?!                # Assert URL is not pre-linked.
          [?=&+%\w.-]*     # Allow URL (query) remainder.
          (?:              # Group pre-linked alternatives.
            [\'"][^<>]*>   # Either inside a start tag,
          | </a>           # or inside <a> element text contents.
          )                # End recognized pre-linked alts.
        )                  # End negative lookahead assertion.
        [?=&+%\w.-]*       # Consume any URL (query) remainder.
        ~ix', '<a href="https://www.youtube.com/watch?v=$1">YouTube link: $1</a>',
        $text);
    return $text;
}

And here is a JavaScript version with the exact same regex (with comments removed):

// Linkify youtube URLs which are not already links.
function linkifyYouTubeURLs(text) {
    var re = /https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*?[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/ig;
    return text.replace(re,
        '<a href="https://www.youtube.com/watch?v=$1">YouTube link: $1</a>');
}

Notes:

  • The VIDEO_ID portion of the URL is captured in the one and only capture group: $1.
  • If you know that your text does not contain any pre-linked URLs, you can safely remove the negative lookahead assertion that tests for this condition (the assertion beginning with the comment "Assert URL is not pre-linked."). This will speed up the regex somewhat.
  • The replace string can be modified to suit. The one provided above simply creates a link to the generic "https://www.youtube.com/watch?v=VIDEO_ID" style URL and sets the link text to: "YouTube link: VIDEO_ID".
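
Since the question asks for the ids themselves rather than links, here is a minimal JavaScript sketch that collects them. It assumes the text contains no pre-linked URLs, so the negative lookahead is dropped as described in the notes; the function name extractYouTubeIds and the sample ids are made up for illustration.

```javascript
// Sketch: return every YouTube VIDEO_ID found in free text.
// Same pattern as linkifyYouTubeURLs(), minus the "not pre-linked" lookahead.
// extractYouTubeIds is a hypothetical helper name, not part of the answer.
function extractYouTubeIds(text) {
    var re = /https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*?[^\w\s-])([\w-]{11})(?=[^\w-]|$)[?=&+%\w.-]*/ig;
    var ids = [];
    var match;
    // With the g flag, exec() resumes after the previous match on each call.
    while ((match = re.exec(text)) !== null) {
        ids.push(match[1]); // capture group 1 is the VIDEO_ID
    }
    return ids;
}

// Placeholder ids, chosen only to be 11 characters long.
var ids = extractYouTubeIds(
    'See https://www.youtube.com/watch?v=abcdefghijk&t=10 and http://youtu.be/ABC-123_xyz.');
// ids is ['abcdefghijk', 'ABC-123_xyz']
```

The regex is built inside the function so that its lastIndex state is never shared between calls.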

Edit 2011-07-05: Added - hyphen to ID char class

Edit 2011-07-17: Fixed regex to consume any remaining part (e.g. query) of URL following YouTube ID. Added 'i' ignore-case modifier. Renamed function to camelCase. Improved pre-linked lookahead test.

Edit 2011-07-27: Added new "user" and "ytscreeningroom" formats of YouTube URLs.

Edit 2011-08-02: Simplified/generalized to handle new "any/thing/goes" YouTube URLs.

Edit 2011-08-25: Several modifications:

  • Added a Javascript version of: linkifyYouTubeURLs() function.
  • Previous version had the scheme (HTTP protocol) part optional and thus would match invalid URLs. Made the scheme part required.
  • Previous version used the \b word boundary anchor around the VIDEO_ID. However, this will not work if the VIDEO_ID begins or ends with a - dash. Fixed so that it handles this condition.
  • Changed the VIDEO_ID expression so that it must be exactly 11 characters long.
  • The previous version failed to exclude pre-linked URLs if they had a query string following the VIDEO_ID. Improved the negative lookahead assertion to fix this.
  • Added + and % to character class matching query string.
  • Changed PHP version regex delimiter from: % to a: ~.
  • Added a "Notes" section with some handy notes.
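
To see why the word boundary anchor fails on a leading dash, here is a small JavaScript sketch that isolates just the boundary checks around the id. A word boundary only exists between a word character and a non-word character, and both = and - are non-word characters, so there is no boundary in front of the dash. The sample string and variable names are made up for illustration.

```javascript
// Sketch only: the boundary checks around VIDEO_ID, isolated from the full regex.
var withWordBoundary = /\b([\w-]{11})\b/;
var withExplicitChecks = /[^\w-]([\w-]{11})(?=[^\w-]|$)/;

// Hypothetical query fragment whose 11-character id starts with a dash.
var sample = 'v=-abcdefghij';

// No word boundary exists between "=" and "-", so this finds nothing.
var boundaryMatch = withWordBoundary.exec(sample);      // null

// Requiring a non-id character before and after the id works.
var explicitMatch = withExplicitChecks.exec(sample);    // explicitMatch[1] is '-abcdefghij'
```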

Edit 2011-10-12: YouTube URL host part may now have any subdomain (not just www.).

Edit 2012-05-01: The consume URL section may now allow for '-'.

Edit 2013-08-23: Added additional format provided by @Mei. (The query part may have a . dot.)

Edit 2013-11-30: Added additional format provided by @CRONUS:

Edit 2016-01-25: Fixed regex to handle error case provided by CRONUS.

Friday, October 21, 2022

You could use preg_match to get the IDs. I will cover the expressions themselves later in this answer, but here is the basic idea of how to use preg_match:

preg_match('expression(video_id)', "", $matches);
$video_id = $matches[1];

Here is a breakdown of the expressions for each type of possible input you asked about. I included a link for each showing some test cases and the results.

  1. For YouTube URLs such as, you could use this expression:

  2. YouTube embed codes can either look like this (some extraneous stuff clipped):

    <object width="640" height="390">
        <param name="movie" value="

    Or like this:

    <iframe ... src="" ... </iframe>

    So an expression to get the ID from either style would be this:

  3. Vimeo URLs look like<integer>, as far as I can tell. The lowest I found was simply, and I don't know if there's an upper limit, but I'll assume for now that it's limited to 10 digits. Hopefully someone can correct me if they are aware of the details. This expression could be used:[0-9]{1,10})
  4. Vimeo embed code takes this form:

    <iframe src="<integer>" width="400" ...

    So you could use this expression:[0-9]{1,10})

    Alternately, if the length of the numbers may eventually exceed 10, you could use:[0-9]*)"

    Bear in mind that the " will need to be escaped with a \ if you are enclosing the expression in double quotes.

In summary, I'm not sure how you wanted to implement this, but you could either combine all expressions with |, or you could match each one separately. Add a comment to this answer if you want me to provide further details on how to combine the expressions.
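
As a rough illustration of the "match each one separately" option, here is a JavaScript sketch that tries one pattern per site in turn. The patterns are assumptions written for this sketch, not the expressions from this answer, and the names VIDEO_PATTERNS and findVideoId are hypothetical.

```javascript
// Sketch: try each site's pattern separately and report the first hit.
// Both patterns are illustrative assumptions, not the answer's own expressions.
var VIDEO_PATTERNS = [
    // YouTube watch, embed and /v/ style URLs; the id is assumed to be 11 chars.
    { site: 'youtube', re: /youtube\.com\/(?:watch\?v=|embed\/|v\/)([\w-]{11})/i },
    // Vimeo page or player URLs; the id is assumed to be 1 to 10 digits.
    { site: 'vimeo', re: /vimeo\.com\/(?:video\/)?([0-9]{1,10})/i }
];

function findVideoId(input) {
    for (var i = 0; i < VIDEO_PATTERNS.length; i++) {
        var m = VIDEO_PATTERNS[i].re.exec(input);
        if (m) {
            return { site: VIDEO_PATTERNS[i].site, id: m[1] };
        }
    }
    return null; // nothing recognized
}

var fromUrl = findVideoId('http://vimeo.com/12345678');
var fromEmbed = findVideoId('<iframe src="http://www.youtube.com/embed/abcdefghijk"></iframe>');
```

Keeping the patterns in a list makes it easy to add another site later without touching the matching loop.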

Saturday, December 24, 2022
Sunday, September 4, 2022

Note that this will allow any Unicode digit, not just 0-9. You might prefer:

char c = string.charAt(0);
isDigit = (c >= '0' && c <= '9');

Or the slower regex solutions:

s.substring(0, 1).matches("\\d")
// or the equivalent
s.substring(0, 1).matches("[0-9]")

However, with any of these methods, you must first be sure that the string isn't empty. If it is, charAt(0) and substring(0, 1) will throw a StringIndexOutOfBoundsException. startsWith does not have this problem.

To make the entire condition one line and avoid length checks, you can alter the regexes to the following:

s.matches("\\d.*")
// or the equivalent
s.matches("[0-9].*")

If the condition does not appear in a tight loop in your program, the small performance hit for using regular expressions is not likely to be noticeable.

Saturday, August 6, 2022

This will recursively traverse the /path/to/folder directory and list only the symbolic links:

ls -lR /path/to/folder | grep ^l

If your intention is to follow the symbolic links too, you should use your find command but you should include the -L option; in fact the find man page says:

   -L     Follow symbolic links.  When find examines or prints information
          about files, the information used shall be taken from the  prop‐
          erties  of  the file to which the link points, not from the link
          itself (unless it is a broken symbolic link or find is unable to
          examine  the file to which the link points).  Use of this option
          implies -noleaf.  If you later use the -P option,  -noleaf  will
          still  be  in  effect.   If -L is in effect and find discovers a
          symbolic link to a subdirectory during its search, the subdirec‐
          tory pointed to by the symbolic link will be searched.

          When the -L option is in effect, the -type predicate will always
          match against the type of the file that a symbolic  link  points
          to rather than the link itself (unless the symbolic link is bro‐
          ken).  Using -L causes the -lname and -ilname predicates  always
          to return false.

Then try this:

find -L /var/www/ -type l

If that finds nothing, the find man page explains why: with -L in effect, the -type l test matches only broken links, so use the -xtype test instead:

          l      symbolic link; this is never true if the -L option or the
                 -follow option is in effect, unless the symbolic link  is
                 broken.  If you want to search for symbolic links when -L
                 is in effect, use -xtype.


find -L /var/www/ -xtype l
Monday, September 12, 2022