
I need help building a regular expression that can properly match a URL inside free text.

  • scheme
    • One of the following: ftp, http, https (is ftps a protocol?)
  • optional user (and optional pass)
  • host (with support for IDNs)
    • support for www and sub-domain(s) (with support for IDNs)
    • basic filtering of TLDs ([a-zA-Z]{2,6} is enough I think)
  • optional port number
  • path (optional, with support for Unicode chars)
  • query (optional, with support for Unicode chars)
  • fragment (optional, with support for Unicode chars)

Here is what I could find out about sub-domains:

A "subdomain" expresses relative dependence, not absolute dependence: for example, wikipedia.org comprises a subdomain of the org domain, and en.wikipedia.org comprises a subdomain of the domain wikipedia.org. In theory, this subdivision can go down to 127 levels deep, and each DNS label can contain up to 63 characters, as long as the whole domain name does not exceed a total length of 255 characters.

Regarding the domain name itself, I couldn't find any reliable source, but I think the regular expression for non-IDNs (I'm not sure how to write an IDN-compatible version) is something like:

[0-9a-zA-Z][0-9a-zA-Z-]{2,62}
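
For comparison, here is a minimal Python sketch of a per-label check. Note that the pattern above requires at least three characters and allows a trailing hyphen, whereas hostname labels are 1 to 63 characters long and may not start or end with a hyphen. The function name `is_valid_hostname` is made up for illustration, and using Python's built-in `idna` codec to turn IDNs into their ASCII form is an assumption about what "IDN support" should mean here.

```python
import re

# One hostname label: 1-63 letters, digits or hyphens,
# with no hyphen at the start or at the end.
LABEL_RE = re.compile(r"(?!-)[0-9A-Za-z-]{1,63}(?<!-)")

def is_valid_hostname(name: str) -> bool:
    """Hypothetical helper: validate a (possibly internationalized) hostname."""
    try:
        # Assumption: IDNs are checked in their ASCII (punycode) form.
        ascii_name = name.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty or over-long labels, among other things.
        return False
    if ascii_name.endswith("."):
        ascii_name = ascii_name[:-1]  # tolerate a trailing root dot
    # 253 is the usual limit for the dotted text form of a name.
    if not ascii_name or len(ascii_name) > 253:
        return False
    return all(LABEL_RE.fullmatch(label) for label in ascii_name.split("."))
```

This only checks the shape of the name; it does not verify that the TLD exists.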

Can someone help me out with this regular expression or point me to a good direction?

 Answers

2

John Gruber, of Daring Fireball fame, had a post recently that detailed his quest for a good URL-recognizing regex string. What he came up with was this:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

It apparently copes with Unicode-containing URLs as well. You'd need to modify it slightly to get the rest of what you're looking for (the scheme, username, password, etc.). Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex is so write-once-have-no-clue-how-to-read-ever-again!).
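
To try the pattern out, here is a rough Python adaptation (a sketch, not Gruber's own code). Python's `re` module has no POSIX `[:punct:]` class, so the sketch spells out ASCII punctuation via `string.punctuation`; that substitution and the helper name `find_urls` are my assumptions.

```python
import re
import string

# Assumption: ASCII punctuation stands in for the POSIX [:punct:] class.
PUNCT = re.escape(string.punctuation)

URL_RE = re.compile(
    r"\b(([\w-]+://?|www[.])"      # a scheme followed by :/ or ://, or a leading www.
    r"[^\s()<>]+"                  # body: no whitespace, parentheses or angle brackets
    r"(?:\([\w\d]+\)"              # end with a balanced (...) group
    r"|([^" + PUNCT + r"\s]|/)))"  # or with a non-punctuation character or a slash
)

def find_urls(text):
    """Hypothetical helper: return every URL-like substring found in text."""
    return [m.group(0) for m in URL_RE.finditer(text)]

print(find_urls("See http://example.com/path, or (www.example.org/foo_(bar)) now."))
```

The last alternative is what keeps trailing commas, full stops and closing brackets out of the match.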

Saturday, December 3, 2022
3

There's no need to use a regex for this. PHP has an inbuilt function to do just this. Use parse_url():

$domain = parse_url($url, PHP_URL_HOST);
Tuesday, December 27, 2022
 
lher
 
3

This piece of code:

$domain = 'http://www.php.net/index.html';
$url    = parse_url($domain);
$tokens = explode('.', $url['host']);

print_r($tokens);

Will give you this data:

Array
(
    [0] => www
    [1] => php
    [2] => net
)

I believe there is no need for regexes, since it is very hard to parse a URL properly with them. From the resulting $tokens array you can easily extract any part of the host name.

Update:

print_r($url);

The $url array contains all the necessary details:

Array
(
    [scheme] => http
    [host] => www.php.net
    [path] => /index.html
)
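
The same idea carries over to other languages. As a sketch in Python (not PHP), the standard library's `urllib.parse.urlsplit` exposes the components listed in the question; the wrapper name `url_parts` and the sample URL with credentials and a port are made up for illustration.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Hypothetical helper: split a URL into the parts listed in the question."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,      # None when absent
        "password": parts.password,  # None when absent
        "host": parts.hostname,
        "port": parts.port,          # None when absent
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }

info = url_parts("http://user:pw@www.php.net:8080/index.html?q=1#top")
labels = info["host"].split(".")  # host labels, like $tokens above
print(info)
print(labels)
```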
Friday, September 30, 2022
 
5

The binding doesn't work because "the array is mutating, but the property itself is not changing". https://.com/a/10355003/603636

Using App.initialize and Ember.Router, views and controllers are now being automagically connected. There is very little reason to manually bind contacts in your view to the controller's content as you already have access to it.

Change your view's template to include:

{{! set isLoaded to true in your ajax success function }}
{{#if controller.isLoaded}}
  {{#each contact in controller}}
    {{view App.ContactListItemView contactBinding="contact"}}
  {{/each}}
{{else}}
  <tr>
    <td colspan="2">
       You have no contacts <br />
       :(
    </td>
  </tr>
{{/if}}
Tuesday, December 20, 2022
 
scgough
 
1

This is a very tricky question.

First: What are the resolutions for rotation/resizing? If you have sufficient pixels to avoid aliasing effects, then you might be OK, but if one or the other representation of the sign is very small (i.e., it's small in the collage or small in the sample shot you have), rotations to an arbitrary angle could be bad.

Also, are you sure you don't have shearing or other kinds of effects? I'm assuming a purely 2D rotation, where the axis of rotation runs through the center of the camera (i.e., a stop sign will just be an octagon, rotated, not a sheared octagon).

One thing you can try, if you have the patience and the sample data, is to implement Viola and Jones' face detection algorithm, but for the sign. Basically, you need a bunch of training data, where you have masked out the pixels that are interesting to you from the background/pixels that are not. Then, the algorithm is to randomly select pixels from that training data ('examples') and for each example, calculate a few hundred to a few thousand statistics ('features'). A feature can be anything from the current pixel intensity in the red channel to the summed intensity of a 5x5 neighborhood in the blue channel. Then, you build a histogram for each feature and try to find features that have foreground pixels separated from background pixels on the histogram (i.e., foreground is all on the left of the histogram, background on the right). You then choose the best features for the job, and run them to find the sign in the collage.
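
A toy Python sketch of the feature-ranking step, under my own assumptions: one histogram per feature, and "separation" scored as one minus the overlap of the foreground and background histograms. This is not the dissertation's method and not Viola-Jones proper, and the function names and `(r, g, b)` pixel tuples are made up for illustration.

```python
def normalized_histogram(values, bins, lo, hi):
    """Histogram of values over [lo, hi], scaled so the bins sum to 1."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid a zero bin width when hi == lo
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

def separation(fg_values, bg_values, bins=16):
    """1.0 when the two histograms share no bins, 0.0 when they are identical."""
    lo = min(min(fg_values), min(bg_values))
    hi = max(max(fg_values), max(bg_values))
    fg = normalized_histogram(fg_values, bins, lo, hi)
    bg = normalized_histogram(bg_values, bins, lo, hi)
    return 1.0 - sum(min(f, b) for f, b in zip(fg, bg))

def rank_features(features, fg_pixels, bg_pixels):
    """features maps a name to a function pixel -> number; best feature first."""
    scores = {
        name: separation([f(p) for p in fg_pixels], [f(p) for p in bg_pixels])
        for name, f in features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Made-up (r, g, b) samples: reddish sign pixels vs. dark background pixels.
sign = [(200, 20, 30), (220, 30, 25), (210, 25, 20)]
background = [(40, 30, 25), (60, 22, 28), (50, 28, 30)]
features = {"red": lambda p: p[0], "green": lambda p: p[1]}
print(rank_features(features, sign, background))
```

With real data you would have thousands of examples and neighborhood-based features, and you would probably combine several of the top-ranked features rather than rely on one.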

That is a brief summary of a friend of mine's dissertation research. This kind of problem is hard to solve easily, and easy to make a bad solution for.

If you just have one sign and one collage and only want to have one solution, you can basically correlate the sign with the collage. Pad the smaller image with zeroes so that it's the same size as the larger, take the FFT of each one, then do a point-by-point multiplication (using the complex conjugate of one of the two transforms, so that the result is a correlation rather than a convolution). Then, perform an inverse FFT on the result. You should see a spike in the location of the sign in the collage, depending on the severity of rotation and scaling (if you believe that they are very different, then you might need to experiment with a variety of different scaling and rotation techniques).

This second approach is easily done in MATLAB; otherwise, you'll need a library like FFTW to pull it off.
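
To make the idea concrete without any library, here is a small Python sketch that uses a naive (slow) DFT in place of a real FFT, so it is only suitable for tiny grids. It multiplies one spectrum by the complex conjugate of the other, which gives a cross-correlation; a plain product would give a convolution, i.e. a match against the flipped sign. The function names and the toy 6x6 scene are made up, and rotation and scaling are not handled at all.

```python
import cmath

def dft(seq, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, x in enumerate(seq))
        out.append(acc / n if inverse else acc)
    return out

def dft2(grid, inverse=False):
    """2-D DFT: transform every row, then every column."""
    rows = [dft(r, inverse) for r in grid]
    cols = [dft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def pad(img, height, width):
    """Zero-pad a small image to height x width, aligned to the top-left."""
    return [[img[y][x] if y < len(img) and x < len(img[0]) else 0
             for x in range(width)] for y in range(height)]

def locate(template, scene):
    """Return (row, col) of the peak of the circular cross-correlation."""
    h, w = len(scene), len(scene[0])
    S = dft2(scene)
    T = dft2(pad(template, h, w))
    product = [[S[y][x] * T[y][x].conjugate() for x in range(w)]
               for y in range(h)]
    corr = dft2(product, inverse=True)
    return max(((y, x) for y in range(h) for x in range(w)),
               key=lambda p: corr[p[0]][p[1]].real)

# Toy data: a 2x2 "sign" pasted into an otherwise empty 6x6 "collage".
template = [[1, 2], [3, 4]]
scene = [[0] * 6 for _ in range(6)]
for dy in range(2):
    for dx in range(2):
        scene[2 + dy][3 + dx] = template[dy][dx]
print(locate(template, scene))
```

On real photos, plain correlation tends to peak on bright regions, so you would normally subtract the mean or normalize first.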

Sunday, August 28, 2022