
I've run into a bit of a problem with a regex I'm using for human names.

$rexName = '/^[a-z\' -]$/i';

Suppose a user with the name Jürgen wishes to register? Or Böb? That's pretty commonplace in Europe. Is there a special notation for this?

EDIT: I just threw the name Jürgen at a regex generator, and it splits the word at the letter ü...

http://www.txt2re.com/index.php3?s=J%FCrgen+Blalock&submit=Show+Matches

EDIT2: All right, since checking for such specific things is hard, why not use a regex that simply checks for illegal characters?

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";

(Now, which of these can actually be used in a hacking attempt?)

For instance, this allows ' and - signs, yet you need a ; to make an attack work in SQL, and those will be stopped. Are there any other characters commonly used for HTML injection or SQL attacks that I'm missing?

 Answers

1

I would really say: don't try to validate names. One day or another, your code will meet a name that it thinks is "wrong"... And how do you think someone would react when an application tells them "your name is not valid"?

Depending on what you really want to achieve, you might consider using some kind of blacklist or filter to exclude the "not-names" you thought about. It may let some "bad names" pass but, at least, it shouldn't prevent anyone with a real name from accessing your application.

Here are a few examples of rules that come to mind:

  • no numbers
  • no special characters, such as "~{()}@^$%?;:/*§£ø and probably some others
  • no more than 3 spaces?
  • none of "admin", "support", "moderator", "test", and a few other obvious non-names that people tend to use when they don't want to type in their real name...
    • (but if they don't want to give you their name, they still won't, even if you forbid them from typing random letters: they could just use a real name... which is not theirs)

Yes, this is not perfect; and yes, it will let some non-names pass... But it's probably far better for your application than telling someone "your name is wrong" (yes, I insist ^^ )
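To make those rules concrete, here is a minimal sketch of such a filter. The function name findNameProblems, the reason strings and the reserved-word list are all made up for illustration, and the special-character list is restricted to ASCII because strpbrk() compares bytes, so multi-byte characters like § or ø would cause false positives on names like Jürgen.

```php
<?php
// Hypothetical helper: returns the reasons a submitted name looks like a
// "not-name". An empty array means no rule fired, not that the name is real.
function findNameProblems(string $name): array
{
    $problems = [];

    // \p{N} matches any Unicode number character; /u makes PCRE read UTF-8.
    if (preg_match('/\p{N}/u', $name) === 1) {
        $problems[] = 'contains a number';
    }

    // strpbrk() returns false when none of the listed ASCII characters occur.
    if (strpbrk($name, '"~{()}@^$%?;:/*') !== false) {
        $problems[] = 'contains a special character';
    }

    if (substr_count($name, ' ') > 3) {
        $problems[] = 'contains more than 3 spaces';
    }

    // Illustrative list only; extend it to suit the application.
    $reserved = ['admin', 'support', 'moderator', 'test'];
    if (in_array(strtolower(trim($name)), $reserved, true)) {
        $problems[] = 'is a reserved word';
    }

    return $problems;
}

var_dump(findNameProblems('Jürgen'));   // no rule fires
var_dump(findNameProblems("O'Brien"));  // no rule fires
var_dump(findNameProblems('Admin'));    // reserved word
var_dump(findNameProblems('ma{rtin'));  // special character
```

Returning a list of reasons rather than a plain true/false leaves the application free to treat a hit as a hint for a moderator instead of a hard rejection.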


And, to answer a comment you left under another answer:

I could just forbid the most common characters for SQL injection and XSS attacks,

About SQL injection: you must escape your data before sending it to the database; and if you always escape that data (you should!), you don't have to care about what users may or may not input: as it is always escaped, there is no risk for you.

The same goes for XSS: as long as you always escape your data when outputting it (you should!), there is no risk of injection ;-)
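As a rough illustration of both points, the sketch below stores a hostile-looking "name" and prints it back safely. Note that it swaps manual escaping for a PDO prepared statement, which achieves the same goal by keeping the value out of the SQL text. It assumes the pdo_sqlite extension is available; the in-memory database and the users table are hypothetical stand-ins for a real schema.

```php
<?php
// Assumption: pdo_sqlite is installed. The table is made up for this example.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (name TEXT)');

$name = "Robert'); DROP TABLE users; --<script>";

// The placeholder sends the value separately from the SQL statement.
$insert = $pdo->prepare('INSERT INTO users (name) VALUES (:name)');
$insert->execute([':name' => $name]);

$stored = $pdo->query('SELECT name FROM users')->fetchColumn();

// Escape at output time, for the context being written into (HTML here).
$safeHtml = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
echo $safeHtml, "\n";
```

The value is stored exactly as typed, quote included, and only transformed when it is written into HTML, so a legitimate name like O'Brien survives intact.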


EDIT: if you use that regex as it is, it will not work very well.

The following code:

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

will produce at least a warning:

Warning: preg_match() [function.preg-match]: Unknown modifier '{'

The / inside the character class is read as the closing delimiter, so PCRE treats the { that follows as a modifier. You must escape at least some of those special characters; I'll let you dig into the PCRE Patterns documentation for more information (there is really a lot to know about PCRE and regexes, and I won't be able to explain it all).

If you actually want to check that none of those characters appears in a given piece of data, you might end up with something like this:

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

(This is a quick and dirty suggestion, which has to be refined!)

This one says "ok" (well, I definitely hope my own name is OK!)
And the same example with a special character in the name, like this:

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'ma{rtin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

will say "bad name".

But please note that I have not fully tested this, and it probably needs more work! Do not use this on your site unless you have tested it very carefully!
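One way to sidestep the manual escaping is to let preg_quote() build the character class. This is only a sketch: the helper name containsForbiddenChar is invented here, and the character list is simply the one from the question.

```php
<?php
// Hypothetical helper: true when $value contains any forbidden character.
function containsForbiddenChar(string $value): bool
{
    // Single quotes stop PHP from interpolating "$"; preg_quote() escapes the
    // regex metacharacters plus the "/" delimiter passed as second argument.
    $forbidden = '<,"@/{}()*$%?=>:|;#';
    $pattern = '/[' . preg_quote($forbidden, '/') . ']/';

    return preg_match($pattern, $value) === 1;
}

var_dump(containsForbiddenChar('martin'));   // bool(false)
var_dump(containsForbiddenChar('ma{rtin'));  // bool(true)
```

Because the list lives in a plain string, adding or removing a character no longer requires thinking about which backslashes PHP and PCRE each need.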


Also note that a single quote can be useful when attempting an SQL injection... But it is also a character that is legal in some names... So, just excluding some characters might not be enough ;-)

Friday, August 19, 2022
4

This is a classic "password validation"-type problem. For this, the "rough recipe" is to check each condition with a lookahead, then we match everything.

^(?=(?:[^A-Z]*[A-Z]){3,9}[^A-Z]*$)(?=(?:[^0-9]*[0-9]){5,50}[^0-9]*$)[A-Z0-9]*$

I'll explain this one below, but here's a variation that I'll leave for you to figure out.

^(?=(?:[^A-Z]*[A-Z]){3,9}[0-9]*$)(?=(?:[^0-9]*[0-9]){5,50}[A-Z]*$).*$

Let's look at the first regex piece by piece.

  1. We anchor the regex between the head of string ^ and end of string $ assertions, ensuring that the match (if any) is the whole string.
  2. We have two lookaheads: one for the capital letters, one for the digits.
  3. After the lookaheads, [A-Z0-9]* matches the whole string (if it consists only of uppercase ASCII letters and digits). (Thanks to @TimPietzcker for pointing out that I was asleep at the wheel for starting out with a dot-star there.)

How do the lookaheads work?

The lookahead (?=(?:[^A-Z]*[A-Z]){3,9}[^A-Z]*$) asserts that at the current position, i.e. the beginning of the string, we are able to match "any number of characters that are not capital letters, followed by a single capital letter", 3 to 9 times. This ensures we have enough capital letters. Note that the {3,9} is greedy, so it will match as many capital letters as possible. But we don't want to allow more capital letters than the maximum, so after the group quantified by {3,9}, the lookahead checks that we can match zero or more characters that are not capital letters, all the way to the end of the string, marked by the anchor $.

The second lookahead works in similar fashion.

For a more in-depth explanation of this technique, you may want to peruse the password validation section of this page about regex lookarounds.
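To see the two lookaheads at work, here is a small sketch that runs the first pattern against a few sample strings in PHP. The sample values are made up; the only change to the pattern is the added / delimiters.

```php
<?php
// The first pattern from this answer, wrapped in "/" delimiters.
$pattern = '/^(?=(?:[^A-Z]*[A-Z]){3,9}[^A-Z]*$)(?=(?:[^0-9]*[0-9]){5,50}[^0-9]*$)[A-Z0-9]*$/';

$samples = [
    'ABC12345',         // 3 capitals, 5 digits
    'AB12345',          // only 2 capitals
    'ABC1234',          // only 4 digits
    'ABCDEFGHIJ12345',  // 10 capitals, one too many
    'abc12345',         // lowercase letters are not allowed at all
];

foreach ($samples as $sample) {
    $result = preg_match($pattern, $sample) === 1 ? 'match' : 'no match';
    echo $sample, ': ', $result, "\n";
}
```

Only the first sample matches. In PCRE, $ also matches just before a trailing newline, so adding the D modifier is worth considering if the input may end with one.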

In case you are interested, here is a token-by-token explanation of the technique.

^                      the beginning of the string
(?=                    look ahead to see if there is:
 (?:                   group, but do not capture (between 3 and 9 times)
  [^A-Z]*              any character except: 'A' to 'Z' (0 or more times)
   [A-Z]               any character of: 'A' to 'Z'
 ){3,9}                end of grouping
  [^A-Z]*              any character except: 'A' to 'Z' (0 or more times)
$                      before an optional \n, and the end of the string
)                      end of look-ahead
(?=                    look ahead to see if there is:
 (?:                   group, but do not capture (between 5 and 50 times)
  [^0-9]*              any character except: '0' to '9' (0 or more times)
   [0-9]               any character of: '0' to '9'
 ){5,50}               end of grouping
  [^0-9]*              any character except: '0' to '9' (0 or more times)
$                      before an optional \n, and the end of the string
)                      end of look-ahead
[A-Z0-9]*              any character of: 'A' to 'Z', '0' to '9' (0 or more times)
$                      before an optional \n, and the end of the string
Monday, September 5, 2022
 
5

The Arabic regex is:

[\u0600-\u06FF]

Actually, ?-? is a subset of this Arabic range, so I think you can remove them from the pattern.

So, in JS it will be

/^[a-z0-9+,()\/'\s\u0600-\u06FF-]+$/i

See regex demo
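For comparison with the PHP snippets elsewhere on this page, the same character class can be sketched for preg_match(). This is an adaptation, not part of the original answer: it assumes UTF-8 input, writes the code points in PCRE's \x{...} form, and adds the u modifier.

```php
<?php
// Assumption: the subject strings are UTF-8. PCRE writes code points as
// \x{...} and needs the /u modifier to treat the pattern and subject as UTF-8.
$pattern = '/^[a-z0-9+,()\/\'\s\x{0600}-\x{06FF}-]+$/iu';

var_dump(preg_match($pattern, 'مرحبا 123'));        // int(1)
var_dump(preg_match($pattern, 'Street 5, (old)'));  // int(1)
var_dump(preg_match($pattern, 'hello!'));           // int(0)
```

The trailing hyphen sits last in the class so that it is read as a literal character rather than as a range operator.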

Tuesday, October 11, 2022
4

A regex solution is easy. Simply assert a negative lookahead at the start of the string like so: (With comments...)

if (preg_match('%
    # Match non-http .com or .net domain.
    ^             # Anchor to start of string.
    (?!           # Assert that this URL is NOT...
      https?://   # HTTP or HTTPS scheme with
      (?:www\.)?  # optional www. subdomain.
    )             # End negative lookahead.
    .*            # Match up to TLD.
    \.            # Last literal dot before TLD.
    (?:           # Group for TLD alternatives.
      net         # Either .net
    | com         # or .com.
    )             # End group of TLD alts.
    $             # Anchor to end of string.
    %xi', $text)) {
    // It matches.
} else {
    // It doesn't match.
}

Note that since any URL that begins with http://www. also begins with http://, the expression for the optional www. is not necessary. Here is a shorter version:

if (preg_match('%^(?!https?://).*\.(?:net|com)$%i', $text)) {
    // It matches.
} else {
    // It doesn't match.
}

Simple regex to the rescue!
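A quick way to check the behaviour is to run the short pattern over a handful of inputs. The sample values below are invented for illustration, and the pattern is written with the dot before the TLD escaped as \. so that it only matches a literal dot.

```php
<?php
// Short form of the pattern, with the dot before the TLD escaped.
$pattern = '%^(?!https?://).*\.(?:net|com)$%i';

$samples = [
    'example.com',              // no scheme, .com
    'ftp://example.net',        // non-HTTP scheme, .net
    'http://example.com',       // rejected by the negative lookahead
    'https://www.example.net',  // rejected by the negative lookahead
    'example.org',              // wrong TLD
];

foreach ($samples as $sample) {
    $result = preg_match($pattern, $sample) === 1 ? 'match' : 'no match';
    echo $sample, ': ', $result, "\n";
}
```

The first two samples match and the last three do not. Using % as the delimiter is what allows the slashes in https?:// to stay unescaped.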

Wednesday, September 7, 2022
 
4

This can be done in a lot of ways, and also using regex. I'd personally use an array approach. First of all, I'd define the mangling table this way:

$table = array(
    'id' => 't.id',
    'name' => 't.name',
    'label' => 't.label',
    'related_value' => 'r.related_value'
);

This makes the str_replace() call a lot easier:

function mangling(&$v, $k, $table)
{
    if (($k & 1) == 0)
        $v = str_replace(array_keys($table), array_values($table), $v);
}

$spans = explode("'", ' ' . $input);
array_walk($spans, 'mangling', $table);
$output = implode("'", $spans);
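Here is a self-contained sketch of the same idea with a sample input, so the effect is visible. The wrapper name qualifyColumns and the input string are invented for illustration, and the sketch assumes the quoted literals contain no escaped single quotes.

```php
<?php
// Hypothetical wrapper: qualify column names only in the parts of $input that
// sit outside single-quoted literals.
function qualifyColumns(string $input, array $table): string
{
    // A space is prepended before splitting, as in the snippet above. After
    // explode(), even indexes are outside quotes and odd indexes are inside.
    $spans = explode("'", ' ' . $input);

    foreach ($spans as $index => $span) {
        if (($index & 1) === 0) {
            $spans[$index] = str_replace(array_keys($table), array_values($table), $span);
        }
    }

    // Drop the space that was prepended before splitting.
    return substr(implode("'", $spans), 1);
}

$table = ['id' => 't.id', 'name' => 't.name'];
echo qualifyColumns("id = 5 AND name = 'id of name'", $table), "\n";
// prints: t.id = 5 AND t.name = 'id of name'
```

Keep in mind that str_replace() works on substrings, so a column such as paid would also be rewritten (to pat.id); a word-boundary regex would be needed to avoid that.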
Saturday, October 29, 2022
 
aladdin
 