# Algorithms for string similarity (better than Levenshtein and similar_text)? PHP, JS


Where can I find algorithms that score misplaced characters more accurately than PHP's `levenshtein()` and `similar_text()` functions?

Example:

``````
similar_text('jonas', 'xxjon', $similar); echo $similar; // returns 60
similar_text('jonas', 'asjon', $similar); echo $similar; // returns 60 <- although more similar!
echo levenshtein('jonas', 'xxjon'); // returns 4
echo levenshtein('jonas', 'asjon'); // returns 4  <- although more similar!
``````

/ Jonas


Here's a solution that I've come up with. It's based on Tim's suggestion of comparing the order of consecutive characters. Some results:

• jonas / jonax : 0.8
• jonas / sjona : 0.68
• jonas / sjonas : 0.66
• jonas / asjon : 0.52
• jonas / xxjon : 0.36

I'm sure it isn't perfect, and that it could be optimized, but nevertheless it seems to produce the results that I'm after... One weak spot is that when the strings have different lengths, it produces a different result when the arguments are swapped...

``````
static public function string_compare($str_a, $str_b)
{
    $length = strlen($str_a);
    $length_b = strlen($str_b);

    $i = 0;
    $segmentcount = 0;
    $segmentsinfo = array();
    $segment = '';
    while ($i < $length)
    {
        $char = substr($str_a, $i, 1);
        if (strpos($str_b, $char) !== FALSE)
        {
            $segment = $segment.$char;
            if (strpos($str_b, $segment) !== FALSE)
            {
                $segmentpos_a = $i - strlen($segment) + 1;
                $segmentpos_b = strpos($str_b, $segment);
                $positiondiff = abs($segmentpos_a - $segmentpos_b);
                $posfactor = ($length - $positiondiff) / $length_b; // <-- ?
                $lengthfactor = strlen($segment)/$length;
                $segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
            }
            else
            {
                $segment = '';
                $i--;
                $segmentcount++;
            }
        }
        else
        {
            $segment = '';
            $segmentcount++;
        }
        $i++;
    }

    // PHP 5.3 lambda in array_map
    $totalscore = array_sum(array_map(function($v) { return $v['score'];  }, $segmentsinfo));
    return $totalscore;
}
``````
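The weak spot mentioned above (swapping the arguments changes the score when the lengths differ) could be worked around by scoring both orders and averaging them. This is only a sketch of that idea, not part of the original solution: `symmetric_score` and the `$coverage` stand-in scorer are hypothetical names made up for this illustration.

```php
<?php
/**
 * Hypothetical helper (not from the original answer): makes any
 * two-argument scoring function order-independent by averaging
 * the score of (a, b) and the score of (b, a).
 */
function symmetric_score(callable $score, string $a, string $b): float
{
    return ($score($a, $b) + $score($b, $a)) / 2;
}

// Stand-in asymmetric scorer, for demonstration only:
// the share of characters of $a that also occur somewhere in $b.
$coverage = function (string $a, string $b): float {
    if ($a === '') {
        return 0.0;
    }
    $hits = 0;
    foreach (str_split($a) as $char) {
        if (strpos($b, $char) !== false) {
            $hits++;
        }
    }
    return $hits / strlen($a);
};

// The raw scorer depends on argument order; the wrapped one does not.
var_dump($coverage('jonas', 'jon') === $coverage('jon', 'jonas'));
var_dump(symmetric_score($coverage, 'jonas', 'jon') === symmetric_score($coverage, 'jon', 'jonas'));
```

Any callable with the same two-string signature can be passed in place of `$coverage`, including the `string_compare` method once it lives inside a class.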
Monday, August 1, 2022
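``````
function substr_with_ellipsis($string, $chars = 100)
{
    // Take up to $chars characters, then extend lazily to the next word boundary.
    if (!preg_match('/^.{0,' . $chars . '}(?:.*?)\b/iu', $string, $matches)) {
        return $string; // nothing matched (e.g. empty string): return the input unchanged
    }
    $new_string = $matches[0];
    // Append an ellipsis only when the string was actually shortened.
    return ($new_string === $string) ? $string : $new_string . '&hellip;';
}
``````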
Tuesday, December 20, 2022

This is a classic "password validation"-type problem. For this, the "rough recipe" is to check each condition with a lookahead, then we match everything.

``````
^(?=(?:[^A-Z]*[A-Z]){3,9}[^A-Z]*$)(?=(?:[^0-9]*[0-9]){5,50}[^0-9]*$)[A-Z0-9]*$
``````

I'll explain this one below, but here's a variation that I'll leave for you to figure out.

``````
^(?=(?:[^A-Z]*[A-Z]){3,9}[0-9]*$)(?=(?:[^0-9]*[0-9]){5,50}[A-Z]*$).*$
``````

Let's look at the first regex piece by piece.

1. We anchor the regex between the start-of-string `^` and end-of-string `$` assertions, ensuring that the match (if any) is the whole string.
2. We have two lookaheads: one for the capital letters, one for the digits.
3. After the lookaheads, `[A-Z0-9]*` matches the whole string (if it consists only of uppercase ASCII letters and digits). (Thanks to @TimPietzcker for pointing out that I was asleep at the wheel for starting out with a dot-star there.)

The `(?=(?:[^A-Z]*[A-Z]){3,9}[^A-Z]*$)` lookahead asserts that at the current position, i.e. the beginning of the string, we are able to match "any number of characters that are not capital letters, followed by a single capital letter", 3 to 9 times. This ensures we have enough capital letters. Note that the `{3,9}` is greedy, so we will match as many capital letters as possible. But we don't want to match more than we wish to allow, so after the expression quantified by `{3,9}`, the lookahead checks that we can match "zero or more" characters that are not capital letters, until the end of the string, marked by the anchor `$`.

The second lookahead works in similar fashion.
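To see both lookaheads at work, here is a small sketch that runs the first regex through PHP's `preg_match`. The function name `matches_rule` and the sample strings are made up for this illustration; only the pattern itself comes from the answer.

```php
<?php
/**
 * Hypothetical wrapper around the first pattern from the answer:
 * 3 to 9 capital letters, 5 to 50 digits, and nothing else.
 */
function matches_rule(string $input): bool
{
    // Single quotes keep PHP from treating `$` as the start of a variable.
    $pattern = '/^(?=(?:[^A-Z]*[A-Z]){3,9}[^A-Z]*$)'
             . '(?=(?:[^0-9]*[0-9]){5,50}[^0-9]*$)'
             . '[A-Z0-9]*$/';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $input) === 1;
}

// Sample inputs, chosen to exercise each condition separately.
$samples = [
    'ABC12345',        // 3 capitals, 5 digits
    'A1B2C345',        // same counts, interleaved
    'AB12345',         // only 2 capitals
    'ABC1234',         // only 4 digits
    'ABCDEFGHIJ12345', // 10 capitals, above the maximum of 9
    'ABC12345x',       // lowercase letter rejected by [A-Z0-9]*
];

foreach ($samples as $input) {
    printf("%-16s %s\n", $input, matches_rule($input) ? 'match' : 'no match');
}
```

Each lookahead starts from the beginning of the string and consumes nothing, which is why the letters and digits may appear in any order.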

For a more in-depth explanation of this technique, you may want to peruse the password validation section of this page about regex lookarounds.

In case you are interested, here is a token-by-token explanation of the technique.

``````
^                      the beginning of the string
(?=                    look ahead to see if there is:
  (?:                  group, but do not capture (between 3 and 9 times)
    [^A-Z]*            any character except: 'A' to 'Z' (0 or more times)
    [A-Z]              any character of: 'A' to 'Z'
  ){3,9}               end of grouping
  [^A-Z]*              any character except: 'A' to 'Z' (0 or more times)
  $                    before an optional \n, and the end of the string
)                      end of look-ahead
(?=                    look ahead to see if there is:
  (?:                  group, but do not capture (between 5 and 50 times)
    [^0-9]*            any character except: '0' to '9' (0 or more times)
    [0-9]              any character of: '0' to '9'
  ){5,50}              end of grouping
  [^0-9]*              any character except: '0' to '9' (0 or more times)
  $                    before an optional \n, and the end of the string
)                      end of look-ahead
[A-Z0-9]*              any character of: 'A' to 'Z', '0' to '9' (0 or more times)
$                      before an optional \n, and the end of the string
``````
Monday, September 5, 2022

Here's what worked best for me when trying to script this (in case anyone else comes across this like I did):

``````
$ pecl -d php_suffix=5.6 install <package>
$ pecl uninstall -r <package>

$ pecl -d php_suffix=7.0 install <package>
$ pecl uninstall -r <package>

$ pecl -d php_suffix=7.1 install <package>
$ pecl uninstall -r <package>
``````

The `-d php_suffix=<version>` piece allows you to set config values at run time vs pre-setting them with `pecl config-set`. The `uninstall -r` bit does not actually uninstall it (from the docs):

``````
vagrant@homestead:~$ pecl help uninstall
pecl uninstall [options] [channel/]<package> ...
Uninstalls one or more PEAR packages.  More than one package may be
specified at once.  Prefix with channel name to uninstall from a
channel not in your default channel (pecl.php.net)

Options:
...
  -r, --register-only
        do not remove files, only register the packages as not installed
...
``````

The uninstall line is necessary because otherwise the next install will remove any previously installed version, even if it was built for a different PHP version (e.g. installing an extension for PHP 7.0 would remove the 5.6 version if the package were still registered as installed).
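Since the same install/unregister pair repeats for every version, the command list could be generated rather than typed out. The sketch below is in PHP to match the rest of this page. It only builds the command strings and does not run them, and it uses only the flags shown above. `pecl_commands` is a hypothetical helper name, and `redis` is just a placeholder package name.

```php
<?php
/**
 * Hypothetical helper: builds (but does not execute) the
 * install/unregister command pair for each PHP version.
 *
 * @param string   $package  PECL package name (placeholder value below)
 * @param string[] $versions PHP version suffixes, e.g. ['5.6', '7.0', '7.1']
 * @return string[] command lines in the order they should be run
 */
function pecl_commands(string $package, array $versions): array
{
    $commands = [];
    foreach ($versions as $version) {
        // Install the extension against this PHP version.
        $commands[] = 'pecl -d php_suffix=' . escapeshellarg($version)
            . ' install ' . escapeshellarg($package);
        // Unregister it (files stay on disk) so the next install leaves it in place.
        $commands[] = 'pecl uninstall -r ' . escapeshellarg($package);
    }
    return $commands;
}

foreach (pecl_commands('redis', ['5.6', '7.0', '7.1']) as $command) {
    echo $command, PHP_EOL;
}
```

The printed lines can be reviewed first and then pasted into a shell or a provisioning script.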

Monday, December 12, 2022

Never used any of those, but they look interesting..

Take a look at Gearman as well.. more overhead in systems like these but you get other cool stuff :) Guess it depends on your needs ..

Friday, November 11, 2022