# How does similar_text work?

Viewed   78 times

I just found the similar_text function and was playing around with it, but the percentage output always suprises me. See the examples below.

I tried to find information on the algorithm used as mentioned on php: `similar_text()`Docs:

``````<?php
\$p = 0;
similar_text('aaaaaaaaaa', 'aaaaa', \$p);
echo \$p . "<hr>";
//66.666666666667
//Since 5 out of 10 chars match, I would expect a 50% match

similar_text('aaaaaaaaaaaaaaaaaaaa', 'aaaaa', \$p);
echo \$p . "<hr>";
//40
//5 out of 20 > not 25% ?

similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', \$p);
echo \$p . "<hr>";
//9.5238095238095
//5 out of 100 > not 5% ?

//Example from PHP.net
//Why is turning the strings around changing the result?

similar_text('PHP IS GREAT', 'WITH MYSQL', \$p);
echo \$p . "<hr>"; //27.272727272727

similar_text('WITH MYSQL', 'PHP IS GREAT', \$p);
echo \$p . "<hr>"; //18.181818181818

?>
``````

Can anybody explain how this actually works?

Update:

Thanks to the comments I found that the percentage is actually calculated using the number of similar charactors * 200 / length1 + lenght 2

``````Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
``````

So that explains why the percenatges are higher then expected. With a string with 5 out of 95 it turns out 10, so that I can use.

``````similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', \$p);
echo \$p . "<hr>";
//10
//5 out of 95 = 5 * 200 / (5 + 95) = 10
``````

But I still cant figure out why PHP returns a different result on turning the strings around. The JS code provided by dfsq doesn't do this. Looking at the source code in PHP I can only find a difference in the following line, but i'm not a c programmer. Some insight in what the difference is, would be appreciated.

In JS:

``````for (l = 0;(p + l < firstLength) && (q + l < secondLength) && (first.charAt(p + l) === second.charAt(q + l)); l++);
``````

In PHP: (php_similar_str function)

``````for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
``````

Source:

``````/* {{{ proto int similar_text(string str1, string str2 [, float percent])
Calculates the similarity between two strings */
PHP_FUNCTION(similar_text)
{
char *t1, *t2;
zval **percent = NULL;
int ac = ZEND_NUM_ARGS();
int sim;
int t1_len, t2_len;

if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
return;
}

if (ac > 2) {
convert_to_double_ex(percent);
}

if (t1_len + t2_len == 0) {
if (ac > 2) {
Z_DVAL_PP(percent) = 0;
}

RETURN_LONG(0);
}

sim = php_similar_char(t1, t1_len, t2, t2_len);

if (ac > 2) {
Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
}

RETURN_LONG(sim);
}
/* }}} */

/* {{{ php_similar_str
*/
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
char *p, *q;
char *end1 = (char *) txt1 + len1;
char *end2 = (char *) txt2 + len2;
int l;

*max = 0;
for (p = (char *) txt1; p < end1; p++) {
for (q = (char *) txt2; q < end2; q++) {
for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
if (l > *max) {
*max = l;
*pos1 = p - txt1;
*pos2 = q - txt2;
}
}
}
}
/* }}} */

/* {{{ php_similar_char
*/
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
int sum;
int pos1, pos2, max;

php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

if ((sum = max)) {
if (pos1 && pos2) {
sum += php_similar_char(txt1, pos1,
txt2, pos2);
}
if ((pos1 + max < len1) && (pos2 + max < len2)) {
sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
txt2 + pos2 + max, len2 - pos2 - max);
}
}

return sum;
}
/* }}} */
``````

Source in Javascript: similar text port to javascript

4

It would indeed seem the function uses different logic depending of the parameter order. I think there are two things at play.

First, see this example:

``````echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2
``````

It seems to be that it is testing "how many times any distinct char on param1 is found in param2", and thus result would be different if you swap the params around. It has been reported as a bug, which has been closed as "working as expected".

Now, the above is the same for both PHP and javascript implementations - paremeter order has an impact, so saying that JS code wouldn't do this is wrong. This is argued in the bug entry as intended behaviour.

Second - what doesn't seem correct is the MYSQL/PHP word example. With that, javascript version gives 3 irrelevant of the order of params, whereas PHP gives 2 and 3 (and due to that, percentage is equally different). Now, the phrases "PHP IS GREAT" and "WITH MYSQL" should have 5 characters in common, irrelevant of which way you compare: H, I, S and T, one each, plus one for empty space. In order they have 3 characters, 'H', ' ' and 'S', so if you look at the ordering, correct answer should be 3 both ways. I modified the C code to a runnable version, and added some output, so one can see what is happening there (codepad link):

``````#include<stdio.h>

/* {{{ php_similar_str
*/
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
char *p, *q;
char *end1 = (char *) txt1 + len1;
char *end2 = (char *) txt2 + len2;
int l;

*max = 0;
for (p = (char *) txt1; p < end1; p++) {
for (q = (char *) txt2; q < end2; q++) {
for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
if (l > *max) {
*max = l;
*pos1 = p - txt1;
*pos2 = q - txt2;
}
}
}
}
/* }}} */

/* {{{ php_similar_char
*/
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
int sum;
int pos1, pos2, max;

php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

if ((sum = max)) {
if (pos1 && pos2) {
printf("txt here %s,%sn", txt1, txt2);
sum += php_similar_char(txt1, pos1,
txt2, pos2);
}
if ((pos1 + max < len1) && (pos2 + max < len2)) {
printf("txt here %s,%sn", txt1+ pos1 + max, txt2+ pos2 + max);
sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
txt2 + pos2 + max, len2 - pos2 - max);
}
}

return sum;
}
/* }}} */
int main(void)
{
printf("Found %d similar charsn",
php_similar_char("PHP IS GREAT", 12, "WITH MYSQL", 10));
printf("Found %d similar charsn",
php_similar_char("WITH MYSQL", 10,"PHP IS GREAT", 12));
return 0;
}
``````

the result is output:

``````txt here PHP IS GREAT,WITH MYSQL
txt here P IS GREAT, MYSQL
txt here IS GREAT,MYSQL
txt here IS GREAT,MYSQL
txt here  GREAT,QL
Found 3 similar chars
txt here WITH MYSQL,PHP IS GREAT
txt here TH MYSQL,S GREAT
Found 2 similar chars
``````

So one can see that on the first comparison, the function found 'H', ' ' and 'S', but not 'T', and got the result of 3. The second comparison found 'I' and 'T' but not 'H', ' ' or 'S', and thus got the result of 2.

The reason for these results can be seen from the output: algorithm takes the first letter in the first string that second string contains, counts that, and throws away the chars before that from the second string. That is why it misses the characters in-between, and that's the thing causing the difference when you change the character order.

What happens there might be intentional or it might not. However, that's not how javascript version works. If you print out the same things in the javascript version, you get this:

``````txt here: PHP, WIT
txt here: P IS GREAT,  MYSQL
txt here: IS GREAT, MYSQL
txt here: IS, MY
txt here:  GREAT, QL
Found 3 similar chars
txt here: WITH, PHP
txt here: W, P
txt here: TH MYSQL, S GREAT
Found 3 similar chars
``````

showing that javascript version does it in a different way. What the javascript version does is that it finds 'H', ' ' and 'S' being in the same order in the first comparison, and the same 'H', ' ' and 'S' also on the second one - so in this case the order of params doesn't matter.

As the javascript is meant to duplicate the code of PHP function, it needs to behave identically, so I submitted bug report based on analysis of @Khez and the fix, which has been merged now.

Thursday, August 25, 2022
3

From the documentation:

List comprehensions provide a concise way to create lists. Common applications are to make new lists where each element is the result of some operations applied to each member of another sequence or iterable, or to create a subsequence of those elements that satisfy a certain condition.

About your question, the list comprehension does the same thing as the following "plain" Python code:

``````>>> l = []
>>> for x in range(10):
...     l.append(x**2)
>>> l
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
``````

How do you write it in one line? Hmm...we can...probably...use `map()` with `lambda`:

``````>>> list(map(lambda x: x**2, range(10)))
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
``````

But isn't it clearer and simpler to just use a list comprehension?

``````>>> [x**2 for x in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
``````

Basically, we can do anything with `x`. Not only `x**2`. For example, run a method of `x`:

``````>>> [x.strip() for x in ('foon', 'barn', 'bazn')]
['foo', 'bar', 'baz']
``````

Or use `x` as another function's argument:

``````>>> [int(x) for x in ('1', '2', '3')]
[1, 2, 3]
``````

We can also, for example, use `x` as the key of a `dict` object. Let's see:

``````>>> d = {'foo': '10', 'bar': '20', 'baz': '30'}
>>> [d[x] for x in ['foo', 'baz']]
['10', '30']
``````

How about a combination?

``````>>> d = {'foo': '10', 'bar': '20', 'baz': '30'}
>>> [int(d[x].rstrip('0')) for x in ['foo', 'baz']]
[1, 3]
``````

And so on.

You can also use `if` or `if...else` in a list comprehension. For example, you only want odd numbers in `range(10)`. You can do:

``````>>> l = []
>>> for x in range(10):
...     if x%2:
...         l.append(x)
>>> l
[1, 3, 5, 7, 9]
``````

Ah that's too complex. What about the following version?

``````>>> [x for x in range(10) if x%2]
[1, 3, 5, 7, 9]
``````

To use an `if...else` ternary expression, you need put the `if ... else ...` after `x`, not after `range(10)`:

``````>>> [i if i%2 != 0 else None for i in range(10)]
[None, 1, None, 3, None, 5, None, 7, None, 9]
``````

Have you heard about nested list comprehension? You can put two or more `for`s in one list comprehension. For example:

``````>>> [i for x in [[1, 2, 3], [4, 5, 6]] for i in x]
[1, 2, 3, 4, 5, 6]

>>> [j for x in [[[1, 2], [3]], [[4, 5], [6]]] for i in x for j in i]
[1, 2, 3, 4, 5, 6]
``````

Let's talk about the first part, `for x in [[1, 2, 3], [4, 5, 6]]` which gives `[1, 2, 3]` and `[4, 5, 6]`. Then, `for i in x` gives `1`, `2`, `3` and `4`, `5`, `6`.

Warning: You always need put `for x in [[1, 2, 3], [4, 5, 6]]` before `for i in x`:

``````>>> [j for j in x for x in [[1, 2, 3], [4, 5, 6]]]
Traceback (most recent call last):
File "<input>", line 1, in <module>
NameError: name 'x' is not defined
``````

We also have set comprehensions, dict comprehensions, and generator expressions.

set comprehensions and list comprehensions are basically the same, but the former returns a set instead of a list:

``````>>> {x for x in [1, 1, 2, 3, 3, 1]}
{1, 2, 3}
``````

It's the same as:

``````>>> set([i for i in [1, 1, 2, 3, 3, 1]])
{1, 2, 3}
``````

A dict comprehension looks like a set comprehension, but it uses `{key: value for key, value in ...}` or `{i: i for i in ...}` instead of `{i for i in ...}`.

For example:

``````>>> {i: i**2 for i in range(5)}
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
``````

And it equals:

``````>>> d = {}
>>> for i in range(5):
...     d[i] = i**2
>>> d
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
``````

Does `(i for i in range(5))` give a tuple? No!, it's a generator expression. Which returns a generator:

``````>>> (i for i in range(5))
<generator object <genexpr> at 0x7f52703fbca8>
``````

It's the same as:

``````>>> def gen():
...     for i in range(5):
...         yield i
>>> gen()
<generator object gen at 0x7f5270380db0>
``````

And you can use it as a generator:

``````>>> gen = (i for i in range(5))
>>> next(gen)
0
>>> next(gen)
1
>>> list(gen)
[2, 3, 4]
>>> next(gen)
Traceback (most recent call last):
File "<input>", line 1, in <module>
StopIteration
``````

Note: If you use a list comprehension inside a function, you don't need the `[]` if that function could loop over a generator. For example, `sum()`:

``````>>> sum(i**2 for i in range(5))
30
``````

Related (about generators): Understanding Generators in Python.

Wednesday, October 5, 2022
4

The C and C++ standard do not have any requirement on how it has to work. A complying compiler may well decide to emit chained lists, `std::stack<boost::any>` or even magical pony dust (as per @Xeo's comment) under the hood.

However, it is usually implemented as follows, even though transformations like inlining or passing arguments in the CPU registers may not leave anything of the discussed code.

Please also note that this answer specifically describes a downwards growing stack in the visuals below; also, this answer is a simplification just to demonstrate the scheme (please see https://en.wikipedia.org/wiki/Stack_frame).

## How can a function be called with a non-fixed number of arguments

This is possible because the underlying machine architecture has a so-called "stack" for every thread. The stack is used to pass arguments to functions. For example, when you have:

``````foobar("%d%d%d", 3,2,1);
``````

Then this compiles to an assembler code like this (exemplary and schematically, actual code might look different); note that the arguments are passed from right to left:

``````push 1
push 2
push 3
push "%d%d%d"
call foobar
``````

Those push-operations fill up the stack:

``````              []   // empty stack
-------------------------------
push 1:       [1]
-------------------------------
push 2:       [1]
[2]
-------------------------------
push 3:       [1]
[2]
[3]  // there is now 1, 2, 3 in the stack
-------------------------------
push "%d%d%d":[1]
[2]
[3]
["%d%d%d"]
-------------------------------
call foobar   ...  // foobar uses the same stack!
``````

The bottom stack element is called the "Top of Stack", often abbreviated "TOS".

The `foobar` function would now access the stack, beginning at the TOS, i.e. the format string, which as you remember was pushed last. Imagine `stack` is your stack pointer , `stack[0]` is the value at the TOS, `stack[1]` is one above the TOS, and so forth:

``````format_string <- stack[0]
``````

... and then parses the format-string. While parsing, it recognozies the `%d`-tokens, and for each, loads one more value from the stack:

``````format_string <- stack[0]
offset <- 1
while (parsing):
token = tokenize_one_more(format_string)
if (needs_integer (token)):
value <- stack[offset]
offset = offset + 1
...
``````

This is of course a very incomplete pseudo-code that demonstrates how the function has to rely on the arguments passed to find out how much it has to load and remove from the stack.

## Security

This reliance on user-provided arguments is also one of the biggest security issues present (see https://cwe.mitre.org/top25/). Users may easily use a variadic function wrongly, either because they did not read the documentation, or forgot to adjust the format string or argument list, or because they are plain evil, or whatever. See also Format String Attack.

## C Implementation

In C and C++, variadic functions are used together with the `va_list` interface. While the pushing onto the stack is intrinsic to those languages (in K+R C you could even forward-declare a function without stating its arguments, but still call it with any number and kind arguments), reading from such an unknown argument list is interfaced through the `va_...`-macros and `va_list`-type, which basically abstracts the low-level stack-frame access.

Friday, November 11, 2022
2

If you don't want to use javascript, you can handle it via php. Take a look at this lib: http://code.google.com/p/php-mobile-detect/. And then you could do something like:

``````<?php
include 'Mobile_Detect.php';
\$detect = new Mobile_Detect();

if (\$detect->isMobile()) {
exit(0);
}
``````
Friday, October 21, 2022
1

Cookies are not the way to transfer variables between client and server. you should append key/variables pairs to your request URL using either a get (querystring) or post method.

jQuery ajax example;

``````\$.get('http://www.myphpserver.com/script.php?row_id=' + NewCookieValue);
``````
Monday, October 31, 2022