
After using PHP for a while now, I've noticed that not all built-in PHP functions are as fast as expected. Consider these two possible implementations of a function that checks whether a number is prime, using a cached array of primes.

//very slow for large $prime_array
$prime_array = array( 2, 3, 5, 7, 11, 13, /* .... */ 104729 /* , ... */ );
$result_array = array();
foreach( $prime_array as $number ) {
    $result_array[$number] = in_array( $number, $prime_array );
}

//speed is much less dependent on size of $prime_array, and runs much faster
$prime_array = array( 2 => NULL, 3 => NULL, 5 => NULL, 7 => NULL,
                      11 => NULL, 13 => NULL, /* .... */ 104729 => NULL /* , ... */ );
$result_array = array();
foreach( $prime_array as $number => $value ) {
    $result_array[$number] = array_key_exists( $number, $prime_array );
}

This is because in_array is implemented as a linear search, O(n), which slows down linearly as $prime_array grows, whereas array_key_exists is implemented as a hash lookup, O(1), which does not slow down unless the hash table becomes extremely populated (in which case it degrades to O(n)).

So far I've had to discover the big-O's via trial and error, and occasionally looking at the source code. Now for the question...

Is there a list of the theoretical (or practical) big O times for all* the built-in PHP functions?

*or at least the interesting ones

For example, I find it very hard to predict the big O of the following functions, because the possible implementations depend on core PHP data structures I don't know about: array_merge, array_merge_recursive, array_reverse, array_intersect, array_combine, str_replace (with array inputs), etc.

 Answers

1

Since it doesn't seem like anyone has done this before, I thought it'd be a good idea to have it for reference somewhere. I've gone through and, either via benchmarks or code-skimming, characterized the array_* functions. I've tried to put the more interesting Big-Os near the top. This list is not complete.

Note: all the Big-Os were calculated assuming a hash lookup is O(1), even though it's really O(n). The coefficient of the n is so low that the RAM overhead of storing a large enough array would hurt you before the lookup's Big-O characteristics would start taking effect. For example, the difference between a call to array_key_exists at N=1 and at N=1,000,000 is about a 50% time increase.

Interesting Points:

  1. isset/array_key_exists is much faster than in_array and array_search.
  2. + (union) is a bit faster than array_merge (and looks nicer), but it behaves differently, so keep that in mind (see the sketch after this list).
  3. shuffle is on the same Big-O tier as array_rand.
  4. array_pop/array_push is faster than array_shift/array_unshift because of the latter pair's re-index penalty.
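
To illustrate point 2, here is a minimal sketch of the behavioral difference (following PHP's documented semantics: + keeps the left-hand entries on key collisions, while array_merge renumbers integer keys and lets later string keys win):

$a = array( 'x' => 1, 0 => 'a' );
$b = array( 'x' => 2, 0 => 'b' );

var_dump( $a + $b );
//resulting array: ( 'x' => 1, 0 => 'a' ) - left operand wins, keys preserved

var_dump( array_merge( $a, $b ) );
//resulting array: ( 'x' => 2, 0 => 'a', 1 => 'b' ) - later string keys win,
//integer keys are renumbered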

Lookups:

array_key_exists O(n), but really close to O(1) - this is because of the linear scan through colliding entries, but since the chance of collisions is very small, the coefficient is also very small. I find treating hash lookups as O(1) gives a more realistic big-O. For example, the difference between N=1,000 and N=100,000 is only about a 50% slowdown.

isset( $array[$index] ) O(n), but really close to O(1) - it uses the same lookup as array_key_exists. Since it's a language construct, it will cache the lookup if the key is hardcoded, resulting in a speedup in cases where the same key is used repeatedly.

in_array O(n) - this is because it does a linear search through the array until it finds the value.

array_search O(n) - it uses the same core function as in_array, but returns the key of the first match instead of TRUE.
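
A practical consequence of the above: if a value list will be probed repeatedly, it can pay to flip it once into a key set and use the near-O(1) lookups from then on. A minimal sketch (variable names are mine):

$primes = array( 2, 3, 5, 7, 11, 13 );

//O(n) per test: linear search through the values
$slow = in_array( 104729, $primes );

//one-time O(n) flip, then hash lookups per test
$prime_set = array_flip( $primes );
$fast = isset( $prime_set[104729] );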

Queue functions:

array_push O(∑ var_i, for all i)

array_pop O(1)

array_shift O(n) - it has to reindex all the keys

array_unshift O(n + ∑ var_i, for all i) - it has to reindex all the keys
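
Side note (mine, not from the original list): if you actually need FIFO behavior, the SPL's SplQueue avoids array_shift's reindexing penalty entirely; a minimal sketch:

$queue = new SplQueue();

//enqueue at the tail: O(1), like array_push
$queue->enqueue( 'first' );
$queue->enqueue( 'second' );

//dequeue from the head: O(1), unlike array_shift's O(n) reindex
echo $queue->dequeue(); //prints "first"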

Array Intersection, Union, Subtraction:

array_intersect_key if intersection 100% do O(Max(param_i_size)*∑param_i_count, for all i), if intersection 0% intersect O(∑param_i_size, for all i)

array_intersect if intersection 100% do O(n^2*∑param_i_count, for all i), if intersection 0% intersect O(n^2)

array_intersect_assoc if intersection 100% do O(Max(param_i_size)*∑param_i_count, for all i), if intersection 0% intersect O(∑param_i_size, for all i)

array_diff O(∏ param_i_size, for all i) - that's the product of all the param_sizes

array_diff_key O(∑ param_i_size, for i != 1) - this is because we don't need to iterate over the first array

array_merge O(∑ array_i, i != 1) - it doesn't need to iterate over the first array

+ (union) O(n), where n is the size of the 2nd array (i.e. array_first + array_second) - less overhead than array_merge since it doesn't have to renumber

array_replace O(∑ array_i, for all i)
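
To see the gap between the key-based and value-based variants yourself, here's a rough micro-benchmark sketch in the spirit of the benchmark at the end of this answer (sizes are arbitrary; timings will vary):

$n = 100000;
$a = array_fill_keys( range( 0, $n - 1 ), NULL );
$b = array_fill_keys( range( (int)($n / 2), (int)($n * 1.5) - 1 ), NULL );

$start = microtime( TRUE );
array_intersect_key( $a, $b ); //hash lookup per key
printf( "array_intersect_key: %1.4fs\n", microtime( TRUE ) - $start );

$start = microtime( TRUE );
array_intersect( array_keys( $a ), array_keys( $b ) ); //value comparisons
printf( "array_intersect:     %1.4fs\n", microtime( TRUE ) - $start );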

Random:

shuffle O(n)

array_rand O(n) - Requires a linear poll.

Obvious Big-O:

array_fill O(n)

array_fill_keys O(n)

range O(n)

array_splice O(offset + length)

array_slice O(offset + length) or O(n) if length = NULL

array_keys O(n)

array_values O(n)

array_reverse O(n)

array_pad O(pad_size)

array_flip O(n)

array_sum O(n)

array_product O(n)

array_reduce O(n)

array_filter O(n)

array_map O(n)

array_chunk O(n)

array_combine O(n)

I'd like to thank Eureqa for making it easy to find the Big-O of the functions. It's an amazing free program that can find the best fitting function for arbitrary data.

EDIT:

For those who doubt that PHP array lookups are O(N), I've written a benchmark to test that (they are still effectively O(1) for most realistic values).

$tests = 1000000;
$max = 5000001;


for( $i = 1; $i <= $max; $i += 10000 ) {
    //create lookup array
    $array = array_fill( 0, $i, NULL );

    //build test indexes
    $test_indexes = array();
    for( $j = 0; $j < $tests; $j++ ) {
        $test_indexes[] = rand( 0, $i-1 );
    }

    //benchmark array lookups
    $start = microtime( TRUE );
    foreach( $test_indexes as $test_index ) {
        $value = $array[ $test_index ];
        unset( $value );
    }
    $stop = microtime( TRUE );
    unset( $array, $test_indexes, $test_index );

    printf( "%d,%1.15fn", $i, $stop - $start ); //time per 1mil lookups
    unset( $stop, $start );
}
Tuesday, September 20, 2022
4

Let C(n,k) be the cost of computing binom(n,k) that way, i.e. via the recursion binom(n,k) = binom(n-1,k-1) + binom(n-1,k), with

C(0,_) = 1
C(_,0) = 1

as base cases.

The recurrence is obviously

C(n,k) = 1 + C(n-1,k-1) + C(n-1,k)

if we take addition to have cost 1. Then, we can easily check that

C(n,k) = 2 * ∑_{j=0}^{k} binom(n,j) - 1

by induction.

So for k >= n, the cost is 2^(n+1) - 1; C(n,n-1) = 2^(n+1) - 3, C(n,1) = 2*n + 1, C(n,2) = n*(n+1) + 1 (and beyond that, I don't see neat formulae).

We have the obvious upper bound of

C(n,k) < 2^(n+1)

independent of k, and for k < n/2 we can coarsely estimate

C(n,k) <= 2*(k+1)*binom(n,k)

which gives a much smaller bound for small k, so overall

C(n,k) ∈ O(min{(k+1)*binom(n,min(k, n/2)), 2^n})

(need to clamp the k for the minimum, since binom(n,k) decreases back to 1 for k > n/2).
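
A quick way to sanity-check the closed form is to count the additions directly. A minimal sketch in PHP (the function names are mine; base cases as defined above):

//cost of the naive recursion: C(n,k) = 1 + C(n-1,k-1) + C(n-1,k),
//with C(0,_) = C(_,0) = 1
function cost( $n, $k ) {
    if ( $n === 0 || $k === 0 ) {
        return 1;
    }
    return 1 + cost( $n - 1, $k - 1 ) + cost( $n - 1, $k );
}

//binom via the same recursion (binom(n,j) = 0 for j > n)
function binom( $n, $j ) {
    if ( $j > $n ) return 0;
    if ( $j === 0 || $j === $n ) return 1;
    return binom( $n - 1, $j - 1 ) + binom( $n - 1, $j );
}

$n = 10; $k = 4;
$sum = 0;
for ( $j = 0; $j <= $k; $j++ ) {
    $sum += binom( $n, $j );
}
printf( "cost = %d, closed form = %d\n", cost( $n, $k ), 2 * $sum - 1 );
//both print 771 for n=10, k=4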

Wednesday, September 7, 2022
 
valex
 
5
  1. log^(5) x (the logarithm iterated five times) is the same as writing log log log log log x, which is a very slow-growing function of x.
  2. This is equivalent to 5 log x (rewriting exponentiation inside the log as multiplication outside), which is equivalent to log x.
  3. This is equivalent to log 6 + log log x, which is equivalent to log log x.
  4. This is just log log x.

So you have O(log log log log log x), O(log x), O(log log x) and O(log log x), three distinct Big-O classes.
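
For reference, the two rewrites doing the work above, written out (assuming the original expressions in items 2 and 3 were log(x^5) and log(log(x^6)); the base of the log is irrelevant inside Big-O):

log(x^5) = 5 * log x, which is Θ(log x)
log(log(x^6)) = log(6 * log x) = log 6 + log log x, which is Θ(log log x)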

If your interviewer said 3 and 4 were different, either he was mistaken or you've misremembered the question (happens all the time).

Saturday, August 6, 2022
 
talmihr
 
4

Algorithms with running time O(2^N) are often recursive algorithms that solve a problem of size N by recursively solving two smaller problems of size N-1.

This program, for instance, prints out all the moves necessary to solve the famous "Towers of Hanoi" problem for N disks:

function solve_hanoi( $n, $from_peg, $to_peg, $spare_peg )
{
    if ( $n < 1 ) {
        return;
    }
    if ( $n > 1 ) {
        //move the top N-1 disks out of the way, onto the spare peg
        solve_hanoi( $n - 1, $from_peg, $spare_peg, $to_peg );
    }
    //move the largest remaining disk to its destination
    echo "move from " . $from_peg . " to " . $to_peg . "\n";
    if ( $n > 1 ) {
        //move the N-1 smaller disks from the spare peg onto it
        solve_hanoi( $n - 1, $spare_peg, $to_peg, $from_peg );
    }
}
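
For example (my own call, not part of the original answer), three disks produce 2^3 - 1 = 7 moves:

solve_hanoi( 3, "A", "C", "B" );
//move from A to C
//move from A to B
//move from C to B
//move from A to C
//move from B to A
//move from B to C
//move from A to C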

Let T(N) be the time it takes for N disks.

We have:

T(1) = O(1)
and
T(N) = O(1) + 2*T(N-1) when N>1

If you repeatedly expand the last term, you get:

T(N) = 3*O(1) + 4*T(N-2)
T(N) = 7*O(1) + 8*T(N-3)
...
T(N) = (2^(N-1)-1)*O(1) + (2^(N-1))*T(1)
T(N) = (2^N - 1)*O(1)
T(N) = O(2^N)

To actually figure this out, you just have to know that certain patterns in recurrence relations lead to exponential results. Generally, T(N) = ... + C*T(N-1) with C > 1 means O(C^N). See:

https://en.wikipedia.org/wiki/Recurrence_relation

Thursday, September 15, 2022
 
fdreger
 
3

The complexity is only relevant if there is a variable N. So the question makes no sense as asked. If the question had been:

"A much slower algorithm would consist of testing every possible value in a range of N values, which would be about N times slower on average. Is the second algorithm O(1) or O(N)?"

then the answer would be: that second algorithm is O(N).

Tuesday, November 15, 2022