Viewed   55 times

I am a PHP developer and I have always thought that micro-optimizations are not worth the time. If you really need that extra performance, you would either write your software so that it's architecturally faster, or you write a C++ extension to handle slow tasks (or better yet, compile the code using HipHop). However, today a work mate told me that there is a big difference in



$array === (array) $array

and I was like "eh, that's a pointless comparison really", but he wouldn't agree with me.. and he is the best developer in our company and is taking charge of a website that does about 50 million SQL queries per day -- for instance. So, I am wondering here: could he be wrong or is micro-optimization really worth the time and when?



Micro-optimisation is worth it when you have evidence that you're optimising a bottleneck.

Usually it's not worth it - write the most readable code you can, and use realistic benchmarks to check the performance. If and when you find you've got a bottleneck, micro-optimise just that bit of code (measuring as you go). Sometimes a small amount of micro-optimisation can make a huge difference.

But don't micro-optimise all your code... it will end up being far harder to maintain, and you'll quite possibly find you've either missed the real bottleneck, or that your micro-optimisations are harming performance instead of helping.

Monday, December 26, 2022

This is a very interesting question that doesn't appear to have a great answer. I did some very unscientific bench-marking and I was not able to get any faster than in_array for a $haystack with 100000 elements.

PHP 5.5.9-1ubuntu4.14 (cli) (built: Oct 28 2015 01:34:46) 
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
    with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies
    with Xdebug v2.2.3, Copyright (c) 2002-2013, by Derick Rethans

Sorting Time*:    0.19367408752441
Imploding Time**: 0.0207359790802
preg_match:       0.10927486419678
needle ===:       0.083639144897461
in_array:         0.019428968429565
array_flip:       0.028955936431885
array_intersect:  0.15198707580566
array_diff:       0.15532493591309

//*sort without search (binary search wouldn't add much time)
//**time it took to implode the array 
//     (no search was performed, this search WOULD take significant time if implemented)

As you can see, only three of these methods took less than 100ms, needle ===, in_array and array_flip. And out of these three, in_array was clearly the fastest. Now the question is how many postfix-es do you have? The running time on in_array will be O(n*m) (n is size of your haystack, m is the number of postfixes), which is a problem if m is also very large. If m is significantly large, sorting the data once and performing a binary search on the sorted list will be O(m*log(n)) which grows much slower, but has a higher initial overhead as shown in the sorting time above. Even better, if you have a very large m would probably be array_flip as each search should only take O(1) lookup after the initial flip.


Haystack creation

$haystack = array();

function getRandomWord($len = 10) {
        $len = rand(3,10);
        $word = array_merge(range('a', 'z'), range('A', 'Z'));
            return substr(implode($word), 0, $len);

$numWords = 100000;
for($i = 0; $i < $numWords; $i++) {
    $haystack[] = getRandomWord();


$copy = $haystack;

$copy = implode($haystack, " ");

function preg_match_test($regex, $haystack) {
    $matches = false;
    foreach($haystack as $value) {
        if (preg_match($regex, $value)) {
            $matches = true;
    return $matches;

//needle ===
function equalsNeedle($needles, $haystack) {
    $matches = false;
    foreach ($haystack as $value) {
        foreach($needles as $needle) {
            if ($needle === $value) {
                $matches = true;
                break 2;
    return $matches;

function baseCase($needles, $haystack) {
    $matches = false;
    foreach($needles as $needle) {
        if (in_array($needle, $haystack)) {
            $matches = true;
    return $matches;

function arrayFlipping($needles, $haystack) {
    $matches = false;
    $copy = array_flip($haystack);
    foreach ($needles as $needle) {
        if (array_key_exists($needle, $copy)) {
            $matches = true;
    return $matches;

function arrayIntersect($needles, $haystack) {
    if (count(array_intersect($needles, $haystack)) > 0) {
        return true;
    return false;

function arrayDiff($needles, $haystack) {
    if (count(array_diff($needles, $haystack)) !== count($needles)) {
        return true;
    return false;

Calling Code

$array = array("foo","foobar","foobazz","foobuzz");
$base = "foo";
$regex = "/^$base(bizz|bazz|buzz|)$/";

echo "preg_match: ";
preg_match_test($regex, $haystack);
echo "needle === ";
equalsNeedle($array, $haystack);
echo "in_array:  ";
baseCase($array, $haystack);
echo "array_flip:  ";
arrayFlipping($array, $haystack);
echo "array_intersect:  ";
arrayIntersect($array, $haystack);
echo "array_diff:  ";
arrayDiff($array, $haystack);

All tests were wrapped with timing code using microtime(true).

Sunday, August 28, 2022

OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was this. The sequence of changes was this:

  • The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.

  • Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.

  • Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.

Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.

Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.

  • That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.

Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.

  • More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.

  • Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.

  • Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds.

  • Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.

Total speedup factor: 43.6

Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.

P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.

I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.

ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:

if (ptop->current_task >= ILST_LENGTH(ptop->tasklist){
. . .
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
ptask = ILST_NTH(ptop->tasklist, ptop->current_task)

These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.

Here is the second problem, in two separate lines:

ILST_APPEND(ptop->tasklist, ptask)
. . .
ILST_APPEND(trnque, ptrn)

These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.

I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.

REFERENCE ADDED: The source code, both original and redesigned, can be found in, for 1993, in file, files slug.asc and

EDIT 2011/11/26: There is now a SourceForge project containing source code in Visual C++ and a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but still gets a 2-3 order of magnitude speedup.

Monday, December 5, 2022

This should be possible at about 8 elements (1 AVX2 vector) per 2.5 clock cycles or so (per core) on a modern x86-64 like Skylake or Zen 2, using AVX2. Or per 2 clocks with unrolling. Or on your Piledriver CPU, maybe 1x 16-byte vector of indexes per 3 clocks with AVX1 _mm_cmpeq_epi32.

The general strategy works with 2 to 8 buckets. And for byte, 16-bit, or 32-bit elements. (So byte elements gives you 32 elements histogrammed per 2 clock cycles best case, with a bit of outer-loop overhead to collect byte counters before they overflow.)

Update: or mapping an int to 1UL << (array[i]*8) to increment one of 4 bytes of a counter with SIMD / SWAR addition, we can go close to 1 clock per vector of 8 int on SKL, or per 2 clocks on Zen2. (This is even more specific to 4 or fewer buckets, and int input, and doesn't scale down to SSE2. It needs variable-shifts or at least AVX1 variable-shuffles.) Using byte elements with the first strategy is probably still better in terms of elements per cycle.

As @JonasH points out, you could have different cores working on different parts of the input array. A single core can come close to saturating memory bandwidth on typical desktops, but many-core Xeons have lower per-core memory bandwidth and higher aggregate, and need more cores to saturate L3 or DRAM bandwidth. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

A loop that runs for days at a time.

On a single input list that's very very slow to iterate so it still doesn't overflow int counters? Or repeated calls with different large Lists (like your ~900k test array)?

I believe I want to avoid increasing an index for a list or array as it seem to consume a lot of time?

That's probably because you were benchmarking with optimization disabled. Don't do that, it's not meaningful at all; different code is slowed down different amounts by disabling optimization. More explicit steps and tmp vars can often make slower debug-mode code because there are more things that have to be there to look at with a debugger. But they can just optimize into a normal pointer-increment loop when you compile with normal optimization.

Iterating through an array can compile efficiently into asm.

The slow part is the dependency chain through memory for incrementing a variable index of the array. For example on a Skylake CPU, memory-destination add with the same address repeatedly bottlenecks at about one increment per 6 clock cycles because the next add has to wait to load the value stored by the previous one. (Store-forwarding from the store buffer means it doesn't have to wait for it to commit to cache first, but it's still much slower than add into a register.) See also Agner Fog's optimization guides:

With counts only distributed across 4 buckets, you'll have a lot of cases where instructions are waiting to reload the data stored by another recent instruction, so you can't even achieve the nearly 1 element per clock cycle you might if counts were well distributed over more counters that still were all hot in L1d cache.

One good solution to this problem is unrolling the loop with multiple arrays of counters. Methods to vectorise histogram in SIMD?. Like instead of int[] indexes = { 0, 0, 0, 0 }; you can make it a 2D array of four counters each. You'd have to manually unroll the loop in the source to iterate over the input array, and handle the last 0..3 left over elements after the unrolled part.

This is a good technique for small to medium arrays of counts, but becomes bad if replicating the counters starts to lead to cache misses.

Use narrow integers to save cache footprint / mem bandwidth.

Another thing you can/should do is use as narrow a type as possible for your arrays of 0..3 values: each number can fit in a byte so using 8-bit integers would save you a factor of 4 cache footprint / memory bandwidth.

x86 can efficiently load/store bytes into to/from full registers. With SSE4.1 you also have SIMD pmovzxbd to make it more efficient to auto-vectorize when you have a byte_array[i] used with an int_array[i] in a loop.

(When I say x86 I mean including x86-64, as opposed to ARM or PowerPC. Of course you don't actually want to compile 32-bit code, what Microsoft calls "x86")

With very small numbers of buckets, like 4

This looks like a job for SIMD compares. With x86 SSE2 the number of int elements per 16-byte vector of data is equal to your number of histogram bins.

You already had a SIMD sort of idea with trying to treat a number as four separate byte elements. See

But 00_01_10_11 is just source-level syntax for human-readable separators in numbers, and double is a floating-point type whose internal representation isn't the same as for integers. And you definitely don't want to use strings; SIMD lets you do stuff like operating on 4 elements of an integer array at once.

The best way I can see to approach this is to separately count matches for each of the 4 values, rather than map elements to counters. We want to process multiple elements in parallel but mapping them to counters can have collisions when there are repeated values in one vector of elements. You'd need to increment that counter twice.

The scalar equivalent of this is:

int counts[4] = {0,0,0,0};
for () {
    counts[0] += (arr[i] == 0);
    counts[1] += (arr[i] == 1);
    counts[2] += (arr[i] == 2);  // count matches
  //counts[3] += (arr[i] == 3);  // we assume any that aren't 0..2 are this
counts[3] = size - counts[0] - counts[1] - counts[2];
// calculate count 3 from other counts

which (in C++) GCC -O3 will actually auto-vectorize exactly the way I did manually below: Clang even unrolls it when auto-vectorizing, so it should be better than my hand-vectorized version for int inputs. Still not as good as the alternate vpermilps strategy for that case, though.

(And you do still need to manually vectorize if you want byte elements with efficient narrow sums, only widening in an outer loop.)

With byte elements, see How to count character occurrences using SIMD. The element size is too narrow for a counter; it would overflow after 256 counts. So you have to widen either in the inner loop, or use nested loops to do some accumulating before widening.

I don't know C#, so I could write the code in x86 assembly or in C++ with intrinsics. Perhaps C++ intrinsics is more useful for you. C# has some kind of vector extensions that should make it possible to port this.

This is C++ for x86-64, using AVX2 SIMD intrinsics. See https://.com/tags/sse/info for some info.

// Manually vectorized for AVX2, for int element size
// Going nearly 4x as fast should be possible for byte element size

#include <immintrin.h>

void count_elements_avx2(const std::vector<int> &input,  unsigned output_counts[4])
    __m256i  counts[4] = { _mm256_setzero_si256() };  // 4 vectors of zeroed counters
                  // each vector holds counts for one bucket, to be hsummed at the end

    size_t size = input.size();
    for(size_t i = 0 ; i<size ; i+=8) {  // 8x 32-bit elements per vector
        __m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);  // unaligned load of 8 ints
        for (int val = 0 ; val < 3; val++) {
           // C++ compilers will unroll this with 3 vector constants and no memory access
            __m256i match = _mm256_cmpeq_epi32(v, _mm256_set1_epi32(val));  // 0 or all-ones aka -1
            counts[val] = _mm256_sub_epi32(counts[val], match);   // x -= -1 or 0 conditional increment

    // transpose and sum 4 vectors of 8 elements down to 1 vector of 4 elements
    __m128i summed_counts = hsum_xpose(counts);   // helper function defined in Godbolt link
    _mm_storeu_si128((__m128i*)output_counts, summed_counts);

    output_counts[3] = size - output_counts[0]
                       - output_counts[1] - output_counts[2];

    // TODO: handle the last size%8 input elements; scalar would be easy

This compiles nicely with clang (on the Godbolt compiler explorer). Presumably you can write C# that compiles to similar machine code. If not, consider calling native code from a C++ compiler (or hand-written in asm if you can't get truly optimal code from the compiler). If your real use-case runs as many iterations as your benchmark, that could amortize the extra overhead if the input array doesn't have to get copied.

 # from an earlier version of the C++, doing all 4 compares in the inner loop
 # clang -O3 -march=skylake
.LBB0_2:                                     # do {
    vmovdqu ymm7, ymmword ptr [rcx + 4*rdx]    # v = load arr[i + 0..7]
    vpcmpeqd        ymm8, ymm7, ymm3           # compare v == 0
    vpsubd  ymm4, ymm4, ymm8                   # total0 -= cmp_result
    vpcmpeqd        ymm8, ymm7, ymm5
    vpsubd  ymm2, ymm2, ymm8
    vpcmpeqd        ymm7, ymm7, ymm6           # compare v == 2
    vpsubd  ymm1, ymm1, ymm7                   # total2 -= cmp_result
    add     rdx, 8                             # i += 8
    cmp     rdx, rax
    jb      .LBB0_2                          # }while(i < size)

Estimated best-case Skylake performance: ~2.5 cycles per vector (8 int or 32 int8_t)

Or 2 with unrolling.

Without AVX2, using only SSE2, you'd have some extra movdqa instructions and only be doing 4 elements per vector. This would still be a win vs. scalar histogram in memory, though. Even 1 element / clock is nice, and should be doable with SSE2 that can run on any x86-64 CPU.

Assuming no cache misses of course, with hardware prefetch into L1d staying ahead of the loop. This might only happen with the data already hot in L2 cache at least. I'm also assuming no stalls from memory alignment; ideally your data is aligned by 32 bytes. If it usually isn't, possibly worth processing the first unaligned part and then using aligned loads, if the array is large enough.

For byte elements, the inner-most loop will look similar (with vpcmpeqb and vpsubb but run only at most 255 (not 256) iterations before hsumming to 64-bit counters, to avoid overflow. So throughput per vector will be the same, but with 4x as many elements per vector.

See and for performance analysis details. e.g. vpcmpeqd on

The inner loop is only 9 fused-domain uops for Haswell/Skylake, so best case front-end bottleneck of about 1 iteration per 2.25 cycles (the pipeline is 4 uops wide). Small-loop effects get in the way somewhat: Is performance reduced when executing loops whose uop count is not a multiple of processor width? - Skylake has its loop buffer disabled by a microcode update for an erratum, but even before that a 9 uop loop ended up issuing slightly worse than one iter per 2.25 cycles on average, let's say 2.5 cycles.

Skylake runs vpsubd on ports 0,1, or 5, and runs vpcmpeqd on ports 0 or 1. So the back-end bottleneck on ports 0,1,5 is 6 vector ALU uops for 3 ports, or 1 iteration per 2 cycles. So the front-end bottleneck dominates. (Ice Lake's wider front-end may let it bottleneck on the back-end even without unrolling; same back-end throughputs there unless you use AVX512...)

If clang had indexed from the end of the array and counted the index up towards zero (since it chose to use an indexed addressing mode anyway) it could have saved a uop for a total of 8 uops = one iter per 2 cycles in the front-end, matching the back-end bottleneck. (Either way, scalar add and macro-fused cmp/jcc, or add/jcc loop branch can run on port 6, and the load doesn't compete for ALU ports.) Uop replays of ALU uops dependent on the load shouldn't be a problem even on cache misses, if ALU uops are the bottleneck there will normally be plenty of older uops just waiting for an execution unit to be ready, not waiting for load data.

Unrolling by 2 would have the same benefit: amortizing that 2 uops of loop overhead. So 16 uops for 2 input vectors. That's a nice multiple of the pipeline width on SKL and IceLake, and the single-uop pipeline width on Zen. Unrolling even more could let the front-end can stay ahead of execution, but with them even any back-end delays will let the front end build up a cushion of uops in the scheduler. This will let it execute loads early enough.

Zen2 has a wider front-end (6 uops or 5 instructions wide, IIUC). None of these instructions are multi-uop because Zen2 widened the vector ALUs to 256-bit, so that's 5 single-uop instructions. vpcmpeq* runs on FP 0,1, or 3, same as vpsubd, so the back-end bottleneck is the same as on Skylake: 1 vector per 2 cycles. But the wider front-end removes that bottleneck, leaving the critical path being the back-end even without unrolling.

Zen1 takes 2 uops per 256-bit vector operation (or more for lane-crossing, but these are simple 2 uop). So presumably 12/3 = 4 cycles per vector of 8 or 32 elements, assuming it can get those uops through the front-end efficiently.

I'm assuming that the 1-cycle latency dependency chains through the count vectors are scheduled well by the back-ends and don't result in many wasted cycles. Probably not a big deal, especially if you have any memory bottlenecks in real life. (On Piledriver, SIMD-integer operations have 2 cycle latency, but 6 ALU uops for 2 vector ALU ports that can run them is 1 vector (128-bit) per 3 cycles so even without unrolling there's enough work to hide that latency.)

I didn't analyze the horizontal-sum part of this. It's outside the loop so it only has to run once per call. You did tag this micro-optimization, but we probably don't need to worry about that part.

Other numbers of buckets

The base case of this strategy is 2 buckets: count matches for one thing, count_other = size - count.

We know that every element is one of these 4 possibilities, so we can assume that any x that isn't 0, 1, or 2 is a 3 without checking. This means we don't have to count matches for 3 at all, and can get the count for that bucket from size - sum(counts[0..2]).

(See the edit history for the above perf analysis before doing this optimizations. I changed the numbers after doing this optimization and updating the Godbolt link, hopefully I didn't miss anything.)

AVX512 on Skylake-Xeon

For 64-byte vectors there is no vpcmpeqd to make a vector of all-zero (0) or all-one (-1) elements. Instead you'd compare into a mask register and use that to do a merge-masked add of set1(1). Like c = _mm512_mask_add_epi32(c, _mm512_set1_epi32(1)).

Unfortunately it's not efficient to do a scalar popcount of the compare-result bitmasks.

Random code review: in your first benchmark:

int[] valueLIST = indexers.ToArray();

This seems pointless; According to MS's docs (, a List is efficiently indexable. I think it's equivalent to C++ std::vector<T>. You can just iterate it without copying to an array.

Alt strategy - map 0..3 to a set a bit in one byte of an int

Good if you can't narrow your elements to bytes for the input to save mem bandwidth.

But speaking of which, maybe worth it to use 2x _mm256_packs_epi32 (vpackssdw) and _mm256_packs_epi16 (vpacksswb) to narrow down to 8-bit integers before counting with 3x pcmpeqb / psubb. That costs 3 uops per 4 input vectors to pack down to 1 with byte elements.

But if your input has int elements to start with, this may be best instead of packing and then comparing 3 ways.

You have 4 buckets, and an int has 4 bytes. If we can transform each int element to a 1 at the bottom of the appropriate byte, that would let us add with _mm256_add_epi8 for up to 255 inner-loop iterations before widening to 64-bit counters. (With the standard _mm256_sad_epu8 against zero trick to hsum unsigned bytes without overflow.)

There are 2 ways to do this. The first: use a shuffle as a lookup table. AVX2 vpermd works (_mm256_permutexvar_epi32) using the data as the index vector and a constant _mm256_set_epi32(0,0,0,0, 1UL<<24, 1UL<<16, 1UL<<8, 1UL<<0) as the data being shuffled. Or type-pun the vector to use AVX1 vpermilps as a LUT with the LUT vector having those bytes in the upper half as well.

vpermilps is better: it's fewer uops on AMD Zen 1, and lower latency everywhere because it's in-lane. (Might cause a bypass delay on some CPUs, cutting into the latency benefit, but still not worse than vpermd).

For some reason vpermilps with a vector control has 2 cycle throughput on Zen2 even though it's still a single uop. Or 4 cycles on Zen1 (for the 2 uop YMM version). It's 1 cycle on Intel. vpermd is even worse on AMD: more uops and same poor throughput.

vpermilps xmm (16-byte vector) on Piledriver has 1/clock throughput according to Agner Fog's testing, and runs in the "ivec" domain. (So it actually has extra bypass delay latency when used on the "intended" floating point operands, but not on integer).

   // Or for Piledriver, __m128 version of this

    __m256 bytepatterns = _mm256_casts256_ps(_mm256_set_epi32(
         1<<24, 1<<16, 1<<8, 1<<0,
         1<<24, 1<<16, 1<<8, 1<<0) );
    __m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);
    v = _mm256_castps_si256(_mm256_permutevar_ps(bytepatterns, v));  // vpermilps 32-bit variable shuffle
    counts = _mm256_add_epi8(counts, v);

     // after some inner iterations, separate out the 
     // set1_epi32(0x000000ff) counts, 0x0000ff00 counts, etc.

This will produce interleaved counters inside each int element. They will overflow if you don't accumulate them before 256 counts. See How to count character occurrences using SIMD for a simple version of that with a single counter.

Here we might unroll and use 2 different LUT vectors so when we want to group all the counts for 0 together, we could blend 2 vectors together and mask away the others.

Alternatively to shuffling, we can do this with AVX2 variable shifts.

sums += 1UL << (array[i]*8); where the *8 is the number of bits in a byte, also done with a shift. I wrote it as a scalar C++ expression because now's your chance to see how your bytes-in-an-integer idea can really work. As long as we don't let an individual byte overflow, it doesn't matter if SIMD bytes adds block carry between bytes or if we use 32-bit dword elements.

We'd do this with AVX2 as:

__m256i v = loadu...();
v = _mm256_slli_epi32(v, 3);  // v *= 8
v = _mm256_sllv_epi32(_mm256_set1_epi32(1), v);
counts = _mm256_add_epi8(counts, v);

This is 2 shift instructions plus the vpaddb. On Skylake the variable-count shifts vpsllvd is cheap: single-uop and runs on multiple ports. But on Haswell and Zen it's slower. (Same throughput as vpermilps on AMD)

And 2 uops for 2 ports still doesn't beat 1 uop for 1 port for the shuffle version. (Unless you use both strategies alternating to distribute the work over all ALU ports on SKL.)

So either way the inner-most loop can go 1 vector per clock or maybe slightly better with careful interleaving of shift vs. shuffle methods.

But it will require some small amount of overhead amortized over 128 or 255 inner loop iterations.

That cleanup at the end might blend 2 vectors together to get a vector with counts for just 2 buckets, then vpshufb (_mm256_shuffle_epi8) to group byte counters for the same bucket into the same qwords. Then vpsadbw (_mm256_sad_epu8) against zero can horizontal sum those byte elements within each qword for _mm256_add_epi64. So outer-loop work should be 2 vpblendw, 2x vpshufb, 2x vpsadbw, 2x vpaddq then back into another 255 iterations of the inner loop. Probably also checking if you're within 255 iterations of the end of the array to set the loop bound for the inner iteration.

Friday, November 4, 2022

There are (at least) two categories of "efficiency" to mention here:

  • UI applications (and their dependencies), where the most important measure is the response time to the user.

  • Batch processing, where the main indicator is total running time.

In the first case, there are well-documented rules about response times. If you care about product quality, you need to keep response times short. The shorter the better, of course, but the breaking points are about:

  • 100 ms for an "immediate" response; animation and other "real-time" activities need to happen at least this fast;

  • 1 second for an "uninterrupted" response. Any more than this and users will be frustrated; you also need to start think about showing a progress screen past this point.

  • 10 seconds for retaining user focus. Any worse than this and your users will be pissed off.

If you're finding that several operations are taking more than 10 seconds, and you can fix the performance problems with a sane amount of effort (I don't think there's a hard limit but personally I'd say definitely anything under 1 man-month and probably anything under 3-4 months), then you should definitely put the effort into fixing it.

Similarly, if you find yourself creeping past that 1-second threshold, you should be trying very hard to make it faster. At a minimum, compare the time it would take to improve the performance of your app with the time it would take to redo every slow screen with progress dialogs and background threads that the user can cancel - because it is your responsibility as a designer to provide that if the app is too slow.

But don't make a decision purely on that basis - the user experience matters too. If it'll take you 1 week to stick in some async progress dialogs and 3 weeks to get the running times under 1 second, I would still go with the latter. IMO, anything under a man-month is justifiable if the problem is application-wide; if it's just one report that's run relatively infrequently, I'd probably let it go.

If your application is real-time - graphics-related for example - then I would classify it the same way as the 10-second mark for non-realtime apps. That is, you need to make every effort possible to speed it up. Flickering is unacceptable in a game or in an image editor. Stutters and glitches are unacceptable in audio processing. Even for something as basic as text input, a 500 ms delay between the key being pressed and the character appearing is completely unacceptable unless you're connected via remote desktop or something. No amount of effort is too much for fixing these kinds of problems.

Now for the second case, which I think is mostly self-evident. If you're doing batch processing then you generally have a scalability concern. As long as the batch is able to run in the time allotted, you don't need to improve it. But if your data is growing, if the batch is supposed to run overnight and you start to see it creeping into the wee hours of the morning and interrupting people's work at 9:15 AM, then clearly you need to work on performance.

Actually, you really can't wait that long; once it fails to complete in the required time, you may already be in big trouble. You have to actively monitor the situation and maintain some sort of safety margin - say a maximum running time of 5 hours out of the available 6 before you start to worry.

So the answer for batch processes is obvious. You have a hard requirement that the bast must finish within a certain time. Therefore, if you are getting close to the edge, performance must be improved, regardless of how difficult/costly it is. The question then becomes what is the most economical means of improving the process?

If it costs significantly less to just throw some more hardware at the problem (and you know for a fact that the problem really does scale with hardware), then don't spend any time optimizing, just buy new hardware. Otherwise, figure out what combination of design optimization and hardware upgrades is going to get you the best ROI. It's almost purely a cost decision at this point.

That's about all I have to say on the subject. Shame on the people who respond to this with "YAGNI". It's your professional responsibility to know or at least find out whether or not you "need it." Assuming that anything is acceptable until customers complain is an abdication of this responsibility.

Simply because your customers don't demand it doesn't mean you don't need to consider it. Your customers don't demand unit tests, either, or even reasonably good/maintainable code, but you provide those things anyway because it is part of your profession. And at the end of the day, your customers will be a lot happier with a smooth, fast product than with any of those other developer-centric things.

Monday, August 29, 2022
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :