  • There's no right or wrong. I'm talking about optimising by minimising calculations. Using the median and mean values (or the floor/ceil of those) is going to be a huge optimisation over the naive algorithm of calculating every possible answer (a rough sketch follows at the end of this comment).

    I can't speak to the Scala solution (I haven't even looked at it), as I've no idea how its memoization works.

    But, at some point, pre-calculating the values using simple addition *may* be faster than repeated multiply/add/shift. It all depends on how sparse the data is. I'd have to play with a dataset that is orders of magnitude larger to get a feel for it, and it's just not worth it really. The "trick" with this day was using the values near the mean/median.
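
    For concreteness, a minimal C sketch of the mean/median shortcut, assuming this is the alignment-style puzzle where part 1 uses the plain distance and part 2 uses the triangular cost; the input array and names are illustrative only.

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
    }

    /* Linear cost: sum of |x - target|. Minimised at the median. */
    static long linear_cost(const int *xs, size_t n, int target)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += labs((long)xs[i] - target);
        return total;
    }

    /* Triangular cost: sum of d*(d+1)/2 with d = |x - target|. Minimised near the mean. */
    static long triangular_cost(const int *xs, size_t n, int target)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            long d = labs((long)xs[i] - target);
            total += d * (d + 1) / 2;
        }
        return total;
    }

    int main(void)
    {
        int xs[] = {16, 1, 2, 0, 4, 2, 7, 1, 2, 14};   /* illustrative positions */
        size_t n = sizeof xs / sizeof xs[0];

        /* Part 1: sort and take the median, no search needed. */
        qsort(xs, n, sizeof xs[0], cmp_int);
        printf("part 1: %ld\n", linear_cost(xs, n, xs[n / 2]));

        /* Part 2: the optimum sits within one unit of the mean, so check floor and ceil. */
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += xs[i];
        int lo = (int)(sum / (long)n);                 /* floor of the mean */
        long a = triangular_cost(xs, n, lo);
        long b = triangular_cost(xs, n, lo + 1);       /* ceil of the mean */
        printf("part 2: %ld\n", a < b ? a : b);
        return 0;
    }
    ```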

  • > But, at some point, pre-calculating the values using simple addition may be faster than repeated multiply/add/shift. It all depends on how sparse the data is. I'd have to start to play with a dataset that is orders of magnitude greater in size to have an idea, and it's just not worth it really. The "trick" with this day was using the values near the mean/median.

    tl;dr: I think the LUT approach is always likely to be slower.

    Assuming you did the fast approach, calculating the result is most likely dominated by streaming the data from memory, which you have to do twice. The inner loop then needs one register for the accumulator and one for the current value. You can save the cost of the division by pulling it out of the sum, so the inner loop does one abs, one multiply, and two adds per entry (if using the Gauss sum formula); that's two to four clock cycles on a modern x86 (I'm a bit unsure about the abs).

    In contrast, the LUT approach does one add and one load per entry (which we would hope comes from L1, so a latency of around 5–7 cycles), assuming you map both positive and negative offsets into your LUT. So I think it's close, but you would probably not benefit from a LUT in this problem. Skipping the LUT also leaves the whole cache hierarchy free for streaming the data, so you should be able to reach the load bandwidth, which is up to two loads per cycle.
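
    To make the comparison concrete, a rough C sketch of the two inner loops as I read them; the function names, the LUT layout, and the signed-offset indexing are illustrative assumptions, not anyone's actual code.

    ```c
    #include <stdlib.h>

    /* Arithmetic loop: one abs, one multiply, two adds per entry. The division
     * by 2 in the triangular cost d*(d+1)/2 is pulled out of the sum. */
    static long cost_arithmetic(const int *xs, size_t n, int target)
    {
        long doubled = 0;
        for (size_t i = 0; i < n; i++) {
            long d = labs((long)xs[i] - target);
            doubled += d * (d + 1);      /* 2 * triangular(d) */
        }
        return doubled / 2;              /* one division, outside the loop */
    }

    /* LUT loop: one add and one load per entry. The table covers offsets in
     * [-max_offset, max_offset], so both signs index it without an abs. */
    static long cost_lut(const int *xs, size_t n, int target,
                         const long *lut, int max_offset)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += lut[xs[i] - target + max_offset];
        return total;
    }

    /* Filling the table once up front; whether this pays off depends on how
     * often it is reused and whether it stays resident in L1. */
    static long *build_lut(int max_offset)
    {
        long *lut = malloc((2 * (size_t)max_offset + 1) * sizeof *lut);
        for (int off = -max_offset; off <= max_offset; off++) {
            long d = labs((long)off);
            lut[off + max_offset] = d * (d + 1) / 2;
        }
        return lut;
    }
    ```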

    Finally, you can only reach the memory-bandwidth limit with vector loads; with a LUT you'd get data-dependent divergence across the vector lanes in the inner loop, which I think would slow you down as well.

    That said, I haven’t measured anything…
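
    If someone did want to measure it, a throwaway harness along these lines would do; the array size, value range, and target are arbitrary placeholders.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        size_t n = 1u << 24;                         /* far larger than the puzzle input */
        int *xs = malloc(n * sizeof *xs);
        for (size_t i = 0; i < n; i++)
            xs[i] = rand() % 2000;                   /* arbitrary position range */

        struct timespec t0, t1;
        long doubled = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++) {             /* the arithmetic inner loop */
            long d = labs((long)xs[i] - 1000);
            doubled += d * (d + 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("cost %ld in %.3f s (%.2f GB/s streamed)\n",
               doubled / 2, secs, n * sizeof *xs / secs / 1e9);
        free(xs);
        return 0;
    }
    ```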
