
I need to find the closest match between two byte arrays by Euclidean distance as fast as possible.

This is the code I have tested so far:

byte[] hash1 = new byte[200];
byte[] hash2 = new byte[200];

int distanceSquared = 0;
int diff;

for (int i = 0; i < 200; i++)
{
    diff = hash1[i] - hash2[i];
    distanceSquared += diff * diff;                
}

Can I speed up this code somehow?

Dmitry Bychenko
andy.e
  • *This* code, with two 200-element arrays, [probably not](https://stackoverflow.com/a/12407655/11683). With much bigger arrays, probably yes with `Parallel.For`. – GSerg May 17 '19 at 08:42
  • Possible duplicate of [calculate the Euclidean distance between an array in c# with function](https://stackoverflow.com/questions/34698649/calculate-the-euclidean-distance-between-an-array-in-c-sharp-with-function) – Victor Manuel May 17 '19 at 08:44
  • @VictorManuel Not really, no. The question is not how to calculate the distance, the question is how to do it faster. – GSerg May 17 '19 at 08:47
  • The only thing I see is maybe you could use short instead of int for the diff variable but this is an extremely small improvement. – Joelius May 17 '19 at 08:48
  • This code doesn't actually compare anything... shouldn't it compare and keep the diffs, rather than adding to distanceSquared? – Nyerguds May 17 '19 at 08:48
  • @GSerg I see the answer at my link as a fast way to calculate it, at least faster than the posted code. – Victor Manuel May 17 '19 at 08:51
  • Why do you care? Did you encounter any grave performance problems, or are you just hunting for nanoseconds? Remember: [premature optimization is the root of all evil](http://wiki.c2.com/?PrematureOptimization). – HimBromBeere May 17 '19 at 08:51
  • Also, if you are comparing 2 single-value arrays, rather than 2 sets of 2D or 3D coordinates, then the distance is always linear, and you don't need any square, just an absolute value of the diff. – Nyerguds May 17 '19 at 08:51
  • Use hash1.Length instead of 200 and the jitter optimizer can eliminate half of all the array bounds checks. Do that first, since it is simple and correct, and to ensure you are actually ahead and it is not the memory access that is the throttle. Next you can eliminate the other half by making it unsafe with the fixed keyword. Next you can use the System.Numerics.Vectors NuGet package to parallelize it with SIMD. Where to stop because it is already good enough is up to you when you don't post perf requirements. – Hans Passant May 17 '19 at 09:56
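
As Nyerguds hints, one algorithmic shortcut is available regardless of vectorization: since the goal is a nearest-match search, the square root is never needed, because squared distance ranks candidates identically to true Euclidean distance. A minimal language-agnostic sketch (Python here for brevity; `dist_sq` and `closest` are hypothetical helper names, not from the question):

```python
def dist_sq(a, b):
    # sum of squared element differences; no sqrt needed just to rank candidates
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest(target, candidates):
    # sqrt is monotonic on non-negative values, so the smallest squared
    # distance identifies the same winner as the smallest true distance
    return min(candidates, key=lambda c: dist_sq(target, c))
```

This is exactly what the question's loop already does (it accumulates `distanceSquared` and never takes a root), so the optimization opportunity is in the per-pair loop itself.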

2 Answers


You can vectorize this with System.Numerics.Vectors. The ugliest bit here is the need to "widen" from byte up through short to int to avoid overflow problems, but it works, running more than twice as fast:

Basic: 2313122, 58ms
Vectorized: 2313122, 18ms

code:

using System;
using System.Diagnostics;
using System.Numerics;
using System.Runtime.InteropServices;

static class Program
{

    static void Main()
    {
        int len = 200;
        byte[] hash1 = new byte[len];
        byte[] hash2 = new byte[len];

        var rand = new Random(123456);
        rand.NextBytes(hash1);
        rand.NextBytes(hash2);

        Run(nameof(Basic), Basic, hash1, hash2);
        Run(nameof(Vectorized), Vectorized, hash1, hash2);
    }

    static void Run(string caption, Func<byte[], byte[], int> func, byte[] x, byte[] y, int repeat = 500000)
    {
        var timer = Stopwatch.StartNew();
        int result = 0;
        for (int i = 0; i < repeat; i++)
        {
            result = func(x, y);
        }
        timer.Stop();
        Console.WriteLine($"{caption}: {result}, {timer.ElapsedMilliseconds}ms");
    }

    static int Basic(byte[] hash1, byte[] hash2)
    {
        int distanceSquared = 0;
        for (int i = 0; i < hash1.Length; i++)
        {
            var diff = hash1[i] - hash2[i];
            distanceSquared += diff * diff;
        }
        return distanceSquared;
    }
    static int Vectorized(byte[] hash1, byte[] hash2)
    {
        int start, distanceSquared;
        if (Vector.IsHardwareAccelerated)
        {
            var sum = Vector<int>.Zero;
            var vec1 = MemoryMarshal.Cast<byte, Vector<byte>>(hash1);
            var vec2 = MemoryMarshal.Cast<byte, Vector<byte>>(hash2);

            for (int i = 0; i < vec1.Length; i++)
            {
                // widen and hard cast needed here to avoid overflow problems
                Vector.Widen(vec1[i], out var l1, out var r1);
                Vector.Widen(vec2[i], out var l2, out var r2);
                Vector<short> lt1 = Vector.AsVectorInt16(l1), rt1 = Vector.AsVectorInt16(r1);
                Vector<short> lt2 = Vector.AsVectorInt16(l2), rt2 = Vector.AsVectorInt16(r2);
                Vector.Widen(lt1 - lt2, out var dl1, out var dl2);
                Vector.Widen(rt1 - rt2, out var dr1, out var dr2);
                sum += (dl1 * dl1) + (dl2 * dl2) + (dr1 * dr1) + (dr2 * dr2);
            }
            start = vec1.Length * Vector<byte>.Count;
            distanceSquared = 0;
            for (int i = 0; i < Vector<int>.Count; i++)
                distanceSquared += sum[i];
        }
        else
        {
            start = distanceSquared = 0;
        }
        for (int i = start; i < hash1.Length; i++)
        {
            var diff = hash1[i] - hash2[i];
            distanceSquared += diff * diff;
        }
        return distanceSquared;
    }
}
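
As a side note on why `Vectorized` widens byte → short → int rather than squaring in place: a quick check of the worst-case magnitudes (Python here just for the arithmetic; the 200-element length is from the question):

```python
BYTE_MAX = 255
SHORT_MAX = 32767
INT_MAX = 2**31 - 1

max_diff = BYTE_MAX - 0      # 255: fits a 16-bit lane, so byte -> short is safe
max_sq = max_diff ** 2       # 65025: exceeds a short, so squares need 32 bits
max_total = 200 * max_sq     # 13_005_000 for 200 elements: fits an int

assert max_diff <= SHORT_MAX
assert max_sq > SHORT_MAX
assert max_total <= INT_MAX
```

So the subtraction can happen in 16-bit lanes, but the multiply-and-accumulate must happen in 32-bit lanes, which is precisely the structure of the two `Vector.Widen` steps above.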
Marc Gravell
  • Side note: it would be worth trying `distanceSquared += (4 of Vector.Dot)` instead of the vectorized multiply and sum; edit: tried that, it gets slower (34ms vs 18ms for the code shown above) – Marc Gravell May 17 '19 at 11:06
  • This looks very promising. Thank you very much for the help. I will get back to you when I have tested the solution. – andy.e May 17 '19 at 13:16
  • Thanks for the implementation, maybe you can use (if you have enough time) BenchmarkDotNet to compare both basic and vectorized versions. – ganchito55 May 18 '19 at 13:03
  • @ganchito55 feel free to :) I think I'm done on this one for today, though – Marc Gravell May 18 '19 at 19:59
  • Thanks, I confirmed the results with BenchmarkDotNet as well:

    | Method     | Mean      | Error     | StdDev    |
    |----------- |----------:|----------:|----------:|
    | Basic      | 166.94 ns | 0.8516 ns | 0.7549 ns |
    | Vectorized | 65.08 ns  | 0.2805 ns | 0.2487 ns |

    – andy.e May 20 '19 at 14:49

If you use .NET Core 3 (currently in preview, but close to RC), you can use hardware intrinsics to speed up your calculation. For example, Microsoft uses them to accelerate machine learning operations.

You can perform the operation `diff = hash1[i] - hash2[i];` with the VPSUBB hardware instruction, and replace `distanceSquared += diff * diff;` with the PMADDUBSW hardware instruction.

This should be the fastest way; it may also be worth investigating other hardware instructions. I hope this helps.
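
One caveat, echoing Marc Gravell's comment below: PMADDUBSW multiplies unsigned bytes by signed bytes and sums adjacent product pairs with saturation into signed 16-bit lanes, and the magnitudes here can blow past that range. Rough numbers (Python just for the arithmetic; the instruction semantics are as documented by Intel):

```python
INT16_MAX = 32767

# worst case for one pmaddubsw lane: two adjacent products of
# 255 (max unsigned byte) * 127 (max signed byte)
worst_pair = 255 * 127 + 255 * 127   # past the signed 16-bit range -> saturates

assert worst_pair > INT16_MAX

# and a single squared byte difference already needs more than 16 bits:
assert 255 * 255 > INT16_MAX
```

So an intrinsics version would still need a widening step before accumulating, much like the System.Numerics.Vectors answer above.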

ganchito55
  • Got it working \o/ https://stackoverflow.com/a/56184837/23354 – note, you need to be careful about the issue of overflow when dealing with multiplying bytes and summing as integers – Marc Gravell May 17 '19 at 10:45
  • @MarcGravell thanks for your comment, I didn't know that the ``System.Numerics`` namespace had support for vectorized operations. Kudos – ganchito55 May 18 '19 at 13:05
  • Only in the .Vectors package, but yeah. It works well. – Marc Gravell May 18 '19 at 20:00