fastxor

go get github.com/lukechampine/fastxor

Is there a gaping hole in your heart that can only be filled by xor'ing byte streams at 20GB/s? If so, you've come to the right place.

fastxor is exactly what it sounds like: a package that xors bytes as fast as your CPU is capable of. For best results, use a CPU that supports a SIMD instruction set like SSE or AVX. On other architectures, performance is much less impressive, but still faster than a naive byte-wise loop.

I wrote this package to try my hand at writing Go assembly, so please scrutinize my code and let me know how I could make it faster or cleaner!

Benchmarks

AVX:

BenchmarkBytes/16-4   	200000000	         8.72 ns/op	 1835.82 MB/s
BenchmarkBytes/1024-4 	 50000000	        38.1 ns/op	26850.41 MB/s
BenchmarkBytes/65k-4  	   500000	      2738 ns/op	23930.93 MB/s

SSE:

BenchmarkBytes/16-4   	200000000	         8.63 ns/op	 1852.98 MB/s
BenchmarkBytes/1024-4 	 50000000	        39.4 ns/op	25993.00 MB/s
BenchmarkBytes/65k-4  	   500000	      2733 ns/op	23975.08 MB/s

Word-wise:

BenchmarkBytes/16-4   	100000000	        10.5 ns/op	1521.66 MB/s
BenchmarkBytes/1024-4 	 10000000	       125 ns/op	8163.59 MB/s
BenchmarkBytes/65k-4  	   200000	      6895 ns/op	9504.62 MB/s

Byte-wise:

BenchmarkBytes/16-4    	100000000	        17.3 ns/op	 925.16 MB/s
BenchmarkBytes/1024-4  	  2000000	       841 ns/op	1216.31 MB/s
BenchmarkBytes/65k-4   	    30000	     54100 ns/op	1211.38 MB/s

Conclusions: fastxor is 2-25 times faster than a naive for loop. AVX and SSE performance is roughly equivalent, which makes me suspect that I may be doing something wrong. Lastly, for very small slices, the cost of the function call starts to outweigh the benefit of AVX/SSE (the Go compiler never inlines handwritten asm). If you need to xor exactly 16 bytes (common in block ciphers), the specialized Block function outperforms the more generic Bytes:

BenchmarkBlock-4   	500000000	         3.69 ns/op	4337.88 MB/s