We spend so much time drilling algorithmic complexity. Big-O and all that. But performance is so often about contention and memory, especially when working in parallel.
I was just working on a program that does Monte Carlo simulation. That means running the same algorithm over the same data thousands of times, with some amount of injected randomness. My single-threaded approach was taking 40 seconds, and I wanted to make it faster. Make is parallel!
I tried all kinds of scaling factors, and unsurprisingly the best was 8-way on a 10-core system. It got me down to…50 seconds?!?!? Yes, slower. Slower? Yes. Time to pull out Instruments.
My first mistake was trying to make it parallel before I pulled out Instruments. Always start by profiling. Do not make systems parallel before you’ve optimized them serially. Sure enough, the biggest bottleneck was random number generation. I’d already switched from the very slow default PRNG to the faster GKLinearCongruentialRandomSource. The default is wisely secure, but slow. The GameKit PRNGs are much faster, but more predictible. For Monte Carlo simulation, security is not a concern, so a faster PRNG is preferable. But it was still too slow.
Why? Locking. GKLinearCongruentialRandomSource has internal mutable state, and is also thread-safe. That combination means locks. And locks take time, especially in my system that generates tens of millions of random values, so there is a lot of contention.
Solution: make the PRNG a parameter and pass it in. That way each parallel task gets its own PRNG and there’s no contention. At the same time, I switched to a hand-written version of xoshiro256+ which is specifically designed for generating random floating-point numbers. Hand-writing my own meant that I know what it does and can manage locking. (I actually used a struct that’s passed
inout rather than locking. I may test out a class + OSAllocatedUnfairLock to see which is faster.)
Anyway, that got it down to 30s (with 8-way parallelism), but still far too slow. Using 8 cores to save 25% is not much of a win. More Instruments. Huge amounts of time were spent in retain/release. Since there are no classes in this program, that might surprise you, but copy-on-write is implemented with internal classes, and that means ARC, and ARC means locks, and highly contended locks are the enemy of parallelism.
It took a while to track down, but the problem was roughly this:
bigObject includes some arrays (thus COW and retain/release) and includes the object that is being updated. Everything is a struct, so there’s definitely going to be a copy here as well. I rewrote
update and all the other methods to take two integer parameters rather than one object parameter and cut my time down to 9 seconds.
Total so far from cleaning up memory and locks: >75% improvement.
Heaviest remaining stack trace that I’m digging into now:
swift_allocObject. It’s always memory…