When it comes to making things "faster", an easy thing to reach for is concurrency. GCD and NSOperations make dealing with threads a breeze and if your task lends itself to concurrency then it makes a lot of sense.
The thing to remember is, once you've introduced concurrency into a system, it's your responsibility to make sure your data stays in sync. Depending on the application, not doing so could result in hard to debug problems ranging from inconsistent UI to catastrophic data loss.
Since we want to avoid all of those things, it's best to go into multi-threading with synchronization in mind from the start.
That being said, it's also important to remember that ensuring correctness via these strategies will necessarily slow down your code. If you've decided to make a part of your app concurrent, make sure to test the performance to make sure the concurrency you've added actually makes things faster.
Like anything, you have a couple options when it comes to keeping your code in sync. To get a feel for how usage looks, we'll look at each of these mechanisms in the context of building out a simple cache. The cache is just a wrapper around NSDictionary that makes it usable from different threads.
We'll compare how each performs while inserting 10,000 objects into the dictionary. The scores are given in average nanoseconds per insert.
One thing to take note of before we jump in is the documentation for the dispatch_benchmark function you can use to do benchmarking.
- Code bound by computational bandwidth may be inferred by proportional changes in performance as concurrency is increased.
- Code bound by memory bandwidth may be inferred by negligible changes in performance as concurrency is increased.
- Code bound by critical sections may be inferred by retrograde changes in performance as concurrency is increased.
Interestingly, filling this dictionary with 10,000 elements is actually faster if we just do it serially without any multi-threading. Therefore we can be relatively sure that changes in speed we see are related to the differences in these locking strategies. To be fair, there's not a lot of other code it could be due to, but something to keep in mind for when you're looking at more complex situations.
First up is the basic NSLock object. To use an NSLock, all you need to do is make sure to initialize it with your object, and then when you want to use it to lock on a critical section, you can use the -lock and -unlock methods.
As you can see, its usage is fairly straightforward. Here, we're calling -lock before accessing the dictionary and then -unlock after we're done.
Avg Insert: 7,944 ns
Next is the @synchronized structure available to you in Objective-C. This one's an oldie but a goodie. Instead of having to worry about creating extra objects for locking, you can use the object you want to guard as the object to lock on.
This is useful if you have a specific object you're trying to synchronize access to, but if you have a more ambiguous chunk of code you need to lock on, then you can just as easily synchronize on self instead.
Avg Insert: 7,936 ns
C++ Scoped Mutexes
This is the type of mutex most often found in Texture. You do have to go through the trouble of implementing it yourself, but once you have, the syntax is actually really nice.
These mutexes work by locking when they're initialized and then unlocking when they go out of scope. Since they're initialized on the stack, this means when the method returns. This is nice because you don't have to worry about extra braces or matching -lock and -unlock calls. If you want to apply the lock to a smaller section, just wrap it, and the critical section, in curly braces to limit it to that scope.
Avg Insert: 7,739 ns
Internal Queues with GCD
Finally, we have the most lauded of options, the GCD serial queue. I've heard multiple times now that this is the *right* way to ensure your code is synchronized, but it took me a while to sit down and look at how to do it.
Turns out it's not too crazy.
Since all a lock is really doing is ensuring things happen in order instead of threads stepping on each others' toes, you can accomplish your aim with a custom dispatch queue.
Once you have your queue ready, you just need to make sure any access to your protected data structure happens in the context of a dispatch to your serial queue.
Since the access does happen in a block, you'll need to make a __block pointer that can be assigned to, synchronously dispatch the dictionary access to the queue, and then return the result after it was run on the queue.
Avg Insert: 6523 ns
Here we see our fastest time so far. I was honestly skeptical of this but it does seem to be the case. I always just kind of assumed the extra work under the hood of creating a block struct with captured arguments and dispatching it to a separate work queue would cost more than these other options but I guess that's why it's good to actually take some measurements.
One extra neat thing is that writing objects can actually be done asynchronously since there's no reason to wait on it to finish from the writing thread. The dictionary access will still get put in line in the queue and is guaranteed to happen before any subsequent fetches.
Concurrent Readers/Single Writer
Let's say you've decided that you want your data structure to allow concurrent reads without locking. In our current situation, NSMutableDictionary is thread-safe for reads, but you still need to lock during writing to avoid reading corrupted data mid-write. This is a pretty common optimization and can make things go a lot faster if you have a lot more reads than writes.
The really cool thing about using queues for synchronization is how easy it is to tweak things a bit to get what we want. Instead of a serial queue, you can use a concurrent queue.
Then, just update your write code to use the dispatch_barrier class of functions. Dispatching to a queue with a barrier means that the dispatched block will wait for all in-progress blocks to complete and then block any subsequent blocks until it has finished.
This means that blocking will only occur around writes like we were hoping. All reading threads will get their answer back synchronously without needing to wait on each other due to locks!
Now, this can of course be accomplished in other ways, but it doesn't really come for free with any of the other strategies we've looked at.
As always, your use cases may differ and the findings here may not reflect reality in your app. If your code is sensitive enough you should always be taking measurements yourself to see what works best for your use-case.
One thing I am a little curious about is if the performance of dispatch queues suffers as the blocks dispatched to them start capturing more and more objects.
If you want to pull down and play around with the code, here's a small test project on github. If your'e still a little fuzzy on why we need any of this, a kind of fun to try is to run the tests but disable all of these mutexes. You're pretty much guaranteed to get a cryptic crash involving deallocing data before it was allocated. The nice thing about doing tests like this is that it will show you just how resilient your code is to being used in a multi-threaded environment.
Have you played around with any strategies that blow these out of the water? Just wanna tell me my measurements are st00pid? Go ahead and let me know down in the comments.