Sunday, December 17, 2017

How Does Task in C# Affect Performance?

When I talk about using Task (and await -- which is a wrapper around Task), often there's a question about performance. The concurrent nature of Task means that there is some overhead. We don't get these things for free.
The good news is that the cost is negligible for the types of things we generally deal with in user space.
To test this out, I did some experimentation with my Digit Recognizer application (GitHub project). In that project, I create a large number of Tasks in order to maximize the usage of the CPU cores on my machine.

There are a couple of caveats with regard to this particular project:
  1. The parallel tasks are CPU-bound, meaning the CPU speed is the constraint. This is different from tasks that are I/O-bound (whether waiting for a file system, network, or database) -- there's a short sketch contrasting the two right after this list.
  2. The number of tasks is a "reasonable" size. The definition of reasonable will vary based on the application, but we'll see this as we take a closer look at the code.
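Just to make that first caveat concrete, here is a minimal sketch of the difference. None of this comes from the project; the method name, the data, and the URL are placeholders. CPU-bound work gets pushed onto the thread pool with Task.Run, while I/O-bound work simply awaits a call that is already asynchronous.

```csharp
// Minimal contrast sketch -- names and the URL are placeholders, not project code.

// CPU-bound: the calculation itself keeps a core busy,
// so it is queued to the thread pool with Task.Run.
Task<int> cpuBound = Task.Run(() => ExpensiveCalculation(pixels));

// I/O-bound: the thread is free while we wait on the network,
// so no Task.Run is needed -- we just await the asynchronous call.
Task<string> ioBound = httpClient.GetStringAsync("https://example.com/data");
```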

The Existing Application
This is a machine learning application that is designed to try out different algorithms for recognizing hand-written digits. Here's a typical output (from the master branch of the project):
[Screenshot: side-by-side output comparing the Manhattan and Euclidean classifiers]
This shows 2 algorithms side-by-side. The left side uses Manhattan distance, and the right side uses Euclidean distance. As we can see, the Euclidean distance takes a bit longer, but it is more accurate.

The way that I maximize CPU usage is to split each item into a separate task. That allows me to use all 8 cores on my machine.

Here's the code to run things in parallel (from RecognizerControl.xaml.cs in the DigitDisplay project):
[Code screenshot: the parallel processing method from RecognizerControl.xaml.cs]
This uses a bit of a brute-force approach to running things in parallel. The string array that comes into this method represents the items we're trying to recognize (each string is a separate digit - you can read more about it here: Displaying Images from Pixel Data).

The "foreach" loop grabs each individual string and creates a Task that will run it through the classifier. The "Recognizer.predict()" function is the one that does the heavy lifting.

After kicking off the task (Task.Run creates a Task and runs it as soon as possible), we add a continuation (in the "ContinueWith()" method). This runs the "CreateUIElements()" method that will create a bitmap, pair it with the prediction result, and then display it in the WrapPanel of our UI. This uses "TaskScheduler.FromCurrentSynchronizationContext()" to get back to the UI thread. You can learn more about this from my articles and videos about Task: Task, Await, and Asynchronous Methods.

The last line of the method ("Task.WhenAny()") is there to reset the timer appropriately. This is how we get the "Duration" on the screen.
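Since the post shows that code as an image, here is a rough sketch of the pattern described above. This is not the exact code from the repository: "Recognizer.predict" and "CreateUIElements" are names taken from the description, while the method name, the signatures, and the use of Task.WhenAll to wait for everything are assumptions made for illustration.

```csharp
// Rough sketch of the pattern described above -- not the exact repository code.
// "Recognizer.predict" and "CreateUIElements" come from the article text;
// the signatures and surrounding details are assumptions for illustration.
private async Task RunRecognizerAsync(string[] rawData)
{
    var uiScheduler = TaskScheduler.FromCurrentSynchronizationContext();
    var tasks = new List<Task>();

    foreach (string imageString in rawData)
    {
        // Task.Run queues the CPU-bound prediction on the thread pool right away.
        var prediction = Task.Run(() => Recognizer.predict(imageString));

        // The continuation builds the bitmap and adds it to the WrapPanel,
        // so it is scheduled back onto the UI thread.
        var display = prediction.ContinueWith(
            t => CreateUIElements(imageString, t.Result),
            uiScheduler);

        tasks.Add(display);
    }

    // Wait for the work to finish so the "Duration" timer can be updated
    // (this sketch waits for all of the tasks with Task.WhenAll).
    await Task.WhenAll(tasks);
}
```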

A Lot of Tasks
This generates *a lot* of Tasks. A Task is created for each item (242 for each classifier), and these are all told to "start now". It's then up to the Task Scheduler to figure out when and where to run each of these tasks.

So I'm creating 484 tasks, and saying "run these when you can".

That can't be good for performance, right?

It runs faster than when it was single-threaded (and that makes a whole lot of sense). But what is it really doing for performance?

Performance Test
I was talking to someone about Tasks and performance, and it made me curious as to how creating all of these Tasks impacts the performance of this application.

I figured that if I stripped away the CPU-intensive processing, what we would have left would be the overhead of using Tasks this way. So I started by creating a "Null Classifier" that didn't do any work.

Initial results were promising:
[Screenshot: initial results comparing the Manhattan Classifier and the Null Classifier]
If you'd like to see this code, then you'll need to switch over to the "NullClassifier" branch of the GitHub project: https://github.com/jeremybytes/digit-display/tree/NullClassifier.

Here's the very complex code for the Null Classifier (from "FunRecognizer.fs" in the "FunctionalRecognizer" project):
[Code screenshot: the nullClassifier function from FunRecognizer.fs]
The "nullClassifier" takes in an integer array, and always returns "0". This doesn't do any processing, but it's also not very accurate :)

To get a clearer picture of what was going on, I also increased the number of records from 242 to 5,000 (which is actually 10,000 records since we have 2 classifiers). In the results, we can see that the "Manhattan Classifier" took 180 seconds to finish; the "Null Classifier" took 6 seconds to finish.

That implies that there is 6 seconds of overhead for using Task this way.

But that's also a bit misleading.

A More Accurate Test
The 6 seconds didn't seem like that much (relatively speaking), but it also seemed a bit high for what was going on. As it turned out, I wasn't eliminating all of the processing.

A big part of the processing was creating the bitmaps and putting them into the WrapPanel of the UI. So I went through and commented out the code that generated the UI elements and ran my test again. This time, the results were pretty surprising (even to me):
[Screenshot: results with the UI-generation code removed]
This processed the same number of records: 5000. The "Manhattan Classifier" took 164 seconds to complete, so it was a little bit faster. But the "Null Classifier" took no time at all (well, less than a second).

At first, I thought I may have taken out a bit too much code. Maybe the continuations weren't really running. But each record was processed, and we can see that the error count is the same as before: 4515. The error count is incremented in the continuation, so I knew that was still running.
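For reference, the stripped-down continuation looks something like this sketch (based on the description above; the UI-generation calls are commented out, and names such as "expected", "CreateBitmap", and "DigitsBox" are placeholders rather than the repository's actual identifiers):

```csharp
// Sketch of the stripped-down continuation -- names are placeholders.
// The bitmap/WrapPanel work is commented out, but the error count is still
// tallied here, which is why the error totals match the earlier run.
var display = prediction.ContinueWith(
    t =>
    {
        //var bitmap = CreateBitmap(imageString);
        //DigitsBox.Children.Add(CreateUIElements(bitmap, t.Result));

        if (t.Result != expected)
        {
            errorCount++;
        }
    },
    uiScheduler);
```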
So by eliminating the CPU-intensive operation and the UI-intensive code, I found that the Task overhead is negligible for this type of work.

That was good news to me. Performance is way better than single-threading (3 to 6 times better depending on the machine and the number of cores).

Caveats
It's very easy to mess up parallel programming. If we have interactions between the processes, things can get ugly very quickly. This code was designed to be parallelized from the ground up. This means that I made sure that each Task could be isolated and run without outside dependencies.

I also have a CPU-bound operation. This means that by using Task this way (and letting the Task Scheduler do its job of figuring out the best way to mete out the jobs), I can take advantage of multi-core machines. In fact, I just got a new laptop with 8 cores, and this code runs about twice as fast as on the 4-core machine I was using before.

I'm also running this on a reasonable number of records. In typical usage, this application would process 600 to 800 records -- the number that would fit on the screen. Even when I increased that 10-fold, I had good performance. If I were trying to process hundreds of thousands of records, I would expect this to break down pretty quickly.

For different scenarios, I'd definitely take a look at the advice in Parallel Programming with Microsoft .NET (my review). It has different patterns that we can follow depending on what our needs are.

Update: If you'd like to contribute your own ideas to improving this code, take a look at the next article which describes the application goals: Your Ideas Needed: Other Ways to Run in Parallel. Feel free to submit a pull request.

Wrap Up
I really like using Task. Whenever I tried to do my own threading, bad things would generally happen. That's why I've been a big fan of the BackgroundWorker component in the past and of Task and await today. These abstractions give us a higher-level way of dealing with these concurrent processes. And we can leave it up to the compiler and .NET runtime designers to take care of the details. I'm happy to leave things to people who know way more than I do.

If you want to explore some more, be sure to take a look at the articles in the README.md of the "digit-display" project on GitHub, and also check out the articles and videos on getting started with Task and await.

Happy Coding!

11 comments:

  1. Task is a great abstraction and as you've discovered, Task is also fairly lightweight.

    I was planning to use Tasks heavily and wanted to see the cost of using them excessively and found them to be quite practical.

    They have a small memory overhead and CPU cost, but not enough to worry about for most use cases.

    I wrote a blog post detailing my results earlier this year: https://www.danielcrabtree.com/blog/248/maximizing-throughput-the-overhead-of-1-million-tasks

  2. For myriads of async operations, you should probably try Hopac. It outperforms Task by a mile:
    http://vaskir.blogspot.ru/2015/01/parallel-reduce-hopac-vs-asyncs-vs-tasks.html

  3. I'd be interested in seeing if there's any performance benefit to using Parallel (with its partitioning and multiple-work-items-per-task system) as opposed to throwing all the tasks on the thread pool at once. Probably negligible in this case, since the task overhead is so low to start with.

    Replies
    1. I originally tried using a Parallel.ForEach, but ran into problems with getting stuff back to the UI thread. I ended up taking the brute force approach with Task to get the UI to update as the items were processed. It worked, so I didn't think too much about it for this application. I'll have to go take another stab at using Parallel.

    2. Since I don't see any I/O that these tasks are performing, I also thought that it might be more appropriate to use Parallel, once the UI thread dispatching issues are solved.

    3. I took another run at using Parallel, but didn't have much luck. I'd like the UI to get populated as things are done processing, but I haven't found the right combination for that. You can read more about it (and contribute some code if you'd like): https://jeremybytes.blogspot.com/2017/12/your-ideas-needed-other-ways-to-run.html

  4. I had a similar problem with my fractal generator https://www.microsoft.com/en-us/store/p/fractal-map/9nblggh444sr. Instead of creating a task for each calculation, I split the fractal into 8 equal pieces (because I have 8 cores) and created a task for each piece. Now I only have 8 tasks. I found that this ran faster. If running on fewer than 8 cores, the task scheduler will still manage the tasks appropriately. Of course, I was initially generating a much higher number of tasks because I have a calculation for each pixel, which was giving me over 2 million tasks!

    Replies
    1. Yep, this application is nowhere near 2 million tasks. Although I have thought of a massive version of Conway's Game of Life that would need something similar. Thanks for sharing. (BTW, I love fractals)

    2. As I said above, try Hopac Jobs. They do exactly what you want, but it is already implemented by someone else in a great .NET framework.
      Hopac utilizes its own scheduler, ultra-lightweight primitives for asynchronous computations (called Jobs), and the power of selective asynchronous programming with negative acknowledgement, etc. It scales to millions of jobs and doesn't even start to sweat.

  5. Would be cool if you could use a syntax highlighter rather than posting code as an image, if this platform supports that.
