Wednesday, June 8, 2022

Concurrency and Performance - basic scenarios & advice

Concurrency is hard. Even for those who have been through a university course on concurrency, aligning the theoretical knowledge with the implementations in mainstream languages is hard. I want to tell you about a couple of basic scenarios where multi-threading may seem like a promising idea for increasing a program's throughput, and what constraints come into play very quickly, making it more difficult than it might initially seem. I want to do it in an approachable manner, so that anyone with only basic programming experience can follow along and take the advice.


 

A small experiment

We will begin with a fairly simple problem: let's say we have a list of 100 ZIP codes and for each of them we want to retrieve the information related to city, state and geographical coordinates. There is a REST API service that does exactly that:

https://api.zippopotam.us/us/{zip-code}

For example, a GET request:

https://api.zippopotam.us/us/71256

returns a small JSON object: {"post code": "71256", "country": "United States", "country abbreviation": "US", "places": [{"place name": "Lillie", "longitude": "-92.6858", "state": "Louisiana", "state abbreviation": "LA", "latitude": "32.9529"}]}

The experiment we are going to do is to try to retrieve the information for 100 ZIP codes with one thread, then with 2, then 3 and so on to see whether we can complete the retrieval faster with more threads.

We will use a small Java program to run the experiment, because Java provides a clean and easy-to-understand abstraction over system threads. This is what our single thread is going to do:
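Roughly, it looks like the minimal sketch below (this is only an illustration, assuming Java 11's HttpClient; the class and method names ZipExperiment, fetchTask and runWith are made up for the sketch, and the full list of 100 ZIP codes is elided). The fetchTask method is the work a single thread performs, and runWith starts a configurable number of threads and waits for them to finish:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayList;
    import java.util.List;

    public class ZipExperiment {

        private static final HttpClient CLIENT = HttpClient.newHttpClient();

        // What a single thread does: fetch the JSON for every ZIP code in its slice.
        static Runnable fetchTask(List<String> zipCodes) {
            return () -> {
                for (String zip : zipCodes) {
                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create("https://api.zippopotam.us/us/" + zip))
                            .GET()
                            .build();
                    try {
                        HttpResponse<String> response =
                                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                        // The real program would parse and store response.body() here.
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            };
        }

        // Split the ZIP codes evenly among threadCount threads, start them and wait.
        static void runWith(int threadCount, List<String> allZips) throws InterruptedException {
            List<Thread> threads = new ArrayList<>();
            int chunk = (allZips.size() + threadCount - 1) / threadCount;
            for (int i = 0; i < threadCount; i++) {
                int from = i * chunk;
                int to = Math.min(from + chunk, allZips.size());
                if (from >= to) {
                    break;
                }
                Thread t = new Thread(fetchTask(allZips.subList(from, to)));
                threads.add(t);
                t.start();
            }
            for (Thread t : threads) {
                t.join();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            // The actual experiment uses a list of 100 ZIP codes (elided here).
            List<String> zips = List.of("71256", "90210", "10001");
            for (int threads = 1; threads <= 10; threads++) {
                long start = System.currentTimeMillis();
                runWith(threads, zips);
                System.out.println(threads + " thread(s): "
                        + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }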

The threads are started in such a way that we can easily modify the number of threads working on our task - in the sketch above, that is the threadCount parameter of runWith.

I ran this program on my laptop (i9 CPU), while I was connected through WiFi and there was not much else going on. Below are the results for the number of threads ranging from 1 (completely sequential execution) to 10:

 

Now, let's note down some facts:

  • the amount of information we are receiving for each ZIP is very small
  • there are no CPU intensive operations done in our program
  • most of the time is spent on sending a GET request and waiting for the response

Because of this, using 2 or 3 threads helps significantly: while one of them is waiting for a response, another one can be sending its next request, and so on and so forth. However, threads are expensive for the operating system to manage. When we use 10 threads (so that each retrieves only 10 JSON objects), the overhead of managing them exceeds the benefit of using them. In fact, this phenomenon kicks in pretty quickly; we don't even have to go up to 10 threads to notice it.

In more advanced scenarios, there are other constraints in play than just thread management. I'll try to visualize this with some hand drawings.

Let's assume that the data we receive is rather big: each GET request results in a few megabytes being transferred, and we still want to make 100 requests. Now, when adding new pipes (threads), we are constrained not only by the manageable number of threads, but also by our total network bandwidth. Our program could run on a machine powerful enough that the 4th or 5th thread still adds benefit, yet before we hit the thread-management constraint, we may hit the bandwidth constraint. Similarly, if the focus is not on data transmission but on data processing or calculations, we get something like this:


As in the previous example, the processing-power constraint can hit us earlier than the thread-management constraint, especially given that most of our programs are applications running in operating-system user space, where there is always something else going on. And, finally, these two scenarios can be combined into an even more involved one, where i) the operations are CPU intensive and ii) the amount of data that is sent and/or retrieved is large.

General advice on using threads:

  1. Be aware whether your runtime environment actually supports concurrent execution of system threads. For example, the Java Virtual Machine does, but the regular Python (CPython) environment does not (you can learn why here). While it is OK to use threads in CPython to handle blocking I/O operations, you will not get any benefit from trying to use multiple CPU cores - the same is true for any other execution environment that uses a Global Interpreter Lock (e.g. Ruby MRI).
  2. Typical scenarios where you want to continue doing something while another thread waits on I/O can be handled well with a small number of threads (for example, one or two in addition to the main thread). These scenarios are not very difficult to reason about, and the chances of messing things up are not that high.
  3. Any massive use of threads can turn out to be tricky - experiment and fine-tune your solution based on execution time, CPU and network measurements, as appropriate. It is an involved topic and it depends heavily on the execution environment. It is also a common source of mistakes when many threads are used carelessly, in the hope that this alone will speed up execution. Many times you will be better off:
    • leaving the execution sequential OR
    • running different chunks of what you want to do as separate processes OR
    • using a 3rd party solution where the custom development is limited to, for example, coding and injecting worker threads, while the synchronization and data exchange are handled for you under the hood (a small thread-pool sketch follows after this list)
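The same idea can be seen in miniature with the JDK's own thread pools: we only code the worker tasks, and the pool takes care of distributing them over a fixed number of threads. Below is a minimal sketch (not taken from the experiment above; the class name PoolExample and the pool size of 4 are arbitrary), where the pool size is the single knob to tune based on measurements:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class PoolExample {
        public static void main(String[] args) throws InterruptedException {
            // A short list stands in for whatever work items we actually have.
            List<String> zips = List.of("71256", "90210", "10001");

            // The pool size is the one parameter we tune based on measurements.
            ExecutorService pool = Executors.newFixedThreadPool(4);

            for (String zip : zips) {
                pool.submit(() -> {
                    // Fetch and process a single ZIP code here.
                    System.out.println("processing " + zip);
                });
            }

            pool.shutdown();                            // stop accepting new tasks
            pool.awaitTermination(1, TimeUnit.MINUTES); // wait for the submitted ones
        }
    }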

A note on system threads and green threads

An alternative to using system threads is to use so-called green threads, or greenlets. An example of an execution environment that features green threads is Stackless Python. Green threads are called green because using one does not actually create a new system thread. Instead, the execution environment handles the switching between all the green threads on its own, within its single system-level thread. Because of this, green threads can be used in massive numbers - thousands, for example. Setting the performance aspect aside, green threads can be an excellent way to simulate massive concurrency in order to study scientific or statistical problems.

Pictures used:

[1] A meat mincer - Creative Commons Attribution-Share Alike 2.0 Generic license

[2, 3] Author's screenshots of Eclipse IDE with explanatory comments added

[4, 5, 6] Author's hand drawings
