Concurrency

%speaker
Loosely, concurrency is performing multiple tasks at once. A task is a conceptual unit of work, similar to a function, but tasks don’t have to be defined on function boundaries. The granularity (or grain size) of a task is the amount of work a task performs. The environment within which a task executes is the execution context. The hardware used to execute a task is the execution agent.

Concurrency allows for increased throughput, the amount of work that can be completed in a given amount of time, by utilizing additional available agents. Concurrency also allows for lower latency, the amount of time between when a task is submitted and when it is completed. In this chapter, we will be dealing with concurrent computation, as opposed to concurrent IO, although the two ideas are closely related.

[Slide describing latency/interactivity/responsiveness]

Forms of Concurrency

%speaker
“At once” can mean any combination of parallel and interleaved execution. Parallel execution is simultaneous execution using independent execution agents. Interleaved execution is reallocating agents from one task to another. The procedure of reallocating an agent from one task to another is a context switch.

Interleaved execution may be scheduled in one of three ways. Preemptive scheduling interrupts one task, usually at a given time interval or when it is waiting, and switches to another task. Cooperative scheduling relies on a task explicitly yielding or awaiting at points where a context switch may occur. Queued scheduling runs each task to completion before starting the next.

Why Concurrency Matters

Amdahl’s Law

%speaker
Amdahl’s Law shows the challenge of unlocking performance through concurrency. Even with no overhead, a program with only 10% serialized execution will see only a 6.4x speedup on a 16-processor machine compared to a single core. Amdahl’s Law also shows the potential performance improvement from additional concurrency - eliminating that last 10% of serialized execution would unlock an additional 2.5x performance gain.
![](/better-code/chapters/0500-concurrency/img/amdahls-law.svg){:height="770"}
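
For reference (not on the slide), the arithmetic behind those numbers: with serial fraction s and N processors, Amdahl's Law gives the speedup

$$ S(N) = \frac{1}{s + \frac{1 - s}{N}}, \qquad S(16)\Big|_{s=0.1} = \frac{1}{0.1 + 0.9/16} = 6.4, \qquad \frac{S(16)\big|_{s=0}}{S(16)\big|_{s=0.1}} = \frac{16}{6.4} = 2.5 $$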

Why Concurrency Is Hard

Structure of Concurrency [dag model?]

Forking and Joining Tasks

auto r = f(x) * g(x);

Forking and Joining Tasks

auto g_ = fork([&]{ return g(x); }); // execute concurrently
auto f_ = f(x);
auto r = f_ * join(g_);
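
Here fork and join are not standard library functions; a minimal sketch of what they might look like, assuming fork simply wraps std::async and join waits on the resulting future:

#include <future>
#include <utility>

template <class F>
auto fork(F&& f) {
    // launch the task on another execution agent
    return std::async(std::launch::async, std::forward<F>(f));
}

template <class Future>
auto join(Future&& f) {
    // block until the task completes and return its result
    return std::forward<Future>(f).get();
}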

Threads

Thread Example

// Fork
decltype(g(x)) g_;
std::thread thread{[&]{
    g_ = g(x);
}};

auto f_ = f(x);

// Join
thread.join();

auto r = f_ * g_;

Cost of Threads

%speaker
The cost of a thread context switch is a combination of kernel calls, which require switching between protection rings, and cache invalidation.

Each thread also consumes wired memory, which is not paged out by the VM system. If memory is under pressure, a thread, even if idle, imposes a performance penalty on the system.
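
As a rough illustration (not from the notes), even just creating and joining a thread goes through the kernel; a small timing loop gives a feel for the per-thread overhead on a given machine:

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    constexpr int n = 1000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i != n; ++i) {
        std::thread t{[] { /* empty task */ }};
        t.join(); // create and immediately join n threads
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count() / n
              << " microseconds per thread create/join\n";
}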

Thread Context Switches

Thread Pools

Communicating Between Concurrent Tasks and Races

%speaker
A race condition occurs when a particular ordering of effects is required for correct execution but that ordering is not guaranteed.

Thread context switches are expensive in part because modern processors have dedicated memory caches per core. Making the results of a computation on one core visible to another core requires a memory fence. A memory fence establishes an ordering of memory load and store operations, and it must be understood by both the processor and the compiler. If the compiler is not aware of a memory fence, it could reorder operations so that two threads see inconsistent results.

An evaluation that writes to a memory location while another evaluation reads or writes the same memory location is a data race, unless both are atomic operations or one of the conflicting operations happens-before the other, as established with a memory fence. The result of a data race is undefined behavior.
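
A minimal sketch (not from the slides) of establishing happens-before with an atomic release/acquire pair, so that a plain write made in one thread is guaranteed to be visible in another:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag used to publish the data

void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release); // orders the write to payload before the flag
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // pairs with the release store
    assert(payload == 42); // the write to payload happens-before this read - no data race
}

int main() {
    std::thread a{producer}, b{consumer};
    a.join();
    b.join();
}

The bad_cow example below shows how, even with an atomic reference count, the surrounding logic can still race.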

template <typename T>
class bad_cow {
    struct object_t {
        explicit object_t(const T& x) : data_m(x) {}
        std::atomic<int> count_m{1};
        T           data_m; };
    object_t* object_m;
 public:
    explicit bad_cow(const T& x) : object_m(new object_t(x)) { }
    ~bad_cow() { if (0 == --object_m->count_m) delete object_m; }
    bad_cow(const bad_cow& x) : object_m(x.object_m) { ++object_m->count_m; }

    bad_cow& operator=(const T& x) {
        if (object_m->count_m == 1) object_m->data_m = x;
        else {
            object_t* tmp = new object_t(x);
            // race: if another owner releases its reference concurrently, this
            // decrement can reach zero and the old object is never deleted
            --object_m->count_m;
            object_m = tmp;
        }
        return *this;
    }
};

Waiting, Locking, and Deadlocks

std::atomic_flag done;
decltype(g(x)) g_;

// fork
system_thread_pool([&]{
    g_ = g(x);
    done.test_and_set();
    done.notify_one();
});

auto f_ = f(x);

done.wait(false); // join (deadlock?)
auto r = f_ * g_;

Continuations and Futures

    // fork
    std::future<decltype(g(x))> g_ = async(system_thread_pool, [&]{ return g(x); });

    auto f_ = f(x);

    // join (deadlock?)
    return f_ * g_.get();

    // stlab variant: join both futures and attach a continuation
    return stlab::when_all(stlab::async(system_scheduler, [=]{ return g(x); }),
                           stlab::async(system_scheduler, [=]{ return f(x); })) |
           [](const auto& a, const auto& b){ return a * b; };

Cancellation

Sender/Receiver Model

Serial Queues and Actors

Channels

Coroutines and Fibers

The Exit Problem

Closing thoughts

    return f(x) * g(x);