Concurrency

%speaker
Loosely, concurrency is performing multiple tasks at once. A task is a conceptual unit of work, similar to a function, but tasks don’t have to be defined on function boundaries. The granularity (or grain size) of a task is the amount of work a task performs. The environment within which a task executes is the execution context. The hardware used to execute a task is the execution agent.

Concurrency allows for increased throughput, the amount of work that can be completed in a given amount of time, by utilizing additional available agents. Concurrency also allows for lower latency, the amount of time between when a task is submitted and when it is completed. In this chapter, we will be dealing with concurrent computation, as opposed to concurrent IO, although the two ideas are closely related.

[Slide describing latency/interactivity/responsiveness]

Forms of Concurrency

%speaker
“At once” can mean any combination of parallel and interleaved execution. Parallel execution is simultaneous execution using independent execution agents. Interleaved execution is reallocating agents from one task to another. The procedure of reallocating an agent from one task to another is a context switch.

Interleaved execution may be scheduled in one of three ways. Preemptive scheduling interrupts one task, usually at a given time interval or when it is waiting, and switches to another task. Cooperative scheduling relies on a task explicitly yielding or awaiting at points where a context switch may occur. Queued scheduling runs each task to completion before starting the next.

Why Concurrency Matters

Amdahl’s Law

%speaker
Amdahl’s Law shows the challenge of unlocking performance through concurrency. Even with no overhead, a program with only 10% serialized execution will see only a 6.4x speedup on a 16-processor machine compared to a single core. Amdahl’s Law also shows the potential performance improvement from additional concurrency - eliminating that last 10% of serialized execution would unlock an additional 2.5x performance gain.
![](/better-code/chapters/0500-concurrency/img/amdahls-law.svg){:height="770"}
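
For reference (not on the slide), the arithmetic behind those numbers: with serial fraction s and N processors, Amdahl's Law gives the speedup

$$ S(N) = \frac{1}{s + \frac{1 - s}{N}}, \qquad S(16)\Big|_{s=0.1} = \frac{1}{0.1 + 0.9/16} = 6.4, \qquad \frac{S(16)\big|_{s=0}}{S(16)\big|_{s=0.1}} = \frac{16}{6.4} = 2.5 $$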

Why Concurrency Is Hard

Structure of Concurrency [dag model?]

Forking and Joining Tasks

auto r = f(x) * g(x);

Forking and Joining Tasks

auto g_ = fork([&]{ return g(x); }); // execute concurrently
auto f_ = f(x);
auto r = f_ * join(g_);
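
Here fork and join are not standard library functions; a minimal sketch of what they might look like, assuming fork simply wraps std::async and join waits on the resulting future:

#include <future>
#include <utility>

template <class F>
auto fork(F&& f) {
    // launch the task on another execution agent
    return std::async(std::launch::async, std::forward<F>(f));
}

template <class Future>
auto join(Future&& f) {
    // block until the task completes and return its result
    return std::forward<Future>(f).get();
}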

Threads

Thread Example

// Fork
decltype(g(x)) g_;
std::thread thread{[&]{
    g_ = g(x);
}};

auto f_ = f(x);

// Join
thread.join();

auto r = f_ * g_;

Cost of Threads

%speaker
The cost of a thread context switch is a combination of kernel calls, which require switching between protection rings, and cache invalidation.

Each thread also consumes wired memory, which is not paged out by the VM system. If memory is under pressure, a thread, even if idle, imposes a performance penalty on the system.
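
As a rough illustration (not from the notes), even just creating and joining a thread goes through the kernel; a small timing loop gives a feel for the per-thread overhead on a given machine:

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    constexpr int n = 1000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i != n; ++i) {
        std::thread t{[] { /* empty task */ }};
        t.join(); // create and immediately join n threads
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count() / n
              << " microseconds per thread create/join\n";
}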

Thread Context Switches

Thread Pools

Communicating Between Concurrent Tasks and Races

%speaker
A race condition occurs when a particular ordering of effects is required for correct execution but that ordering is not guaranteed.

Thread context switches are expensive in part because modern processors have dedicated memory caches per core. Making the results of a computation on one core visible to another core requires a memory fence. A memory fence establishes an ordering of memory load and store operations, and it must be understood by both the processor and the compiler. If the compiler is not aware of a memory fence, it could reorder operations so that two threads see inconsistent results.

An evaluation that writes to a memory location while another evaluation reads or writes the same memory location is a data race, unless both are atomic operations or one of the conflicting operations happens-before the other, as established with a memory fence. The result of a data race is undefined behavior.
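
A minimal sketch (not from the slides) of establishing happens-before with an atomic release/acquire pair, so that a plain write made in one thread is guaranteed to be visible in another:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag used to publish the data

void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release); // orders the write to payload before the flag
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // pairs with the release store
    assert(payload == 42); // the write to payload happens-before this read - no data race
}

int main() {
    std::thread a{producer}, b{consumer};
    a.join();
    b.join();
}

The bad_cow example below shows how, even with an atomic reference count, the surrounding logic can still race.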

template <typename T>
class bad_cow {
    struct object_t {
        explicit object_t(const T& x) : data_m(x) {}
        std::atomic<int> count_m{1};
        T           data_m; };
    object_t* object_m;
 public:
    explicit bad_cow(const T& x) : object_m(new object_t(x)) { }
    ~bad_cow() { if (0 == --object_m->count_m) delete object_m; }
    bad_cow(const bad_cow& x) : object_m(x.object_m) { ++object_m->count_m; }

    bad_cow& operator=(const T& x) {
        if (object_m->count_m == 1) object_m->data_m = x;
        else {
            object_t* tmp = new object_t(x);
            // race: if another owner releases its reference concurrently, this
            // decrement can reach zero and the old object is never deleted
            --object_m->count_m;
            object_m = tmp;
        }
        return *this;
    }
};

Waiting, Locking, and Deadlocks

std::atomic_flag done;
decltype(g(x)) g_;

// fork
system_thread_pool([&]{
    g_ = g(x);
    done.test_and_set();
    done.notify_one();
});

auto f_ = f(x);

done.wait(false); // join (deadlock?)
auto r = f_ * g_;

Continuations and Futures

    // fork
    std::future<decltype(g(x))> g_ = async(system_thread_pool, [&]{ return g(x); });

    auto f_ = f(x);

    // join (deadlock?)
    return f_ * g_.get();

    // stlab variant: join both futures and attach a continuation
    return stlab::when_all(stlab::async(system_scheduler, [=]{ return g(x); }),
                           stlab::async(system_scheduler, [=]{ return f(x); })) |
           [](const auto& a, const auto& b){ return a * b; };

Cancellation

Sender/Receiver Model

Serial Queues and Actors

Channels

Coroutines and Fibers

The Exit Problem

Closing thoughts

    return f(x) * g(x);