Each atomic operation fetches and modifies a given cache line (a "read-modify-write").
A full cache flush invalidates (flushes) the processor's internal caches and issues a special-function bus cycle that directs external caches to flush themselves as well.
Atomic operations (atomics) such as Compare-and-Swap (CAS), Fetch-and-Add (FAA), and Swap (SWP) are ubiquitous in parallel programming.
Yet, performance tradeoffs between these operations and various characteristics of parallel systems, such as the structure of caches, are unclear and have not been thoroughly analyzed.
It's about understanding how the underlying hardware operates and programming in a way that works with that, not against it.
We get a number of comments and questions about the mysterious cache line padding in the Ring Buffer, which I referred to in the last post.
In our benchmarks, we consider several parameters and aspects of the system. The design of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations (in the paper, we discuss ways to alleviate this in future architectures).
In our work, we illustrate that a simple performance model that takes into account the cache coherence state of the accessed cache line is enough to account for most performance results.
Write-through is slower but cleaner (memory is always consistent); write-back is faster but more complicated when multiple cores share memory, requiring a cache-coherency protocol.
TLBs are small (maybe 64 entries), fully-associative caches for page table entries.
Cache maintenance operations are defined to act on particular memory locations.