Draft
24 commits
f75c9a1
feat(parallelism): lec30 - lec34
cjeng8771 Jan 27, 2026
b9ee7ba
feat(parallelism): summary exercises precheck skel
cjeng8771 Feb 4, 2026
892e515
Merge branch 'main' into parallelism-branch
cjeng8771 Apr 8, 2026
c0b5fbd
feat(parallelism): all lec 30-34 image directives
cjeng8771 Apr 8, 2026
99bb7e4
feat(lec29): cache blocking updates
yanlisa Apr 10, 2026
500f1f5
feat(lec31): pre
yanlisa Apr 10, 2026
e959bca
Merge branch 'main' into parallelism-branch
yanlisa Apr 12, 2026
63c2243
chore: remove old cache folders, reorganize parallelism
yanlisa Apr 12, 2026
42cbf0b
feat(lec31): through flynn
yanlisa Apr 13, 2026
fcec4cb
Merge branch 'main' into parallelism-branch
yanlisa Apr 13, 2026
57f6e66
Merge branch 'main' into parallelism-branch
yanlisa Apr 14, 2026
88da555
feat(lec32): post
yanlisa Apr 15, 2026
42b0c6e
Merge branch 'main' into parallelism-branch
yanlisa Apr 15, 2026
8844249
feat(lec34): merge
yanlisa Apr 15, 2026
b73c4e7
feat(lec32): post, openmp dgemm
yanlisa Apr 17, 2026
8ab2873
feat(lec32): mimd, threads
yanlisa Apr 20, 2026
45363b8
Merge branch 'main' into parallelism-branch
yanlisa Apr 20, 2026
db6c8d1
feat(lec32): finish openmp
yanlisa Apr 22, 2026
58e2d81
Merge branch 'main' into parallelism-branch
yanlisa Apr 22, 2026
a81ad15
Merge branch 'main' into parallelism-branch
yanlisa Apr 23, 2026
169c50f
feat(lec34): locks
yanlisa Apr 27, 2026
f325c7c
Merge branch 'main' into parallelism-branch
yanlisa Apr 28, 2026
a8db215
feat(datapath): load section reformat
yanlisa Apr 28, 2026
5ba7e78
feat(lec34): cache coherency, finally
yanlisa Apr 30, 2026
32 changes: 13 additions & 19 deletions content/datapath/load-store.md
Expand Up @@ -31,22 +31,14 @@ title: "Supporting Loads and Stores"

## Building a Processor with DMEM access

Recall that [load](#sec-load-word) instructions are [I-Type](#sec-rv-load) because they read a register, have an immediate, and write to a register a 32-bit value read from memory.

To support `lw`, we use a similar datapath to `addi` but instead compute an address with which to access DMEM.

* RegFile: We **read** _one_ register `rs1` and write one register `rd`. The value to write is a **word** read from memory.
* PC: We **read** from and **write** to PC. The value to write is `pc + 4`.
* DMEM: We **read** the memory word at address `R[rs1] + imm`.

Loads (and stores) participate in the `MEM` phase of [the five step process](#sec-five-steps). We therefore introduce additional logic connecting DMEM to the ALU and the RegFile, as shown in @fig-lw-new-blocks.
Loads and stores participate in the `MEM` phase of [the five step process](#sec-five-steps). We therefore introduce additional logic connecting DMEM to the ALU and the RegFile. @fig-lw-new-blocks describes DMEM access for load instructions.

::::{figure} images/lw-new-blocks.png
:label: fig-lw-new-blocks
:width: 100%
:alt: "Load datapath additions: DMEM connected to the ALU-computed address and a writeback mux selecting between ALU result and memory data."

DMEM: Connect and use a mux before `WB` (Write Back) phase.
For the `MEM` phase of a load instruction, connect DMEM to the ALU and use a mux before the `WB` (Write Back) phase.
::::

**DMEM**: To read the memory at an address, we use the ALU to compute the address as `alu = R[rs1] + imm`. This readily reuses the circuitry for arithmetic and logical I-Type instructions.
Expand All @@ -65,6 +57,17 @@ DMEM: Connect and use a mux before `WB` (Write Back) phase.
The `lw` datapath. Use the menu bar to trace through the animation or download a copy of the PDF/PPTX file.
:::: -->

(sec-datapath-load)=
## Tracing the Load Datapath

Recall that [load](#sec-load-word) instructions are [I-Type](#sec-rv-load) because they read a register, have an immediate, and write to a register a 32-bit value read from memory.

For `lw`, we use a similar datapath to `addi`, but we use the ALU to compute an address to pass into DMEM. State updates:

* RegFile: We **read** _one_ register `rs1` and write one register `rd`. The value to write is a **word** (again, `lw`) that is read from memory.
* PC: We **read** from and **write** to PC. The value to write is `pc + 4`.
* DMEM: We **read** the memory word at address `R[rs1] + imm`.
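
The three state updates above can be sketched in software. This is a minimal illustration rather than a model of the actual datapath: the names `execute_lw`, `sign_extend_12`, and the `state` dictionary are hypothetical, DMEM is modeled as a dictionary keyed by word-aligned byte address, and reads of unmapped addresses are assumed to return `0`.

```python
def sign_extend_12(imm12):
    """Interpret a 12-bit field as a signed two's-complement integer."""
    imm12 &= 0xFFF
    return imm12 - 0x1000 if imm12 & 0x800 else imm12


def execute_lw(state, rd, rs1, imm12):
    """Apply the state updates for `lw rd, imm(rs1)` to a toy machine state."""
    regs, dmem = state["regs"], state["dmem"]
    # ALU: compute the address R[rs1] + imm (wrapped to 32 bits).
    addr = (regs[rs1] + sign_extend_12(imm12)) & 0xFFFFFFFF
    # DMEM: read the word at that address (unmapped reads give 0, an assumption).
    word = dmem.get(addr, 0)
    # RegFile: write the loaded word to rd; x0 stays hardwired to zero.
    if rd != 0:
        regs[rd] = word
    # PC: advance to the next instruction.
    state["pc"] = (state["pc"] + 4) & 0xFFFFFFFF


# Hypothetical machine state: x5 holds 0x5008 and Mem[0x5000] holds 20.
state = {"pc": 0x100, "regs": [0] * 32, "dmem": {0x5000: 20}}
state["regs"][5] = 0x5008
execute_lw(state, rd=6, rs1=5, imm12=-8)   # lw x6, -8(x5)
```

After the call, `x6` holds the word read from address `0x5000` and the PC has advanced by 4. The only difference from an `addi` sketch would be that the ALU result is used as a DMEM address instead of being written to `rd` directly.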

::::{figure}
:label: anim-datapath-lw
:alt: "Embedded slides tracing address calculation, memory read, and register write-back on the single-cycle datapath for a load-word instruction."
Expand Down Expand Up @@ -105,15 +108,6 @@ We **do not** need to add additional blocks for stores, but we will need to:
* Upgrade the Immediate Generator to support immediates in S-Type instructions; we encourage you to read [that section](#sec-datapath-immgen) afterwards.
* Wire `R[rs2]` to `wdata` (DMEM input signal).

<!-- ::::{figure}
:label: anim-datapath-sw
:::{iframe} https://view.officeapps.live.com/op/embed.aspx?src=https://github.com/61c-teach/course-notes/raw/refs/heads/main/content/datapath/pptx/datapath-store.pptx
:width: 100%
:title: "Slides tracing through the `sw` Datapath."
:::
The `sw` datapath. Use the menu bar to trace through the animation or download a copy of the PDF/PPTX file.
:::: -->

::::{figure}
:label: anim-datapath-sw
:alt: "Embedded slides tracing address calculation and memory write on the single-cycle datapath for a store-word instruction."
Expand Down
130 changes: 98 additions & 32 deletions content/parallel-tlp/cache-coherency.md
@@ -1,13 +1,12 @@
---
title: "Cache Coherency"
subtitle: Coming soon. Thanks for your patience!
title: "Cache Coherence"
---

(sec-cache-coherency)=
## Learning Outcomes

<!-- * TODO
* TODO -->
* Illustrate a shared memory multiprocessor architecture with caches.
* Define cache coherence misses.

::::{note} 🎥 Lecture Video
:class: dropdown
Expand Down Expand Up @@ -39,52 +38,119 @@ subtitle: Coming soon. Thanks for your patience!

::::

<!-- ## Visuals
## Shared memory multiprocessor (SMP)

Recall that in our [multicore processor architecture](#sec-mimd) we assume a shared memory model to enable multithreaded processing. This model is called a **shared memory multiprocessor** (SMP), which assumes a single physical address space across all processors.[^sap]

[^sap]: Given the shared address space, a more accurate term for shared memory multiprocessor might be shared-_address_ multiprocessor. You may also see the term symmetric multiprocessor, but we digress.

Given our understanding of the [memory hierarchy](#fig-3-memory-hierarchy) and Jim Gray's space-time analogy of [locality](#fig-3-locality), memory is a performance bottleneck even with one processor. Shared memory multiprocessors use caches to reduce bandwidth demands on main memory, as shown in @fig-smp.

:::{figure} images/smp.png
:label: fig-smp
:width: 65%
:alt: "Symmetric multiprocessing diagram: several cores or sockets connect through an interconnect or crossbar to a unified physical memory and optionally a shared last-level cache. Coherence control blocks or snoop filters appear on the path to show one address space visible to all processors."
:alt: "Shared memory multiprocessor diagram: several cores or sockets connect through a bus to a unified physical memory and optionally a shared last-level cache. One physical address space is visible to all processors."

Shared-memory Multiprocessor (SMP) with multiple cores and a single, coherent memory.
:::

:::{figure} images/smp.png
:label: fig-smp
:width: 65%
:alt: "Symmetric multiprocessing diagram: several cores or sockets connect through an interconnect or crossbar to a unified physical memory and optionally a shared last-level cache. Coherence control blocks or snoop filters appear on the path to show one address space visible to all processors."
Notes about @fig-smp:

Shared-memory Multiprocessor (SMP) with multiple cores and a single, coherent memory.
* Each CPU has its own cache(s), e.g., an L1 cache.
* All CPUs communicate with each other and memory through a communication **bus**.
* One bank of memory (DRAM) is shared by all CPUs.

## Cache Coherence Problem

In a [different section](#sec-locks), we discuss how threads running on multiple processors can use locks to synchronize access to shared data across processors. In this section, we discuss an additional problem that arises when we introduce caching: **cache coherence**.

Consider three example memory accesses on a dual-core system. Assume the word `20` is initially in memory @ address `0x5000`:

1. CPU 1 **reads** word @ address `0x5000`.
1. CPU 2 **reads** word @ address `0x5000`.
1. CPU 1 **writes** word `40` @ address `0x5000`.

@fig-cache-coherence-1 shows that accesses 1 and 2, which are reads, trigger compulsory cache misses in both CPU 1's and CPU 2's caches. The two caches must request the corresponding block from memory via the communication bus. Each processor gets a copy of this block (and therefore a copy of the word @ address `0x5000`) and stores the block in its own cache.

CPU 2 reads Mem[0x5000]
:::{figure} images/cache-coherence-1.png
:label: fig-cache-coherence-1
:width: 70%
:alt: "Visual description of a dual-core SMP system. The diagram shows cores connected through a shared bus to a unified physical memory and to I/O. Four directed arrows between the CPU caches and the shared memory unit show the bus traffic for the two memory accesses: (1) CPU 1 reads the word @ address `0x5000`, and (2) CPU 2 reads the word @ address `0x5000`. One set of arrows is labeled request (cache via bus to memory); the other set is labeled response (memory via bus to cache). Each CPU cache has a copy of the word `20` at memory address `0x5000`."

CPU 1 and CPU 2 both read a word @ address `0x5000`. If both caches are cold, these two memory accesses are compulsory cache misses, and the value must be retrieved from shared memory via the shared bus.
:::

:::{figure} images/multiprocessor-cache.png
:label: fig-mp-cache
:width: 65%
:alt: "Multiprocessor memory system where each CPU has private caches that satisfy most loads locally, cutting traffic on the shared bus or interconnect to DRAM. Read and write arrows show hits staying on-chip while misses escalate to memory."
@fig-cache-coherence-2 reveals the issue with access 3, a write. After CPU 1 performs the write, CPU 1's cache is up-to-date, but CPU 2's cache is now stale, and CPU 2 has no way of knowing.

:::{figure} images/cache-coherence-2.png
:label: fig-cache-coherence-2
:width: 70%
:alt: "Dual-core SMP system, continued. Next, CPU 1 performs a memory write to word `40` @ address `0x5000`. Now, CPU 1's cache has a copy of the word `40` at memory address `0x5000`, but CPU 2's cache still has the word `20` at memory address `0x5000`."

Multi-processor Cache to reduce bandwidth demands on main memory.
CPU 1 performs a memory write of the word `40` @ address `0x5000`. In a system without cache coherence, CPU 1 and CPU 2 now have different copies of the same region of memory.
:::
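
The three accesses above can be replayed with a toy model of two private caches that never talk to each other. This is only a sketch under simplifying assumptions: the class name `ToyCache` is hypothetical, each block holds a single word, and writes are assumed to be write-through so that the stale copy is clearly not the fault of main memory being out of date.

```python
class ToyCache:
    """A private cache modeled as a dict from address to word (one word per block)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory (a dict), reached "via the bus"
        self.lines = {}        # this cache's private copies

    def read(self, addr):
        if addr not in self.lines:              # miss: fetch the block from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                 # hit: serve the private copy

    def write(self, addr, word):
        self.lines[addr] = word                 # update only this cache's copy
        self.memory[addr] = word                # write-through (an assumption)


memory = {0x5000: 20}
cache1, cache2 = ToyCache(memory), ToyCache(memory)

first = cache1.read(0x5000)    # access 1: CPU 1 reads, compulsory miss
second = cache2.read(0x5000)   # access 2: CPU 2 reads, compulsory miss
cache1.write(0x5000, 40)       # access 3: CPU 1 writes 40
stale = cache2.read(0x5000)    # CPU 2 hits on its old copy
```

Here `stale` is still `20` even though both CPU 1's cache and main memory hold `40`: nothing told CPU 2's cache that its copy is out of date.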

:::{figure} images/cache-incoherence-read-write.png
:label: fig-cache-inco-read-write
:width: 65%
:alt: "Two-processor diagram with separate private caches referencing the same memory address but holding different values after unsynchronized writes. Directed arrows show each CPU’s read and write paths without a coherence protocol reconciling the copies, leading to inconsistent global memory order."
The last access in our example illustrates that this system is **not cache coherent**. From [Wikipedia](https://en.wikipedia.org/wiki/Cache_coherence):

> In a cache coherent system, if multiple clients have a cached copy of the same region of a shared memory resource, all copies are the same.

P&H defines cache coherence as the aspect that defines _what values_ can be returned by a read. There must be a way of enforcing the "coherency" implied by the phrase: "all copies are the same." We do so using an additional type of cache miss.

## Coherence Miss

:::{note} Review [cache misses](#sec-cache-misses)

Read/write flow of incoherent caches between two processors.
So far, we have discussed three types of cache misses: compulsory miss, capacity miss, and conflict miss.
:::

:::{figure} images/cache-incoherence-result.png
:label: fig-cache-inco-result
:width: 65%
:alt: "Outcome table illustrating values observed by each processor after conflicting writes when caches are incoherent: possible stale reads, divergent copies per core, and lack of a single serial order unless software flushes or barriers are used."
To enforce cache coherence, we introduce a _fourth_ type of cache miss: a **coherence miss**, i.e., a communication miss caused by writes to shared data made by other processors.

Such misses commonly arise in **cache coherence protocols**, which are means of maintaining coherence across multiple processors. For example, a protocol can ensure that a processor has "exclusive access" to a data item by invalidating copies in other caches on a write. Subsequently, a processor that reads from (or writes to) an invalidated copy misses in the cache; this miss is categorized as a coherence miss.

## Snooping Protocols

:::{warning} This section is out of scope

For more information, read P&H 5.10.

Read/write result with cache incoherence.
:::

:::{figure} images/moesi.png
:label: fig-moesi
:width: 90%
:alt: "Message-flow diagram for the MOESI protocol showing cache line states Modified, Owned, Exclusive, Shared, and Invalid, with labeled transitions for read misses, write hits, invalidations, and write-backs between processors and memory."
One version of the _write invalidate_ cache coherence protocol described above is a **snooping protocol**. When any processor accesses memory, use the bus to "snoop"[^snoop] and notify other processors.

[^snoop]: From [Merriam-Webster](https://www.merriam-webster.com/dictionary/snoop): to look or pry especially in a sneaking or meddlesome manner.

Each cache controller "snoops" for write transactions on the common bus. When another processor's block request appears on the bus, the controller checks whether its own cache has a copy:

* If a copy exists and the request is a read, do nothing.
* If a copy exists and the request is a write, invalidate the local copy.
* If a copy does not exist, do nothing.

This snooping protocol permits many processors to have copies of data that are only read, and permits a processor that is writing to have an exclusive copy of the data (because other copies are invalidated).
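
The snooping rules above can be sketched as a small simulation. Everything here is illustrative: the class names `Bus` and `SnoopingCache` are hypothetical, each block holds one word, and writes are assumed to be write-through so that a later miss can simply refetch from memory (real protocols typically use write-back caches and let the owning cache supply the data).

```python
class Bus:
    """Shared bus: broadcasts each block request to every other cache."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def broadcast(self, sender, addr, is_write):
        for cache in self.caches:
            if cache is not sender:
                cache.snoop(addr, is_write)


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}            # addr -> word, valid copies only
        self.invalidated = set()   # addrs whose copy a snoop invalidated
        self.coherence_misses = 0
        bus.caches.append(self)

    def snoop(self, addr, is_write):
        # Copy + write request: invalidate. Read request or no copy: do nothing.
        if is_write and addr in self.lines:
            del self.lines[addr]
            self.invalidated.add(addr)

    def _count_if_coherence_miss(self, addr):
        if addr in self.invalidated:
            self.coherence_misses += 1
            self.invalidated.discard(addr)

    def read(self, addr):
        if addr not in self.lines:                       # miss
            self._count_if_coherence_miss(addr)
            self.bus.broadcast(self, addr, is_write=False)
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, word):
        self._count_if_coherence_miss(addr)
        self.bus.broadcast(self, addr, is_write=True)    # others invalidate
        self.lines[addr] = word                          # now the only cached copy
        self.bus.memory[addr] = word                     # write-through (assumption)


bus = Bus({0x5000: 20})
cpu1, cpu2 = SnoopingCache(bus), SnoopingCache(bus)
cpu1.read(0x5000)              # access 1
cpu2.read(0x5000)              # access 2
cpu1.write(0x5000, 40)         # access 3: CPU 2's copy is invalidated
value = cpu2.read(0x5000)      # coherence miss, refetches the new word
```

Replaying the earlier three accesses, CPU 2's next read misses because its copy was invalidated, so it refetches `40` instead of returning the stale `20`. That miss is the one counted in `cpu2.coherence_misses`.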

### Details

:::{warning} This section is really out of scope

For more information, take an advanced computer architecture course.

:::

**MOESI** is a full cache coherence protocol whose five states cover the states used in other cache coherence protocols: Modified, Owned, Exclusive, Shared, and Invalid.

For each block in a cache, track state:

* **Shared**: up-to-date data, other caches may have a copy
* **Modified**: up-to-date data, changed (dirty), no other cache has a copy, OK to write, memory out-of-date (i.e., write back)
* **Invalid**: not in cache (from before: valid flag)

Two enhancements:

* **Exclusive**: up-to-date data, no other cache has a copy, OK to write, memory up-to-date. Avoids writing to memory if block replaced, and supplies data on read instead of going to memory.

* **Owned**: up-to-date data, other caches may have a copy (they must be in Shared state). This cache is one of several with a valid copy of the cache line, but has the exclusive right to make changes to it. It must broadcast those changes to all other caches sharing the line. The introduction of the Owned state allows dirty sharing of data, i.e., a modified cache block can be moved around various caches without updating main memory. The cache line may be changed to the Modified state after invalidating all shared copies, or changed to the Shared state by writing the modifications back to main memory. Owned cache lines must respond to a snoop request with data.

UC Berkeley has explored various snooping[^snoopy] protocols; see an advanced computer architecture course for more information.

MOESI Cache Protocol.
::: -->
[^snoopy]: Sometimes you will see snooping protocols called Snoopy Protocols and snooping buses called Snoopy Buses, like the [Peanuts](https://en.wikipedia.org/wiki/Peanuts) character. [source](https://people.eecs.berkeley.edu/~pattrsn/252F96/Lecture18.pdf)
Binary file added content/parallel-tlp/images/cache-coherence-1.png
Binary file added content/parallel-tlp/images/cache-coherence-2.png
Binary file not shown.
Binary file not shown.
Binary file removed content/parallel-tlp/images/multiprocessor-cache.png
Binary file not shown.
Binary file modified content/parallel-tlp/images/smp.png
1 change: 1 addition & 0 deletions content/parallel-tlp/index.md
Expand Up @@ -2,6 +2,7 @@
title: "MIMD Architectures"
---

(sec-mimd)=
## Learning Outcomes

* Identify the key components of a multicore processor: cores that execute streams of instructions independently but share memory.
Expand Down