L-Cache Performance Experiments
A few days ago, I read Thoughts on Performance Hints from Jeff and wrote a blog post about it.
One specific tip from those hints that I wanted to play around with is L-Cache performance.
I’ve read many articles about L-Cache, but I’ve never actually written any demos or tests for it (weird, right? Especially since I have a lot of demo code on my GitHub). Most of those articles are written in C. If I wanted to test this myself in Go or Lisp, I would likely need to disable the Garbage Collector (GC) to observe “pure” hardware behavior.
Then I realized I could write it in Rust, which is well-suited for this. So, here we go.
My CPU Details
First, I wanted to find my Mac’s CPU details so I could write tests depending on the cache line and cache sizes.
I can get these details using the following commands:
sysctl -n hw.l1dcachesize # 65536 => 64KB
sysctl -n hw.cachelinesize # 128
I’m not entirely sure if this is the correct answer, because Google provides the following specs:
Efficiency (E) Cores:
- L1 Instruction Cache (L1i): 128 KB per core
- L1 Data Cache (L1d): 64 KB per core

Performance (P) Cores:
- L1 Instruction Cache (L1i): 192 KB per core, 12-way set-associative
- L1 Data Cache (L1d): 128 KB per core, 8-way set-associative
If Google is right, it looks like sysctl is reporting the E-core details. However, based on the tests below, it seems they are running on the P-cores. Why? When I run taskpolicy -b cargo bench (which forces the background/E-cores), it is about 400% slower than the default. Therefore, I assume cargo bench runs on P-cores by default.
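As a side note, newer macOS releases also expose per-cluster values through the hw.perflevel* sysctl keys; as far as I know, perflevel0 is the P-cluster and perflevel1 the E-cluster, but treat the exact key names as an assumption and check the output of sysctl hw on your machine:
sysctl -n hw.perflevel0.l1dcachesize # P-cluster L1d (key name assumed)
sysctl -n hw.perflevel1.l1dcachesize # E-cluster L1d (key name assumed)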
Test Cases
While writing the tests, I essentially ignored the CPU details I gathered since the answers seemed inconsistent. The only thing I am fairly sure of is that cargo bench runs on the P-cores. This shouldn’t affect the relative speed differences in cache access. Even if I don’t know the exact cache line size, a performance hit from an L-cache miss will still be obvious.
Sequential Access by Steps
I created a sequence and accessed the elements using different steps:
pub fn sequential_access_step4(data: &mut [i32]) {
for i in (0..data.len()).step_by(4) {
data[i] = 1;
}
}
pub fn sequential_access_step16(data: &mut [i32]) {
for i in (0..data.len()).step_by(16) {
data[i] = 1;
}
}
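For reference, a minimal Criterion harness along the following lines can drive these two functions. This is an illustrative sketch rather than the exact bench file, and it assumes “64M” means 64 × 1024 × 1024 i32 elements:
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Assumes sequential_access_step4 / sequential_access_step16 are imported
// from the crate under test.
fn bench_sequential(c: &mut Criterion) {
    let mut group = c.benchmark_group("Memory Access");
    let len = 64 * 1024 * 1024; // assumed: 64M i32 elements (~256 MB)
    let mut data = vec![0i32; len];

    group.bench_function("sequential_access_step_by_4", |b| {
        b.iter(|| sequential_access_step4(black_box(data.as_mut_slice())))
    });
    group.bench_function("sequential_access_step_by_16", |b| {
        b.iter(|| sequential_access_step16(black_box(data.as_mut_slice())))
    });
    group.finish();
}

criterion_group!(benches, bench_sequential);
criterion_main!(benches);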
I benchmarked these two functions using a 64M Vec<i32>. Intuitively, the second one should be four times faster than the first, right? Here are the results:
Memory Access/sequential_access_step_by_4
time: [5.2634 ms 5.2797 ms 5.2990 ms]
Memory Access/sequential_access_step_by_16
time: [4.4792 ms 4.4846 ms 4.4915 ms]
It is actually not much faster.
The reason, I believe, is that an i32 is 4 bytes and the CPU loads a full 128-byte cache line at a time. Since 128 / 4 = 32, one cache line holds 32 elements, so both a step of 4 and a step of 16 still touch every cache line. The two functions pull the same number of cache lines from memory, and that traffic, not the number of stores, is what dominates the runtime.
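To double-check that intuition, here is a small, self-contained helper (just an illustration, assuming 128-byte cache lines and 4-byte elements) that counts how many distinct cache lines each stride touches:
use std::collections::BTreeSet;

// Count the distinct cache lines an access pattern touches,
// assuming 128-byte lines and 4-byte i32 elements (32 elements per line).
fn lines_touched(len: usize, step: usize) -> usize {
    let elems_per_line = 128 / std::mem::size_of::<i32>(); // 32
    (0..len)
        .step_by(step)
        .map(|i| i / elems_per_line)
        .collect::<BTreeSet<_>>()
        .len()
}

fn main() {
    let len = 1 << 20; // 1M elements is enough to show the point
    // A step of 4 and a step of 16 both land in every cache line,
    // so the amount of memory pulled through the cache is the same.
    assert_eq!(lines_touched(len, 4), lines_touched(len, 16));
    println!("both strides touch {} distinct lines", lines_touched(len, 4));
}
Only once the step exceeds 32 elements (a full line) would a larger step actually start to skip cache lines.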
Memory Access with Increments
This function accesses memory with a specific step (increment). I ran it with several count/increment pairs.
pub fn cache_line_hit_with_increment(count: usize, increment: usize) {
let size = count * increment;
let mut data: Vec<i32> = vec![0; size];
for j in (0..size).step_by(increment) {
data[j] += 1;
}
}
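The pairs in the table below can be driven by a parameterized Criterion harness along these lines (again, a sketch of how the bench might be wired rather than the exact code):
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// Assumes cache_line_hit_with_increment is imported from the crate under test.
fn bench_increments(c: &mut Criterion) {
    let mut group = c.benchmark_group("Cache Line Hit");
    for &increment in &[1usize, 16, 512, 1024] {
        for count in 1..=18usize {
            group.bench_with_input(
                BenchmarkId::new(format!("increment_{}", increment), count),
                &(count, increment),
                |b, &(count, increment)| b.iter(|| cache_line_hit_with_increment(count, increment)),
            );
        }
    }
    group.finish();
}

criterion_group!(benches, bench_increments);
criterion_main!(benches);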
The results are:
| count \ increment | 1 | 16 | 512 | 1024 |
|---|---|---|---|---|
| 1 | 14.623 ns | 20.203 ns | 39.497 ns | 60.158 ns |
| 2 | 14.959 ns | 16.719 ns | 61.666 ns | 117.56 ns |
| 3 | 15.316 ns | 16.682 ns | 86.723 ns | 155.49 ns |
| 4 | 15.523 ns | 17.043 ns | 116.55 ns | 191.88 ns |
| 5 | 18.830 ns | 26.392 ns | 129.64 ns | 233.49 ns |
| 6 | 19.095 ns | 27.074 ns | 156.91 ns | 294.41 ns |
| 7 | 19.339 ns | 27.871 ns | 172.78 ns | 329.69 ns |
| 8 | 19.754 ns | 28.192 ns | 190.90 ns | 432.96 ns |
| 9 | 20.547 ns | 28.936 ns | 240.19 ns | 1327.3 ns |
| 10 | 20.838 ns | 29.175 ns | 249.96 ns | 1330.2 ns |
| 11 | 21.312 ns | 30.231 ns | 280.66 ns | 1402.7 ns |
| 12 | 21.545 ns | 30.510 ns | 283.95 ns | 1402.9 ns |
| 13 | 20.448 ns | 31.647 ns | 331.33 ns | 1466.5 ns |
| 14 | 20.798 ns | 31.434 ns | 324.00 ns | 1487.9 ns |
| 15 | 21.119 ns | 30.458 ns | 456.47 ns | 1474.7 ns |
| 16 | 21.363 ns | 30.754 ns | 397.97 ns | 1471.9 ns |
| 17 | 21.654 ns | 37.488 ns | 1322.7 ns | 1589.9 ns |
| 18 | 21.879 ns | 37.473 ns | 1503.8 ns | 1709.9 ns |
The timing increases significantly at 9 * 1024 (1024 * 4 * 9 = 36 KB) and 17 * 512 (512 * 4 * 17 = 34 KB). My assumption is that 32 KB is the effective cache size, and the time increases because the CPU needs to reload the cache. (However, the P-core should have 128 KB per core rather than 32 KB; I still don’t know why this discrepancy exists).
Matrix Hit
Since I have done a lot of Advent of Code (AOC) this year, this matrix test case was quite fun. I actually found performance issues with matrices previously. That post was about memory allocation performance; this time, it’s about memory access performance.
pub fn matrix_iter_row() -> i32 {
let col_num = 512;
let row_num = 1024;
let matrix = vec![vec![0i32; col_num]; row_num];
let mut sum: i32 = 0;
for r in 0..row_num {
for c in 0..col_num {
sum = sum.wrapping_add(matrix[r][c]);
}
}
sum
}
pub fn matrix_iter_col() -> i32 {
let col_num = 512;
let row_num = 1024;
let matrix = vec![vec![0i32; col_num]; row_num];
let mut sum: i32 = 0;
for c in 0..col_num {
for r in 0..row_num {
sum = sum.wrapping_add(matrix[r][c]);
}
}
sum
}
The performance results are:
Matrix Hit/matrix_iter_row => 194.06 µs
Matrix Hit/matrix_iter_col => 437.18 µs
The gap comes from the access pattern: iterating row by row uses all 32 i32 values in each 128-byte cache line before moving on, while iterating column by column strides 2 KB (512 × 4 bytes) between consecutive accesses. Every column access lands in a different cache line (and a different row Vec allocation), and the loop does not return to that line until the next column, 1024 accesses later, by which point it may already have been evicted.
Wrap Up
It was fun revealing these L-cache performance issues. I have to admit that I might never encounter this level of performance optimization in my daily work (at least for now).
But it’s always fun to learn something new, and these results actually support the experience shared in Jeff’s article.