That still sounds to me like the test is a bit off. If you've already
recorded the start time of each thread, then the time that the threads
are blocked from running would be included in the per-thread rate, thus
causing it to appear much slower.
No, because start/end times are taken around specific operations like pre-faulting or memcpy; they don't tell you what a thread is doing relative to the other threads. A thread can be blocked for some time, then get scheduled and only then take its start time. How would that latency be accounted for, given that it occurred before the start time was taken?
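To make the blind spot concrete, here is a minimal sketch of the pattern I mean (the lock, buffer size, and names are illustrative, not the actual test): each thread times only its own memcpy, so any time spent waiting on the lock before clock_gettime(start) never shows up in its reported rate.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (64UL << 20)   /* 64 MiB per thread (illustrative size) */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static double elapsed_sec(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

struct work {
    char *src, *dst;
    double rate;                /* bytes/sec as seen by this thread */
};

static void *worker(void *arg)
{
    struct work *w = arg;
    struct timespec start, end;

    pthread_mutex_lock(&lock);               /* time blocked here is never measured...  */
    clock_gettime(CLOCK_MONOTONIC, &start);  /* ...because the clock starts only after it */
    memcpy(w->dst, w->src, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_mutex_unlock(&lock);

    w->rate = BUF_SIZE / elapsed_sec(start, end);
    return NULL;
}

int main(void)
{
    pthread_t tid[2];
    struct work w[2];

    for (int i = 0; i < 2; i++) {
        w[i].src = malloc(BUF_SIZE);
        w[i].dst = malloc(BUF_SIZE);
        memset(w[i].src, 1, BUF_SIZE);       /* pre-fault the pages */
        memset(w[i].dst, 1, BUF_SIZE);
        pthread_create(&tid[i], NULL, worker, &w[i]);
    }
    for (int i = 0; i < 2; i++) {
        pthread_join(tid[i], NULL);
        /* both threads report near full bus speed even though they ran serialized */
        printf("thread %d: %.2f GiB/sec\n", i, w[i].rate / (1UL << 30));
    }
    return 0;
}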
Maybe think of it with a simple example: let's say the memory bus has a maximum bandwidth of 10 GiB/sec and you have two threads, A and B, each doing a memcpy of 10 GiB.
Scenario 1 - both threads run in parallel and share memory bus bandwidth:
------> time in seconds
AA    thread A runs for 2 seconds and does memcpy at 5 GiB/sec
BB    thread B runs for 2 seconds and does memcpy at 5 GiB/sec
Aggregate throughput = (2 threads * 10 GiB) / 2 seconds = 10 GiB/sec
Scenario 2 - each thread runs in sequence and uses the full memory bus bandwidth:
------> time in seconds
A     thread A runs for 1 second and does memcpy at 10 GiB/sec
 L    lock contention causes latency of 1 second
  B   thread B runs for 1 second and does memcpy at 10 GiB/sec
Aggregate throughput = (2 threads * 10 GiB) / 3 seconds ≈ 6.7 GiB/sec
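And the same arithmetic in runnable form, dividing total bytes copied by the wall-clock span of the whole run rather than summing per-thread rates (a toy calculation, not benchmark code):

#include <stdio.h>

int main(void)
{
    double gib_per_thread = 10.0, nthreads = 2.0;

    double wall1 = 2.0;               /* scenario 1: threads overlap fully   */
    double wall2 = 1.0 + 1.0 + 1.0;   /* scenario 2: copy + lock wait + copy */

    /* aggregate = total GiB / wall-clock seconds, so scenario 2's
     * serialization lowers the result even though each thread's own
     * memcpy ran at full bus speed */
    printf("scenario 1: %.1f GiB/sec\n", nthreads * gib_per_thread / wall1); /* 10.0 */
    printf("scenario 2: %.1f GiB/sec\n", nthreads * gib_per_thread / wall2); /* 6.7  */
    return 0;
}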