On 11/20/2018 4:54 AM, Sad Clouds wrote:
> On Mon, 19 Nov 2018 22:10:41 -0500
> Eric Hawicz <erh%nimenees.com@localhost> wrote:
>
>> The only way I can see that you'd end up with a total transfer rate
>> around 5GB/s is if you didn't actually manage to get the threads
>> running in parallel, but instead have perhaps 2-3 running at a time,
>> then the next 2-3 don't even start until those first few finish.
>
> That is exactly what happens, other threads are blocked from running,
> because NetBSD VM subsystem that allocates pages is hitting single
> lock and causing contention.

That still sounds to me like the test is a bit off. If you've already recorded the start time of each thread, then the time that the threads are blocked from running would be included in the per-thread rate, thus causing it to appear much slower.

Originally, you said:

> "The tool creates a number of concurrent threads, each threads
> allocates 1 GiB memory segment and a 1 KiB transfer block. It
> pre-faults every page by writing a single byte at every 4 KiB offset.
> It then calls memcpy () in a loop, copying 1 KiB block until 1 GiB
> memory segment is filled."

So, I'm imagining each thread has code that does the following sequence of operations:

* Allocate 1GB memory
* Pre-fault each page
* Notify that we're ready to start and wait until all threads are ready
* Record this thread's start time
* Perform memcpy
* Record this thread's end time

Is that what you're doing?
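
For concreteness, the sequence above could be sketched in C roughly as follows. This is only an illustration under assumptions, not the actual tool: the names `run_benchmark`, `worker_main`, `struct gate` and `struct worker` are made up for this example, the page size is assumed to be 4 KiB, and the "wait until all threads are ready" step is done with a mutex and condition variable (a `pthread_barrier_t` would serve equally well). The point is that each thread's clock starts only after every thread has finished allocating and pre-faulting, so time spent in the VM subsystem during setup is kept out of the per-thread rate.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ASSUMED_PAGE_SIZE 4096  /* assumption: 4 KiB pages */

struct gate {                   /* hypothetical start line shared by all threads */
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int ready;                  /* threads that have finished setup */
    int go;                     /* set once every started thread is ready */
};

struct worker {
    pthread_t tid;
    struct gate *gate;
    size_t seg_size, blk_size;
    double seconds;             /* time spent in the copy loop only */
    int failed;
};

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    struct gate *g = w->gate;
    /* Calling through a volatile pointer keeps the compiler from eliding the copies. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;

    /* Allocate the segment and transfer block, then pre-fault each page. */
    char *seg = malloc(w->seg_size);
    char *blk = malloc(w->blk_size);
    w->failed = (seg == NULL || blk == NULL);
    if (!w->failed) {
        memset(blk, 0xA5, w->blk_size);
        for (size_t off = 0; off < w->seg_size; off += ASSUMED_PAGE_SIZE)
            ((volatile char *)seg)[off] = 1;
    }

    /* Notify that we're ready, then wait until all threads are ready. */
    pthread_mutex_lock(&g->lock);
    g->ready++;
    pthread_cond_broadcast(&g->cond);
    while (!g->go)
        pthread_cond_wait(&g->cond, &g->lock);
    pthread_mutex_unlock(&g->lock);

    if (!w->failed) {
        double start = now_sec();               /* this thread's start time */
        for (size_t off = 0; off + w->blk_size <= w->seg_size; off += w->blk_size)
            copy(seg + off, blk, w->blk_size);
        w->seconds = now_sec() - start;         /* this thread's end time */
    }
    free(seg);
    free(blk);
    return NULL;
}

/* Fills rates[i] with bytes per second for thread i; returns 0 on success, -1 on error. */
int run_benchmark(int nthreads, size_t seg_size, size_t blk_size, double *rates)
{
    if (nthreads <= 0 || blk_size == 0 || seg_size < blk_size || rates == NULL)
        return -1;
    struct worker *w = calloc((size_t)nthreads, sizeof *w);
    if (w == NULL)
        return -1;
    struct gate g = { .ready = 0, .go = 0 };
    pthread_mutex_init(&g.lock, NULL);
    pthread_cond_init(&g.cond, NULL);

    int started = 0;
    for (int i = 0; i < nthreads; i++) {
        w[i].gate = &g;
        w[i].seg_size = seg_size;
        w[i].blk_size = blk_size;
        if (pthread_create(&w[i].tid, NULL, worker_main, &w[i]) != 0)
            break;
        started++;
    }

    /* Release the threads only once every started thread has finished setup. */
    pthread_mutex_lock(&g.lock);
    while (g.ready < started)
        pthread_cond_wait(&g.cond, &g.lock);
    g.go = 1;
    pthread_cond_broadcast(&g.cond);
    pthread_mutex_unlock(&g.lock);

    int rc = (started == nthreads) ? 0 : -1;
    double bytes = (double)((seg_size / blk_size) * blk_size);
    for (int i = 0; i < started; i++) {
        pthread_join(w[i].tid, NULL);
        rates[i] = (!w[i].failed && w[i].seconds > 0.0) ? bytes / w[i].seconds : 0.0;
        if (w[i].failed)
            rc = -1;
    }
    pthread_cond_destroy(&g.cond);
    pthread_mutex_destroy(&g.lock);
    free(w);
    return rc;
}
```

A call such as `run_benchmark(8, (size_t)1 << 30, 1024, rates)` would correspond to eight threads with the 1 GiB segment and 1 KiB block described above. If the per-thread rates from a loop structured like this still add up to far less than expected, that would suggest the slowdown is happening inside the timed copy loop rather than during allocation.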