NetBSD-Users archive
Testing memory performance
I'm developing a small tool that tests memory performance/throughput
across different environments. I'm noticing performance issues on
NetBSD-8. The details are below.
The tool creates a number of concurrent threads. Each thread allocates
a 1 GiB memory segment and a 1 KiB transfer block. It pre-faults every
page by writing a single byte at every 4 KiB offset. It then calls
memcpy() in a loop, copying the 1 KiB block until the 1 GiB memory
segment is filled.
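For reference, the per-thread work described above can be sketched
roughly as follows. This is a minimal illustration, not the tool's actual
source: the function names (prefault, fill_segment, mib_per_sec) are
made up for this example, and thread creation and timing calls are left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Write one byte at every page_size offset so that each page of the
 * segment is faulted in before the timed copy loop starts. */
void prefault(unsigned char *seg, size_t seg_size, size_t page_size)
{
    assert(page_size > 0);
    for (size_t off = 0; off < seg_size; off += page_size)
        seg[off] = 1;
}

/* Copy the transfer block into consecutive positions of the segment
 * until the segment is full. Returns the number of bytes copied.
 * Assumes seg_size is a multiple of block_size (hypothetical
 * restriction for this sketch). */
size_t fill_segment(unsigned char *seg, size_t seg_size,
                    const unsigned char *block, size_t block_size)
{
    assert(block_size > 0 && seg_size % block_size == 0);
    size_t copied = 0;
    while (copied < seg_size) {
        memcpy(seg + copied, block, block_size);
        copied += block_size;
    }
    return copied;
}

/* Convert a byte count and an elapsed time in seconds to MiB/sec. */
double mib_per_sec(size_t bytes, double seconds)
{
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}
```

Each thread would allocate its segment (for example with malloc), call
prefault() with a page size of 4096, then time a call to fill_segment()
with a 1024-byte block and pass the result to mib_per_sec().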
NetBSD and Linux have different versions of GCC, but I was hoping the
following flags would keep optimization differences to a minimum:
gcc -O1 -fno-builtin -march=westmere -Wall -pedantic -std=c11 \
-D_FILE_OFFSET_BITS=64 -D_XOPEN_SOURCE=700 -D_DEFAULT_SOURCE
The hardware has 48 GiB of RAM. For this test I'm using 16 threads x
1 GiB = 16 GiB total.
I'm seeing several issues on NetBSD:
1. When each thread calls mlock() to lock its pages, munlock()
sometimes fails with ENOMEM when unlocking them. It doesn't happen
every time, but it happens frequently, and I don't know what
specifically causes munlock() to fail. The same code works correctly
on Linux.
2. Performance with 16 concurrent threads is rather bad. Most threads
are idle 60% of the time (on Linux they are 100% busy), which suggests
some sort of contention somewhere. On NetBSD the total throughput with
16 threads is around 5818 MiB/sec; on Linux it is around 15314 MiB/sec.
3. This issue affects both NetBSD and Linux. When using mlock() to
lock memory pages before issuing memcpy(), overall throughput drops
significantly. Threads seem to be serialized: while a few threads are
running, the others are blocked for some reason. I don't know why
mlock() has this effect.
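One way to get more information about the munlock() failure in item 1
is to wrap the calls so that the exact range and errno are logged. The
sketch below is not the tool's code and is not a diagnosis. The helper
names (page_align_range, lock_range) are hypothetical. It rounds the
range out to page boundaries, since POSIX allows an implementation to
require a page-aligned address, and POSIX documents ENOMEM for ranges
that do not correspond to valid mapped pages.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Expand [addr, addr + len) outward to page boundaries.
 * page_size must be a power of two. */
void page_align_range(const void *addr, size_t len, size_t page_size,
                      uintptr_t *start, size_t *aligned_len)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    uintptr_t mask = ~(uintptr_t)(page_size - 1);
    uintptr_t lo = (uintptr_t)addr & mask;
    uintptr_t hi = ((uintptr_t)addr + len + page_size - 1) & mask;
    *start = lo;
    *aligned_len = (size_t)(hi - lo);
}

/* Lock (lock != 0) or unlock (lock == 0) the pages covering the range.
 * Returns 0 on success, or the errno value after logging the call. */
int lock_range(void *addr, size_t len, int lock)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return EINVAL;

    uintptr_t start;
    size_t alen;
    page_align_range(addr, len, (size_t)ps, &start, &alen);

    int rc = lock ? mlock((void *)start, alen)
                  : munlock((void *)start, alen);
    if (rc != 0) {
        int saved = errno;  /* fprintf may change errno */
        fprintf(stderr, "%s(%p, %zu) failed: %s\n",
                lock ? "mlock" : "munlock",
                (void *)start, alen, strerror(saved));
        return saved;
    }
    return 0;
}
```

Comparing the logged ranges of failing and successful munlock() calls
may show whether the failures correlate with particular addresses or
with calls racing against other threads.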
If anyone has any thoughts on this, please let me know.
Below are details of the SMP architecture and the test results.
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Model name: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Stepping: 2
CPU MHz: 1596.000
CPU max MHz: 2395.0000
CPU min MHz: 1596.0000
BogoMIPS: 4787.71
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-3,8-11
NUMA node1 CPU(s): 4-7,12-15
NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
Thread 2 preflt=13504.86 msec, memcpy=2874.69 MiB/sec
Thread 7 preflt=14277.53 msec, memcpy=2891.39 MiB/sec
Thread 3 preflt=14765.99 msec, memcpy=2553.72 MiB/sec
Thread 14 preflt=15036.90 msec, memcpy=2288.19 MiB/sec
Thread 1 preflt=15126.01 msec, memcpy=2315.53 MiB/sec
Thread 12 preflt=15333.82 msec, memcpy=2071.52 MiB/sec
Thread 5 preflt=15603.25 msec, memcpy=1880.64 MiB/sec
Thread 6 preflt=15704.05 msec, memcpy=1662.66 MiB/sec
Thread 10 preflt=15693.48 msec, memcpy=1642.44 MiB/sec
Thread 4 preflt=15571.64 msec, memcpy=1557.73 MiB/sec
Thread 15 preflt=15574.60 msec, memcpy=1571.76 MiB/sec
Thread 9 preflt=15750.08 msec, memcpy=2170.44 MiB/sec
Thread 13 preflt=15588.69 msec, memcpy=1900.24 MiB/sec
Thread 8 preflt=15587.50 msec, memcpy=2043.66 MiB/sec
Thread 16 preflt=15265.48 msec, memcpy=1884.74 MiB/sec
Thread 11 preflt=15294.87 msec, memcpy=2272.75 MiB/sec
Total transfer rate: 5817.56 MiB/sec
NetBSD: 16 threads x 1 GiB, using 1 KiB memcpy size, with mlock:
Thread 2 preflt=5.27 msec, memcpy=2595.67 MiB/sec
Thread 3 preflt=5.37 msec, memcpy=2550.90 MiB/sec
Thread 16 preflt=5.02 msec, memcpy=2770.11 MiB/sec
Thread 4 preflt=4.12 msec, memcpy=3209.06 MiB/sec
Thread 15 preflt=5.31 msec, memcpy=2496.82 MiB/sec
Thread 13 preflt=7.46 msec, memcpy=3083.72 MiB/sec
Thread 5 preflt=5.49 msec, memcpy=2766.81 MiB/sec
Thread 14 preflt=6.94 msec, memcpy=2574.98 MiB/sec
Thread 8 preflt=6.53 msec, memcpy=2201.47 MiB/sec
Thread 12 preflt=4.90 msec, memcpy=2814.79 MiB/sec
Thread 10 preflt=4.41 msec, memcpy=2615.27 MiB/sec
Thread 6 preflt=6.18 msec, memcpy=2844.57 MiB/sec
Thread 9 preflt=5.38 msec, memcpy=2976.05 MiB/sec
Thread 7 preflt=4.81 msec, memcpy=2828.54 MiB/sec
Thread 11 preflt=5.10 msec, memcpy=2778.69 MiB/sec
Thread 1 preflt=3.84 msec, memcpy=3229.88 MiB/sec
Total transfer rate: 3789.33 MiB/sec
Linux: 16 threads x 1 GiB, using 1 KiB memcpy size, no mlock:
Thread 5 preflt=1122.06 msec, memcpy=990.24 MiB/sec
Thread 2 preflt=1137.94 msec, memcpy=990.41 MiB/sec
Thread 15 preflt=1125.65 msec, memcpy=982.23 MiB/sec
Thread 4 preflt=1130.02 msec, memcpy=981.37 MiB/sec
Thread 9 preflt=1130.47 msec, memcpy=982.23 MiB/sec
Thread 13 preflt=1127.70 msec, memcpy=982.00 MiB/sec
Thread 3 preflt=1136.35 msec, memcpy=985.89 MiB/sec
Thread 12 preflt=1133.20 msec, memcpy=985.05 MiB/sec
Thread 8 preflt=1136.61 msec, memcpy=985.21 MiB/sec
Thread 11 preflt=1147.40 msec, memcpy=989.12 MiB/sec
Thread 14 preflt=1137.01 msec, memcpy=980.20 MiB/sec
Thread 7 preflt=1140.52 msec, memcpy=980.16 MiB/sec
Thread 6 preflt=1142.21 msec, memcpy=981.06 MiB/sec
Thread 10 preflt=1143.08 msec, memcpy=982.90 MiB/sec
Thread 16 preflt=1146.96 msec, memcpy=988.34 MiB/sec
Thread 1 preflt=1150.99 msec, memcpy=983.68 MiB/sec
Total transfer rate: 15314.12 MiB/sec
Linux: 16 threads x 1 GiB, using 1 KiB memcpy size, with mlock:
Thread 5 preflt=15.72 msec, memcpy=1555.03 MiB/sec
Thread 4 preflt=7.49 msec, memcpy=1548.15 MiB/sec
Thread 3 preflt=15.07 msec, memcpy=1471.69 MiB/sec
Thread 2 preflt=15.98 msec, memcpy=1517.09 MiB/sec
Thread 1 preflt=16.04 msec, memcpy=1533.20 MiB/sec
Thread 6 preflt=4.13 msec, memcpy=5191.23 MiB/sec
Thread 7 preflt=4.03 msec, memcpy=5825.18 MiB/sec
Thread 8 preflt=4.19 msec, memcpy=5265.08 MiB/sec
Thread 10 preflt=5.64 msec, memcpy=3359.36 MiB/sec
Thread 9 preflt=5.68 msec, memcpy=3354.28 MiB/sec
Thread 11 preflt=4.21 msec, memcpy=5255.38 MiB/sec
Thread 12 preflt=4.04 msec, memcpy=5250.94 MiB/sec
Thread 13 preflt=4.73 msec, memcpy=4224.99 MiB/sec
Thread 15 preflt=5.61 msec, memcpy=3311.98 MiB/sec
Thread 14 preflt=5.69 msec, memcpy=3312.76 MiB/sec
Thread 16 preflt=3.88 msec, memcpy=6158.48 MiB/sec
Total transfer rate: 2800.76 MiB/sec