Certainly faster writing. Read speed is about the same for the EVO (on real blocks of incompressible data, not the imaginary compressible or zeroed blocks that they use to report their 'maximum').

XPoint over NVMe has only two metrics that people need to know about to understand how it fits into the ethos: (1) More durability, up to 33,000 rewrites apparently (many people have had to calculate it; Intel refuses to say outright what it is because it is so much lower than what they originally said it would be). (2) Lower latency.

So, for example, NVMe devices using Intel's XPoint have an advertised latency of around 10uS. That is, you submit a READ request, and 10uS later you have the data in hand. The 960 EVO, which I have one around here somewhere... ah, there it is... the 960 EVO has a read latency of around 87uS.

This is called the QD1 latency. It does not translate to the full bandwidth of the device, because you can queue multiple commands to the device and pipeline the responses. In fact, a normal filesystem sequential read always queues read-ahead I/O, so even an open/read*/close sequence generally operates at around QD4 (4 read commands in progress at once) and not QD1.
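As a rough illustration of why QD1 latency does not cap throughput, here is a minimal sketch using Little's law (requests in flight = throughput x latency). The 87uS and 10uS figures are the ones quoted above; holding latency constant as the queue depth rises is a simplifying assumption, and the function name is made up for this example.

```python
def reads_per_sec(queue_depth: int, latency_us: float) -> float:
    """Little's law: throughput = requests in flight / time per request.

    Assumes the per-request latency stays the same as queue depth grows,
    which real devices only approximate at shallow queue depths.
    """
    return queue_depth / (latency_us * 1e-6)


# Latency figures quoted in the post: ~87uS (960 EVO), ~10uS (advertised XPoint).
print(f"QD1 @ 87uS: {reads_per_sec(1, 87.0):9.0f} reads/s")
print(f"QD4 @ 87uS: {reads_per_sec(4, 87.0):9.0f} reads/s")
print(f"QD1 @ 10uS: {reads_per_sec(1, 10.0):9.0f} reads/s")
```

In this model, four pipelined requests at 87uS each give roughly four times the QD1 transaction rate, without the device's latency improving at all.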

Here's the 960 EVO and some randread tests on it at QD1 and QD4.

nvme1: mem 0xc7500000-0xc7503fff irq 32 at device 0.0 on pci2

nvme1: mapped 8 MSIX IRQs

nvme1: NVME Version 1.2 maxqe=16384 caps=00f000203c033fff

nvme1: Model Samsung_SSD_960_EVO_250GB BaseSerial S3ESNX0J219064Y nscount=1

nvme1: Request 64/32 queues, Returns 8/8 queues, rw-sep map (8, 8)

nvme1: Interrupt Coalesce: 100uS / 4 qentries

nvme1: Disk nvme1 ns=1 blksize=512 lbacnt=488397168 cap=232GB serno=S3ESNX0J219064Y-1

(/dev/nvme1s1b is a partition filled with incompressible data)

xeon126# randread /dev/nvme1s1b 4096 100 1

device /dev/nvme1s1b bufsize 4096 limit 16.000GB nprocs 1

11737/s avg= 85.20uS bw=48.07 MB/s lo=66.22uS, hi=139.77uS stddev=7.50uS

11458/s avg= 87.28uS bw=46.92 MB/s lo=68.50uS, hi=154.20uS stddev=7.01uS

11469/s avg= 87.19uS bw=46.98 MB/s lo=69.97uS, hi=151.97uS stddev=6.95uS

11477/s avg= 87.13uS bw=47.01 MB/s lo=69.31uS, hi=158.03uS stddev=7.03uS
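For anyone who wants to try something similar without the randread tool, a QD1 random-read test is essentially a loop that issues one aligned read at a random offset, waits for it, and records the time. The sketch below is not the randread program used here, just a hypothetical Python approximation. It assumes a Unix-like system where `os.pread` is available and where seeking to the end of the target reports its size. Reads also go through the OS cache, so on a regular file it will mostly measure memory unless the data is much larger than RAM.

```python
import os
import random
import statistics
import time


def randread_qd1(path: str, bufsize: int = 4096, count: int = 1000) -> dict:
    """Time `count` block-aligned random reads, one outstanding at a time (QD1)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # assumption: works for this target
        nblocks = size // bufsize
        if nblocks == 0:
            raise ValueError("target is smaller than one block")
        samples_us = []
        for _ in range(count):
            offset = random.randrange(nblocks) * bufsize
            t0 = time.perf_counter()
            data = os.pread(fd, bufsize, offset)  # blocks until the read completes
            t1 = time.perf_counter()
            if len(data) != bufsize:
                raise OSError("short read")
            samples_us.append((t1 - t0) * 1e6)
    finally:
        os.close(fd)
    avg = statistics.fmean(samples_us)
    return {
        "reads": len(samples_us),
        "avg_us": avg,
        "lo_us": min(samples_us),
        "hi_us": max(samples_us),
        "stddev_us": statistics.pstdev(samples_us),
        "reads_per_sec": 1e6 / avg,  # QD1: one read in flight at a time
    }


# Hypothetical usage (the path is a placeholder, not a real device):
#   print(randread_qd1("/path/to/large/file", bufsize=4096, count=10000))
```

Running several copies of this loop in parallel, one per process, is the "QD1 x N threads" style of test rather than a single deep queue.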

And here is QD4 (really QD1 x 4 threads on 4 HW queues):

xeon126# randread /dev/nvme1s1b 4096 100 4

device /dev/nvme1s1b bufsize 4096 limit 16.000GB nprocs 4

44084/s avg= 90.74uS bw=180.57MB/s lo=65.17uS, hi=237.92uS stddev=16.94uS

44205/s avg= 90.49uS bw=181.05MB/s lo=65.38uS, hi=222.21uS stddev=16.56uS

44202/s avg= 90.49uS bw=181.04MB/s lo=65.19uS, hi=221.48uS stddev=16.72uS

44131/s avg= 90.64uS bw=180.75MB/s lo=64.44uS, hi=245.91uS stddev=16.81uS

44210/s avg= 90.48uS bw=181.08MB/s lo=63.73uS, hi=232.05uS stddev=16.74uS

So, as you can see, at QD1 the 960 EVO is doing around 11.4K transactions/sec and at QD4 it is doing around 44K transactions/sec. If I use a larger block size you can see the bandwidth lift off:

xeon126# randread /dev/nvme1s1b 32768 100 4

device /dev/nvme1s1b bufsize 32768 limit 16.000GB nprocs 4

19997/s avg=200.03uS bw=655.26MB/s lo=125.02uS, hi=503.26uS stddev=55.24uS

20090/s avg=199.10uS bw=658.23MB/s lo=124.62uS, hi=522.04uS stddev=54.83uS

20034/s avg=199.66uS bw=656.47MB/s lo=123.63uS, hi=495.74uS stddev=55.59uS

20008/s avg=199.92uS bw=655.62MB/s lo=123.50uS, hi=500.24uS stddev=55.92uS

20034/s avg=199.66uS bw=656.47MB/s lo=125.17uS, hi=488.30uS stddev=55.02uS

20000/s avg=200.00uS bw=655.35MB/s lo=123.19uS, hi=504.18uS stddev=55.98uS
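The bw column in these runs is simply the transaction rate multiplied by the block size. A small sketch of that arithmetic, using the first output line of each run so far; that the tool reports MB as 10^6 bytes is an inference from the numbers, not something the output states.

```python
def bandwidth_mb_s(reads_per_sec: int, bufsize: int) -> float:
    """Bandwidth in MB/s, taking 1 MB = 10**6 bytes (inferred from the logs)."""
    return reads_per_sec * bufsize / 1e6


# (reads/s, bufsize, bw as logged), copied from the first line of each run.
runs = [
    (11737, 4096, 48.07),    # 4KB,  nprocs 1
    (44084, 4096, 180.57),   # 4KB,  nprocs 4
    (19997, 32768, 655.26),  # 32KB, nprocs 4
]

for rate, bufsize, logged in runs:
    computed = bandwidth_mb_s(rate, bufsize)
    print(f"{rate:6d}/s x {bufsize:6d} B = {computed:7.2f} MB/s (logged {logged:.2f})")
```

So going from 4KB to 32KB blocks at the same queue depth more than halves the transaction rate but still moves about 3.6x the data per second.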

And if I use a deeper queue I can max out the bandwidth. On this particular device, random reads of incompressible data at 32KB top out at around 1 GByte/sec. I'll also show 64KB and 128KB:

xeon126# randread /dev/nvme1s1b 32768 100 64

device /dev/nvme1s1b bufsize 32768 limit 16.000GB nprocs 64

32989/s avg=1940.03uS bw=1080.98MB/s lo=1396.85uS, hi=3343.49uS stddev=291.76uS

32928/s avg=1943.62uS bw=1078.84MB/s lo=1386.21uS, hi=3462.96uS stddev=297.14uS

33012/s avg=1938.67uS bw=1081.73MB/s lo=1371.41uS, hi=3676.83uS stddev=290.64uS

33217/s avg=1926.70uS bw=1088.44MB/s lo=1385.18uS, hi=3344.11uS stddev=282.63uS

xeon126# randread /dev/nvme1s1b 65536 100 64

device /dev/nvme1s1b bufsize 65536 limit 16.000GB nprocs 64

14739/s avg=4342.19uS bw=965.93MB/s lo=3189.96uS, hi=6937.79uS stddev=466.51uS

14813/s avg=4320.60uS bw=970.67MB/s lo=3273.82uS, hi=6327.81uS stddev=442.92uS

14991/s avg=4269.15uS bw=982.43MB/s lo=3205.54uS, hi=6355.74uS stddev=432.94uS

xeon126# randread /dev/nvme1s1b 131072 100 64

device /dev/nvme1s1b bufsize 131072 limit 16.000GB nprocs 64

8052/s avg=7948.27uS bw=1055.38MB/s lo=6575.48uS, hi=9744.12uS stddev=496.41uS

8150/s avg=7853.00uS bw=1068.01MB/s lo=6540.51uS, hi=9496.64uS stddev=465.37uS

7986/s avg=8013.88uS bw=1046.72MB/s lo=6446.20uS, hi=9815.01uS stddev=518.95uS

--

Now the thing to note here is that with deeper queues the latency also goes up. With 32KB blocks, for example, the latency is around 200uS at QD4 and around 1940uS at QD64.
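The latency figures in the output follow directly from the queue depth and the transaction rate: with N requests always in flight, the average time per request is N divided by the reads per second. The sketch below checks that against the first output line of each run above. Treating nprocs as the queue depth is an assumption that matches how the tests were described (QD1 per thread), and the function name is made up for this example.

```python
def implied_latency_us(nprocs: int, reads_per_sec: int) -> float:
    """Average per-request latency implied by N in-flight requests at a given rate."""
    return nprocs / reads_per_sec * 1e6


# (bufsize, nprocs, reads/s, avg latency as logged in uS), first line of each run.
runs = [
    (4096, 1, 11737, 85.20),
    (4096, 4, 44084, 90.74),
    (32768, 4, 19997, 200.03),
    (32768, 64, 32989, 1940.03),
    (65536, 64, 14739, 4342.19),
    (131072, 64, 8052, 7948.27),
]

for bufsize, nprocs, rate, logged in runs:
    implied = implied_latency_us(nprocs, rate)
    print(f"{bufsize:6d} B, QD{nprocs:<2d}: implied {implied:8.2f}uS, logged {logged:8.2f}uS")
```

Once the device is bandwidth-limited, adding more requests in flight no longer raises the transaction rate, so every extra queued request shows up as added latency.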

Where Optane (aka XPoint) 'wins' is on latency. I don't have an Optane device to test yet, but Intel is saying an average latency of 10uS at QD1 over the NVMe interface (over a direct DDR interface it will of course be much faster). That's the 'win'. But it's completely irrelevant for the consumer case, because the consumer case is multi-block transfers and filesystems always do read-ahead (i.e. at least QD4). A disk cache does not need 10uS latency to be effective.

-Matt