Scenario / Questions

I have two machines connected with 10Gbit Ethernet. Let one of them be the NFS server and the other the NFS client.

Testing network speed over TCP with iperf shows ~9.8 Gbit/s throughput in both directions, so the network is OK.

Testing NFS server’s disk performance:

dd if=/dev/zero of=/mnt/test/rnd2 count=1000000

The result is ~150 MBytes/s, so the disk is fine for writing.
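
As a cross-check (the exact invocation below is my own sketch, not from the original post), forcing the data to disk before dd reports its figure avoids the page cache inflating the result:

dd if=/dev/zero of=/mnt/test/rnd2 bs=1M count=1000 conv=fdatasync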

Server’s /etc/exports is:

/mnt/test 192.168.1.0/24(rw,no_root_squash,insecure,sync,no_subtree_check)

The client mounts this share to its local /mnt/test with the following options:

node02:~ # mount | grep nfs
192.168.1.101:/mnt/test on /mnt/test type nfs4 (rw,relatime,sync,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.102,local_lock=none,addr=192.168.1.101)

If I try to download a large file (~5 GB) on the client machine from the NFS share, I get ~130-140 MBytes/s, which is close to the server's local disk performance, so it's satisfactory.

But when I try to upload a large file to the NFS share, the upload starts at ~1.5 MBytes/s, slowly increases up to 18-20 MBytes/s and then stops increasing.
Sometimes the share "hangs" for a couple of minutes before the upload actually starts, i.e. traffic between the hosts drops to nearly zero, and if I execute ls /mnt/test, it does not return for a minute or two. Then the ls command returns and the upload starts at its initial ~1.5 MBytes/s.

When the upload speed reaches its maximum (18-20 MBytes/s), I run iptraf-ng and it shows ~190 Mbit/s of traffic on the network interface, so neither the network nor the server's HDD is the bottleneck here.

What I tried:

1.
Set up an NFS server on a third host which was connected only with a 100Mbit Ethernet NIC. The results are analogous: download shows good performance and nearly full 100Mbit network utilization, while upload does not go faster than hundreds of kilobytes per second, leaving network utilization very low (2.5 Mbit/s according to iptraf-ng).

2.
I tried to tune some NFS parameters (an example mount command combining such options follows this list):

  • sync or async

  • noatime

  • soft instead of hard

  • rsize and wsize are maximal in my examples, so I tried to
    decrease them in several steps down to 8192
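
For example, several of these options could be combined in a single mount command like the following (the exact combination is my own illustration, not a command from the post):

mount -t nfs4 -o rw,async,noatime,soft,rsize=8192,wsize=8192 192.168.1.101:/mnt/test /mnt/test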

3.
I tried to switch client and server machines (set up NFS server on former client and vice versa). Moreover, there are six more servers with the same configuration, so I tried to mount them to each other in different variations. Same result.

4.
MTU=9000; MTU=9000 with 802.3ad link aggregation; link aggregation with MTU=1500.
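
For reference, setting MTU=9000 on an interface looks like this (eth0 is a placeholder interface name):

ip link set dev eth0 mtu 9000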

5.
sysctl tuning:

node01:~ # cat /etc/sysctl.conf 
net.core.wmem_max=16777216
net.core.rmem_max=16777216
net.ipv4.tcp_rmem= 10240 873800 16777216
net.ipv4.tcp_wmem= 10240 873800 16777216
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 5000
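
(For reference, settings in /etc/sysctl.conf take effect after running sysctl -p, or sysctl --system on CentOS 7, or after a reboot.)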

Same result.

6.
Mount from localhost:

node01:~ # cat /etc/exports
/mnt/test *(rw,no_root_squash,insecure,sync,no_subtree_check)
node01:~ # mount -t nfs -o sync localhost:/mnt/test /mnt/testmount/

And here I get the same result: download from /mnt/testmount/ is fast, upload to /mnt/testmount/ is very slow (no faster than 22 MBytes/s), and there is a small delay before the transfer actually starts. Does this mean that the network stack works flawlessly and the problem is in NFS?
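
A repeatable way to time that loopback upload (my own sketch; rnd3 is just an arbitrary test file name) would be:

dd if=/dev/zero of=/mnt/testmount/rnd3 bs=1M count=1000 conv=fdatasync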

None of this helped; the results didn't differ significantly from the default configuration. echo 3 > /proc/sys/vm/drop_caches was executed before all tests.

The MTU of all NICs on all 3 hosts is 1500, and no non-standard network tuning was performed. The Ethernet switch is a Dell MXL 10/40GbE.

OS is CentOS 7.

node01:/mnt/test # uname -a
Linux node01 3.10.0-123.20.1.el7.x86_64 #1 SMP Thu Jan 29 18:05:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

What settings am I missing? How can I make NFS write quickly and without hangs?

Below are possible solutions and suggestions for the questions above.

Suggestion: 1

You use the sync option in your export statement. This means that the server only confirms write operations after they are actually written to the disk. Given you have a spinning disk (i.e. no SSD), this requires on average at least half a revolution of the disk per write operation, which is the cause of the slowdown.

With the async setting, the server acknowledges the write operation to the client as soon as it is processed, before it is actually written to the disk. This is a little less reliable, e.g. in case of a power failure the client may have received an ack for an operation that never happened. However, it delivers a huge increase in write performance.

(edit) I just saw that you already tested the options async vs sync. However, I am almost sure that this is the cause of your performance degradation issue; I once saw exactly the same symptoms with an identical setup. Maybe test it again. Did you set the async option in the export statement on the server AND in the mount operation on the client at the same time?
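
For example, switching both sides to async would look roughly like this (a sketch, not the poster's exact configuration):

# on the server, in /etc/exports:
/mnt/test 192.168.1.0/24(rw,no_root_squash,insecure,async,no_subtree_check)
# then re-export:
exportfs -ra

# on the client:
mount -t nfs4 -o rw,async 192.168.1.101:/mnt/test /mnt/test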

Suggestion: 2

It can be a problem related to packet size and latency. Try the following:

Then report back your results.
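
Since the specific steps are not reproduced here, one guess at what a packet-size/latency check might involve is verifying that the MTU path is clean with non-fragmenting pings, e.g.:

ping -M do -s 1472 192.168.1.101     # largest payload that fits a 1500-byte MTU
ping -M do -s 8972 192.168.1.101     # largest payload that fits a 9000-byte MTU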

Suggestion: 3

http://veerapen.blogspot.com/2011/09/tuning-redhat-enterprise-linux-rhel-54.html

On systems with hardware RAID, changing the Linux I/O scheduler from the default [cfq] to [noop] gives I/O improvements.
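
For example (sda is a placeholder device name):

cat /sys/block/sda/queue/scheduler          # shows e.g. [cfq] deadline noop
echo noop > /sys/block/sda/queue/scheduler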

Use the nfsstat command to calculate the percentage of reads vs. writes, then set the RAID controller cache ratio to match.
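
For example, the server-side operation counters (from which the read/write ratio can be derived) are shown by:

nfsstat -s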

For heavy workloads you will need to increase the number of NFS server threads.
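
On CentOS 7 this is typically done by raising RPCNFSDCOUNT in /etc/sysconfig/nfs and restarting the NFS server (the value 16 below is only an example):

cat /proc/fs/nfsd/threads        # current number of nfsd threads
# in /etc/sysconfig/nfs:
RPCNFSDCOUNT=16
systemctl restart nfs-server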

Configure the NFS server to write to the disk without delay by using the no_wdelay export option.
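
In /etc/exports this would look like the following (keeping the poster's sync export, since no_wdelay has no effect on async exports):

/mnt/test 192.168.1.0/24(rw,no_root_squash,insecure,sync,no_wdelay,no_subtree_check)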

Tell the Linux kernel to flush as quickly as possible so that writes are kept as small as possible. In the Linux kernel, dirty pages writeback frequency can be controlled by two parameters.
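
The suggestion does not name them, but these are most likely vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs, the kernel's writeback-frequency knobs (the values below are illustrative only):

sysctl -w vm.dirty_expire_centisecs=500
sysctl -w vm.dirty_writeback_centisecs=100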

For faster disk writes, use the filesystem's data=journal mount option, and prevent updates to file access times, which themselves result in additional data being written to the disk. This mode is the fastest when data needs to be read from and written to disk at the same time, where it outperforms all other modes.
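
For example, assuming /mnt/test sits on an ext4 filesystem on /dev/sda1 (both are assumptions, not details from the post), the corresponding /etc/fstab entry would be:

/dev/sda1  /mnt/test  ext4  defaults,noatime,data=journal  0  2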