Scenario / Questions

I need to transfer a huge amount of MP3s between two servers (Ubuntu).
By huge I mean about a million files which are on average 300K.
I tried with scp, but it would have taken about a week (at about 500 KB/s).
If I transfer a single file by HTTP, I get 9-10 MB/s, but I don’t know how to transfer all of them.

Is there a way to transfer all of them quickly?

Below are the suggested solutions for the question above.

Suggestion: 1

I would recommend tar. When the file trees are already similar, rsync performs very well. However, since rsync will do multiple analysis passes on each file, and then copy the changes, it is much slower than tar for the initial copy. This command will likely do what you want. It will copy the files between the machines, as well as preserve both permissions and user/group ownerships.

tar -c /path/to/dir | ssh remote_server 'tar -xvf - -C /absolute/path/to/remotedir'

As per Mackintosh's comment below, this is the command you would use for rsync:

rsync -avW -e ssh /path/to/dir/ remote_server:/path/to/remotedir

Suggestion: 2

External hard drive and same-day courier delivery.

Suggestion: 3

I’d use rsync.

If you've got them exported via HTTP with directory listings available, you could use wget and the --mirror argument, too.

You’re already seeing that HTTP is faster than SCP because SCP is encrypting everything (and thus bottlenecking on the CPU). HTTP and rsync are going to move faster because they’re not encrypting.

Here’s some docs on setting up rsync on Ubuntu: https://help.ubuntu.com/community/rsync

Those docs talk about tunneling rsync over SSH, but if you’re just moving data around on a private LAN you don’t need SSH. (I’m assuming you are on a private LAN. If you’re getting 9-10MB/sec over the Internet then I want to know what kind of connections you have!)

Here are some other very basic docs that will allow you to set up a relatively insecure rsync server (with no dependence on SSH): http://transamrit.net/docs/rsync/
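
For reference, a minimal sketch of such a daemon setup; the module name, paths, and permissions here are made-up examples:

# /etc/rsyncd.conf on the receiving server
[mp3]
    path = /srv/mp3
    read only = no
    uid = nobody
    gid = nogroup
# (adjust uid/gid and directory permissions so the module is actually writable)

# start the daemon on the receiver (listens on TCP 873 by default)
rsync --daemon

# push from the sending server, no SSH involved
rsync -avW /path/to/dir/ rsync://remote_server/mp3/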

Suggestion: 4

Without much discussion, use netcat, the network Swiss Army knife. No protocol overhead; you're copying directly to the network socket.
Example

srv1$ tar cfv - *mp3 | nc -w1 remote.server.net 4321

srv2$ nc -l -p 4321 | tar xfv -

Suggestion: 5

With lots of files, if you do go with rsync, I would try to get version 3 or above on both ends, because older versions will enumerate every file before starting the transfer. The new feature is called incremental recursion.

A new incremental-recursion algorithm is now used when rsync is talking to another 3.x version. This starts the transfer going more quickly (before all the files have been found), and requires much less memory. See the --recursive option in the manpage for some restrictions.
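
A quick way to confirm what you have on both ends (remote_server is a placeholder):

rsync --version | head -n 1
ssh remote_server 'rsync --version | head -n 1'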

Suggestion: 6

rsync, like others have already recommended. If the CPU overhead from the encryption is a bottleneck, use a less CPU-intensive cipher, like blowfish, e.g. something like:

rsync -ax -e 'ssh -c blowfish' /local/path user@host:/remote/path

Suggestion: 7

When moving 80 TB of data (millions of tiny files) yesterday, switching from rsync to tar proved to be much faster. We stopped trying

# slow
rsync -av --progress /mnt/backups/section01/ /mnt/destination01/section01

and switched to tar instead…

# fast
cd /mnt/backups/
tar -cf - section01 | tar -xf - -C /mnt/destination01/ 

Since these servers are on the same LAN, the destination is NFS-mounted on the source system, which is doing the push. To make it even faster, we decided not to preserve the atime of the files:

mount -o remount,noatime /mnt/backups
mount -o remount,noatime /mnt/destination01

The graphic below depicts the difference the change from rsync to tar made. It was my boss's idea; my colleague both executed it and wrote the great write-up on his blog. I just like pretty pictures. 🙂

[Graph: transfer rate before and after switching from rsync to tar]

Suggestion: 8

When copying a large number of files, I found that tools like tar and rsync are less efficient than they need to be because of the overhead of opening and closing many files. I wrote an open-source tool called fast-archiver that is faster than tar for these scenarios: https://github.com/replicon/fast-archiver; it works faster by performing multiple concurrent file operations.

Here’s an example of fast-archiver vs. tar on a backup of over two million files; fast-archiver takes 27 minutes to archive, vs. tar taking 1 hour 23 minutes.

$ time fast-archiver -c -o /dev/null /db/data
skipping symbolic link /db/data/pg_xlog
1008.92user 663.00system 27:38.27elapsed 100%CPU (0avgtext+0avgdata 24352maxresident)k
0inputs+0outputs (0major+1732minor)pagefaults 0swaps

$ time tar -cf - /db/data | cat > /dev/null
tar: Removing leading `/' from member names
tar: /db/data/base/16408/12445.2: file changed as we read it
tar: /db/data/base/16408/12464: file changed as we read it
32.68user 375.19system 1:23:23elapsed 8%CPU (0avgtext+0avgdata 81744maxresident)k
0inputs+0outputs (0major+5163minor)pagefaults 0swaps

To transfer files between servers, you can use fast-archiver with ssh, like this:

ssh postgres@10.32.32.32 "cd /db; fast-archiver -c data --exclude=data/\*.pid" | fast-archiver -x

Suggestion: 9

I use the tar through netcat approach as well, except I prefer to use socat, which gives a lot more power to optimize for your situation, for example by tweaking mss. (Also, laugh if you want, but I find socat arguments easier to remember because they're consistent.) So for me this has been very common lately as I've been moving things to new servers:

host1$ tar cvf - filespec | socat stdin tcp4:host2:portnum

host2$ socat tcp4-listen:portnum stdout | tar xvpf -

Aliases are optional.
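
For completeness, here is a sketch of the MSS tweak mentioned above; it assumes your socat build supports the mss socket option, and 1400 is just an example value:

host1$ tar cvf - filespec | socat stdin tcp4:host2:portnum,mss=1400

host2$ socat tcp4-listen:portnum stdout | tar xvpf -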

Suggestion: 10

Another alternative is Unison. It might be slightly more efficient than rsync in this case, and it's somewhat easier to set up a listener.

Suggestion: 11

Looks like there may be a couple of typos in the top answer. This may work better:

tar -cf - /path/to/dir | ssh remote_server 'tar -xvf - -C /path/to/remotedir'

Suggestion: 12
  • Network File System (NFS) and then copy them with whatever you like, e.g. Midnight Commander (mc) or Nautilus (from GNOME). I have used NFS v3 with good results; a minimal mount-and-copy sketch follows this list.
  • Samba (CIFS) and then copy the files with whatever you want to, but I have no idea how efficient it is.
  • HTTP with wget --mirror as Evan Anderson has suggested or any other http client. Be careful not to have any nasty symlinks or misleading index files. If all you have is MP3s, you should be safe.
  • rsync. I have used it with pretty good results and one of its nice features is that you can interrupt and resume the transfer later.
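
A minimal sketch of the NFS route from the first bullet, with made-up export and mount paths:

# on the destination server: export the target directory (line added to /etc/exports)
/srv/mp3    source_server(rw,async,no_subtree_check)
# then reload the export table
exportfs -ra

# on the source server: mount it and copy with plain cp
mount -t nfs destination_server:/srv/mp3 /mnt/dest
cp -a /path/to/dir/. /mnt/dest/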

I’ve noticed that other people have recommended using netcat. Based on my experience with it I can say that it’s slow compared with the other solutions.

Suggestion: 13

Thanks to Scott Pack's wonderful answer (I didn't know how to do this with ssh before), I can offer this improvement (if bash is your shell). This will add parallel compression, a progress indicator, and an integrity check across the network link:

tar c file_list |
    tee >(sha512sum >&2) |
    pv -prab |
    pigz -9 |
    ssh [user@]remote_host '
        gunzip |
        tee >(sha512sum >&2) |
        tar xC /directory/to/extract/to
    '

pv is a nice progress viewer for your pipe, and pigz is a parallel gzip program that by default uses as many threads as your CPU has (I believe up to 8 max). You can tune the compression level to better fit the ratio of CPU to network bandwidth, or swap it out for pxz -9e and pxz -d if you have much more CPU than bandwidth. You only have to verify that the two sums match upon completion.

This option is useful for very large amounts of data as well as high latency networks, but not very helpful if the link is unstable and drops. In those cases, rsync is probably the best choice as it can resume.

Sample output:

6c1fe5a75cc0280709a794bdfd23d7b8b655f0bbb4c320e59729c5cd952b4b1f84861b52d1eddb601259e78249d3e6618f8a1edbd20b281d6cd15f80c8593c3e  -                     ]
 176MiB [9.36MiB/s] [9.36MiB/s] [                                            <=>                                                                        ]
6c1fe5a75cc0280709a794bdfd23d7b8b655f0bbb4c320e59729c5cd952b4b1f84861b52d1eddb601259e78249d3e6618f8a1edbd20b281d6cd15f80c8593c3e  -

For block devices:

dd if=/dev/src_device bs=1024k |
    tee >(sha512sum >&2) |
    pv -prab |
    pigz -9 |
    ssh [user@]remote_host '
        gunzip |
        tee >(sha512sum >&2) |
        dd of=/dev/dst_device bs=1024k
    '

Obviously, make sure they’re the same size or limit with count=, skip=, seek=, etc.

When I copy filesystems this way, I'll often first run dd if=/dev/zero of=/thefs/zero.dat bs=64k && sync && rm /thefs/zero.dat && umount /thefs to zero out most of the unused space, which speeds up the transfer.

Suggestion: 14

I don’t think you’re going to do any better than scp unless you install faster network cards. If you’re doing this over the internet, that will not help though.

I would recommend using rsync. It may not be any faster, but at least if it fails (or you shut it down because it’s taking too long), you can resume where you left off next time.

If you can connect the 2 machines directly using gigabit ethernet, that will probably be the fastest.

Suggestion: 15

For 100Mb/s the theoretical throughput is 12.5 MB/s, so at 10MB/s you are doing pretty well.

I would also echo the suggestion to do rsync, probably through ssh. Something like:

rsync -avW -e ssh $SOURCE $USER@$REMOTE:$DEST

At 100Mb/s your CPUs should be able to handle the encrypt/decrypt without appreciably impacting the data rate. And if you interrupt the data flow, you should be able to resume from where you left off. Beware, with “millions” of files the startup will take a while before it actually transfers anything.

Suggestion: 16

I’ve encountered this, except that I was transferring Oracle logs.

Here’s the breakdown

  • scp

    inefficient and encrypted (encrypted = slower than unencrypted 
    depending on the link and your processor) 
    
  • rsync

    efficient but typically encrypted (though not necessarily)
    
  • FTP/HTTP

    both seem to be efficient, and both are plaintext. 
    

I used FTP with great success (where great success is equivalent to ~700 Mb/s on a Gb network). If you're getting 10 MB/s (which equals 80 Mb/s), something is probably wrong.
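
The exact commands aren't shown above; one way to drive a bulk FTP transfer with parallel streams (assuming lftp is installed and an FTP server is running on the source; user, password, and paths are placeholders) might be:

lftp -u user,password -e 'mirror --parallel=4 /path/to/dir /local/destdir; quit' ftp://source_server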

What can you tell us about the source and destination of the data? Is it single drive to single drive? RAID to USB?

I know this question already has an answer, but if your network is going this slow on a Gb/s crossover cable, something absolutely needs to be fixed.

Suggestion: 17

You didn’t mention if the two machines are on the same LAN, or if a secure channel (i.e. using SSH) is mandatory, but another tool you could use is netcat.

I would use the following on the receiving machine:

cd <destdir>
netcat -l -p <port> | gunzip | cpio -i -d -m

Then on the sending side:

cd <srcdir>
find . -type f | cpio -o | gzip -1 | netcat <desthost> <port>

It has the following advantages:

  • No CPU overhead for the encryption that ssh has.
  • The gzip -1 provides light compression without saturating a CPU so it makes a good trade-off, giving a bit of compression while maintaining maximum throughput. (Probably not that advantageous for MP3 data, but doesn’t hurt.)
  • If you can partition the files up into groups, you can run two or more pipes in parallel and really ensure you're saturating your network bandwidth.

e.g.,

find <dir1> <dir2> -type f | cpio -o | gzip -1 | netcat <desthost> <portone>
find <dir3> <dir4> -type f | cpio -o | gzip -1 | netcat <desthost> <porttwo>

Notes:

  • Whatever way you transfer, I would probably run an rsync or unison afterwards to ensure you got everything.
  • You could use tar instead of cpio if you prefer.
  • Even if you do end up using ssh, I would ensure it is not using any compression itself, and pipe through gzip -1 yourself instead to avoid CPU saturation. (Or at least set the CompressionLevel to 1.)

Suggestion: 18

A simple scp with proper options will easily reach 9-10 MB/s over LAN:

scp -C -c arcfour256 ./local/files.mp3 remoteuser@remoteserver:/opt/remote

With those options, throughput will likely be 4 or 5 times faster than with no options (the default).

Suggestion: 19

If you have an FTP server on the source side, you can use ncftpget from the NcFTP site. It works perfectly with small files, as it uses tar internally.

One comparison, moving 1.9 GB of small files (33,926 files), shows this:

  1. Using scp takes 11m59s
  2. Using rsync takes 7m10s
  3. Using ncftpget takes 1m20s
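
The command line itself isn't shown above; a recursive fetch with ncftpget might look like this (user, password, and paths are placeholders):

ncftpget -R -u user -p password source_server /local/destdir /path/to/dir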

Suggestion: 20

You can also try using the BBCP command to do your transfer. It’s a buffered parallel ssh that really screams. We can usually get 90%+ line-rate provided we can keep the pipe fed.

$ bbcp -s 8 -w 64M -N io 'tar -cO srcdirectory' desthostname:'tar -x -C destdir'

Normally, we try really hard to avoid having to move stuff around. We use ZFS pools that we can always just "add" more disk space to. But sometimes you just have to move stuff. If we have a "live" filesystem that may take hours (or days) to copy even when going full blast, we do the old two-step zfs send routine:

  1. Make a ZFS snapshot, and transfer to the new pool on the new machine. Let it take as long as it takes.
  2. Make a second snapshot, and send it as an incremental. The incremental snapshot only includes the (much smaller) change-set since the first, so it goes through relatively quick.
  3. Once the incremental snapshot has completed, you can turn off the original and cut over to the new copy, keeping your "offline downtime" to a minimum.

We send our zfs dumps over BBCP as well; it maximizes our network utilization and minimizes the transfer times.
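
A minimal sketch of that two-step snapshot routine, with made-up pool and dataset names (and plain ssh instead of BBCP for brevity):

# step 1: full send of the first snapshot; the source stays live while it runs
zfs snapshot tank/data@migrate1
zfs send tank/data@migrate1 | ssh newhost zfs receive newpool/data

# step 2: quiesce writes, then send only the changes since the first snapshot
zfs snapshot tank/data@migrate2
zfs send -i tank/data@migrate1 tank/data@migrate2 | ssh newhost zfs receive newpool/data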

BBCP is freely available, you can google it, and it's a straightforward compile. Just copy it into /usr/local/bin on both the source and destination machines and it'll pretty much just work.

Suggestion: 21

I guess my answer is a little late here, but I have had good experiences using mc (Midnight Commander) on one server to connect via SFTP to the other server.

The option to connect via FTP is in the "Left" and "Right" menus; you enter the address like this:

/#ftp:name@server.xy/

or

/#ftp:name@ip.ad.dr.ess/

Then you can navigate and do file operations almost like on a local filesystem.

It has a built-in option to do the copying in the background, but I prefer using the screen command and detaching from the screen while mc is copying (I think it runs faster that way, too).
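
If you want to try the screen approach, the basic sequence is as follows (the session name is arbitrary):

screen -S mc-copy      # start a named screen session
mc                     # launch Midnight Commander and start the copy
# detach with Ctrl-a d; reattach later with:
screen -r mc-copy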

Suggestion: 22

To add to @scottpack's rsync answer:

To display the progress of the transfer, add --progress after -avW in the command, as shown below.

rsync -avW --progress -e ssh /path/to/dir/ remote_server:/path/to/remotedir


Suggestion: 23

Here is a quick benchmark comparing some techniques:

  • Source is a 4-core Intel(R) Xeon(R) CPU E5-1620 @ 3.60GHz with 250 Mbps bandwidth and a SATA drive
  • Destination is a 6-core Intel(R) Xeon(R) CPU E-2136 @ 3.30GHz with 1 Gbps bandwidth and an SSD drive

Number of files : 9632,
Total size : 814 MiB,
Avg size : 84 KiB

  • RSYNC : 1m40.570s
  • RSYNC + COMPRESSION : 0m26.519s
  • TAR + NETCAT : 1m58.763s
  • TAR + COMPRESSION + NETCAT : 0m28.009s

The commands for tar/netcat were:

Source : tar -cf - /sourcedir/ | nc -v 11.22.33.44 5000
Dest : nc -v -l 5000 | tar -xf -
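
The compressed variants aren't shown; plausible equivalents (my assumption: rsync with -z, and a gzip stage in the netcat pipe) would be:

rsync -avz /sourcedir/ 11.22.33.44:/destdir/

Source : tar -cf - /sourcedir/ | gzip | nc -v 11.22.33.44 5000
Dest : nc -v -l 5000 | gunzip | tar -xf -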

Suggestion: 24

rsync, or you might wish to tar it up so it's all in one file and then scp it. If you lack the disk space, you can pipe the tar directly over ssh while it's being made.

Suggestion: 25

If you're sending MP3s and other compressed files, you won't gain much from any solution that tries to compress them further. The solution would be something that can create multiple connections between both servers and thus put more load on the bandwidth between the two systems. Once this maxes out, there's not much that can be gained without improving your hardware (faster network cards between those servers, for example).
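
A minimal sketch of running several streams at once, assuming the MP3s are spread across top-level subdirectories and that four parallel rsync processes are enough to fill the link:

# one rsync per top-level subdirectory, at most 4 at a time
ls /path/to/dir | xargs -P4 -I{} rsync -aW /path/to/dir/{}/ remote_server:/path/to/remotedir/{}/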

Suggestion: 26

I tried a couple of tools for copying a 1 GB file. The results are below:

  • HTTP was the fastest, with wget -c
  • nc was second in line
  • scp was the slowest, and failed a couple of times, with no way to resume
  • rsync uses ssh as a backend, thus the same result

In conclusion, I would go for HTTP with wget -bqc and give it some time. Hope that this helps.

Suggestion: 27

I had to copy the BackupPC disk onto another machine.

I used rsync.

The machine had 256 MB of memory.

The procedure I followed was this one:

  • executed rsync without -H (took 9 hours)
  • when rsync had finished synchronizing the cpool directory and started on the pc directory, I cut the transfer.
  • then I restarted rsync with the -H flag, and all the files hard-linked in the pc directory were correctly transferred (the procedure found all the real files in cpool and then linked them into the pc directory) (this took 3 hours).

In the end I could verify with df -m that no extra space was spent.

This way I worked around the memory problem with rsync. The whole time I could monitor performance using top and atop, and in the end I transferred 165 GB of data.
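
A rough sketch of that two-pass idea (the paths assume BackupPC's default layout and are placeholders):

# pass 1: bulk copy of the pool without spending memory on hard-link tracking
rsync -a /var/lib/backuppc/cpool/ newhost:/var/lib/backuppc/cpool/

# pass 2: re-run over the whole tree with -H so the hard links in pc/ are recreated against cpool/
rsync -aH /var/lib/backuppc/ newhost:/var/lib/backuppc/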