Scenario / Questions
I need to transfer a huge amount of mp3s between two servers (Ubuntu).
By huge I mean about a million files which are on average 300K.
I tried scp, but it would have taken about a week (about 500 KB/s).
If I transfer a single file by HTTP, I get 9-10 MB/s, but I don’t know how to transfer all of them.
Is there a way to transfer all of them quickly?
Below are the suggested solutions for the question above.
I would recommend tar. When the file trees are already similar, rsync performs very well. However, since rsync will do multiple analysis passes on each file, and then copy the changes, it is much slower than tar for the initial copy. This command will likely do what you want. It will copy the files between the machines, as well as preserve both permissions and user/group ownerships.
tar -c /path/to/dir | ssh remote_server 'tar -xvf - -C /absolute/path/to/remotedir'
As per Mackintosh’s comment below, this is the command you would use for rsync:
rsync -avW -e ssh /path/to/dir/ remote_server:/path/to/remotedir
External hard drive and same-day courier delivery.
I’d use rsync.
If you’ve got them exported via HTTP with directory listings available, you could use wget and the --mirror argument, too.
You’re already seeing that HTTP is faster than SCP because SCP is encrypting everything (and thus bottlenecking on the CPU). HTTP and rsync are going to move faster because they’re not encrypting.
Here’s some docs on setting up rsync on Ubuntu: https://help.ubuntu.com/community/rsync
Those docs talk about tunneling rsync over SSH, but if you’re just moving data around on a private LAN you don’t need SSH. (I’m assuming you are on a private LAN. If you’re getting 9-10MB/sec over the Internet then I want to know what kind of connections you have!)
Here are some other very basic docs that will allow you to set up a relatively insecure rsync server (with no dependence on SSH): http://transamrit.net/docs/rsync/
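For reference, a minimal rsync daemon setup might look something like this (the module name and paths are made up for illustration):

```
# /etc/rsyncd.conf on the receiving server (hypothetical "mp3s" module)
[mp3s]
    path = /srv/mp3s
    read only = no
    uid = nobody
    gid = nogroup
```

Then start rsync --daemon on the receiver and push from the source with rsync -av /path/to/dir/ rsync://destserver/mp3s/ (no SSH, no encryption).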
Without much discussion, use netcat, the network Swiss-army knife. There is no protocol overhead; you’re copying directly to the network socket.
srv2$ nc -l -p 4321 | tar xfv -
srv1$ tar cfv - *mp3 | nc -w1 remote.server.net 4321
(Start the listener on srv2 first.)
With lots of files, if you do go with rsync, I would try to get version 3 or above on both ends, because older versions will enumerate every file before the transfer starts. The new feature is called incremental recursion.
A new incremental-recursion algorithm is now used when rsync is talking to another 3.x version. This starts the transfer going more quickly (before all the files have been found), and requires much less memory. See the --recursive option in the manpage for some restrictions.
rsync, like others have already recommended. If the CPU overhead from the encryption is a bottleneck, use another less CPU intensive algorithm, like blowfish. E.g. something like
rsync -ax -e 'ssh -c blowfish' /local/path user@host:/remote/path
In moving 80 TB of data (millions of tiny files) yesterday, switching from rsync to tar proved to be much faster. We stopped trying
# slow
rsync -av --progress /mnt/backups/section01/ /mnt/destination01/section01
and switched to
# fast
cd /mnt/backups/
tar -cf - section01 | tar -xf - -C /mnt/destination01/
Since these servers are on the same LAN, the destination is NFS-mounted on the source system, which is doing the push. To make it even faster, we decided not to preserve the atime of the files:
mount -o remount,noatime /mnt/backups
mount -o remount,noatime /mnt/destination01
The graphic below depicts the difference the change from rsync to tar made. It was my boss’s idea; my colleague executed it and wrote the great writeup on his blog. I just like pretty pictures. 🙂
When copying a large number of files, I found that tools like tar and rsync are more inefficient than they need to be because of the overhead of opening and closing many files. I wrote an open source tool called fast-archiver that is faster than tar for these scenarios: https://github.com/replicon/fast-archiver; it works faster by performing multiple concurrent file operations.
Here’s an example of fast-archiver vs. tar on a backup of over two million files; fast-archiver takes 27 minutes to archive, vs. tar taking 1 hour 23 minutes.
$ time fast-archiver -c -o /dev/null /db/data
skipping symbolic link /db/data/pg_xlog
1008.92user 663.00system 27:38.27elapsed 100%CPU (0avgtext+0avgdata 24352maxresident)k
0inputs+0outputs (0major+1732minor)pagefaults 0swaps

$ time tar -cf - /db/data | cat > /dev/null
tar: Removing leading `/' from member names
tar: /db/data/base/16408/12445.2: file changed as we read it
tar: /db/data/base/16408/12464: file changed as we read it
32.68user 375.19system 1:23:23elapsed 8%CPU (0avgtext+0avgdata 81744maxresident)k
0inputs+0outputs (0major+5163minor)pagefaults 0swaps
To transfer files between servers, you can use fast-archiver with ssh, like this:
ssh email@example.com "cd /db; fast-archiver -c data --exclude=data/\*.pid" | fast-archiver -x
I use the tar-through-netcat approach as well, except I prefer socat: it gives you a lot more power to optimize for your situation, for example by tweaking the MSS. (Also, laugh if you want, but I find socat arguments easier to remember because they’re consistent.) So for me this has been very common lately, as I’ve been moving things to new servers:
host1$ tar cvf - filespec | socat stdin tcp4:host2:portnum
host2$ socat tcp4-listen:portnum stdout | tar xvpf -
Aliases are optional.
Another alternative is Unison. It might be slightly more efficient than rsync in this case, and it’s somewhat easier to set up a listener.
Looks like there may be a couple of typos in the top answer. This may work better:
tar -cf - /path/to/dir | ssh remote_server 'tar -xvf - -C /path/to/remotedir'
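You can sanity-check this pipeline locally before using it for real; in this sketch a plain pipe stands in for the "| ssh remote_server" hop, so both tar processes run on one machine:

```shell
# Local stand-in for the tar-over-ssh pipeline: the pipe replaces the
# ssh hop, so both the packing and unpacking tar run on one machine.
set -e
src=$(mktemp -d); dst=$(mktemp -d)
echo "test tune" > "$src/song.mp3"
tar -C "$src" -cf - . | tar -xf - -C "$dst"
result=$(cat "$dst/song.mp3")   # file arrived intact on the "remote" side
echo "$result"
rm -rf "$src" "$dst"
```

If the echoed contents match, the same pipeline with the ssh hop back in place should behave identically.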
- Network File System (NFS), and then copy them with whatever you like, e.g. Midnight Commander (mc) or Nautilus (from GNOME). I have used NFS v3 with good results.
- Samba (CIFS) and then copy the files with whatever you want to, but I have no idea how efficient it is.
- HTTP with wget --mirror, as Evan Anderson suggested, or any other HTTP client. Be careful not to have any nasty symlinks or misleading index files. If all you have is MP3s, you should be safe.
- rsync. I have used it with pretty good results and one of its nice features is that you can interrupt and resume the transfer later.
I’ve noticed that other people have recommended using netcat. Based on my experience with it I can say that it’s slow compared with the other solutions.
Thanks to Scott Pack’s wonderful answer (I didn’t know how to do this with ssh before), I can offer this improvement (if bash is your shell). This will add parallel compression, a progress indicator, and an integrity check across the network link:
tar c file_list | tee >(sha512sum >&2) | pv -prab | pigz -9 |
  ssh [user@]remote_host 'gunzip | tee >(sha512sum >&2) | tar xC /directory/to/extract/to'
pv is a nice progress viewer program for your pipe, and pigz is a parallel gzip program that by default uses as many threads as your CPU has (I believe up to 8 max). You can tune the compression level to better fit the ratio of CPU to network bandwidth, or swap it out for pxz -9e and pxz -d if you have much more CPU than bandwidth. You only have to verify that the two sums match upon completion.
This option is useful for very large amounts of data as well as high latency networks, but not very helpful if the link is unstable and drops. In those cases, rsync is probably the best choice as it can resume.
6c1fe5a75cc0280709a794bdfd23d7b8b655f0bbb4c320e59729c5cd952b4b1f84861b52d1eddb601259e78249d3e6618f8a1edbd20b281d6cd15f80c8593c3e -
176MiB [9.36MiB/s] [9.36MiB/s] [ <=> ]
6c1fe5a75cc0280709a794bdfd23d7b8b655f0bbb4c320e59729c5cd952b4b1f84861b52d1eddb601259e78249d3e6618f8a1edbd20b281d6cd15f80c8593c3e -
For block devices:
dd if=/dev/src_device bs=1024k | tee >(sha512sum >&2) | pv -prab | pigz -9 |
  ssh [user@]remote_host 'gunzip | tee >(sha512sum >&2) | dd of=/dev/dst_device bs=1024k'
Obviously, make sure they’re the same size or limit with count=, skip=, seek=, etc.
When I copy filesystems this way, I’ll often first run
dd if=/dev/zero of=/thefs/zero.dat bs=64k && sync && rm /thefs/zero.dat && umount /thefs
to zero most of the unused space, which speeds up the transfer.
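The mechanics of that zero-fill step can be exercised safely against a scratch directory (a stand-in for a real mount point, so no root or umount is needed):

```shell
# Demo of the zero-fill trick against a scratch directory instead of a
# real mount point; on a real filesystem the freed zero blocks compress
# to almost nothing in the dd|pigz pipe.
set -e
fs=$(mktemp -d)                          # stands in for /thefs
dd if=/dev/zero of="$fs/zero.dat" bs=64k count=16 2>/dev/null
sync
rm "$fs/zero.dat"                        # the zeros remain in the free blocks
echo "zero-filled and removed"
rmdir "$fs"
```

On a real filesystem you would let dd run until the disk is full, then remove the file and unmount before imaging the device.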
I don’t think you’re going to do any better than scp unless you install faster network cards. If you’re doing this over the internet, that will not help though.
I would recommend using rsync. It may not be any faster, but at least if it fails (or you shut it down because it’s taking too long), you can resume where you left off next time.
If you can connect the 2 machines directly using gigabit ethernet, that will probably be the fastest.
For 100Mb/s the theoretical throughput is 12.5 MB/s, so at 10MB/s you are doing pretty well.
I would also echo the suggestion to do rsync, probably through ssh. Something like:
rsync -avW -e ssh $SOURCE $USER@$REMOTE:$DEST
At 100Mb/s your CPUs should be able to handle the encrypt/decrypt without appreciably impacting the data rate. And if you interrupt the data flow, you should be able to resume from where you left off. Beware, with “millions” of files the startup will take a while before it actually transfers anything.
I’ve encountered this, except that I was transferring Oracle logs.
Here’s the breakdown:
- scp: inefficient and encrypted (encrypted = slower than unencrypted, depending on the link and your processor)
- rsync over ssh: efficient, but typically encrypted (though not necessarily)
- FTP/HTTP: both seem to be efficient, and both are plaintext.
I used FTP with great success (where great success is equivalent to ~700 Mb/s on a Gb network). If you’re getting 10 MB/s (which is equal to 80 Mb/s), something is probably wrong.
What can you tell us about the source and destination of the data? Is it single drive to single drive? RAID to USB?
I know this question already has an answer, but if your network is going this slow over a Gb/s crossover cable, something absolutely needs fixing.
You didn’t mention if the two machines are on the same LAN, or if a secure channel (i.e. using SSH) is mandatory, but another tool you could use is netcat.
I would use the following on the receiving machine:
cd <destdir>
netcat -l -p <port> | gunzip | cpio -i -d -m
Then on the sending side:
cd <srcdir>
find . -type f | cpio -o | gzip -1 | netcat <desthost> <port>
It has the following advantages:
- No CPU overhead for the encryption that ssh has.
- gzip -1 provides light compression without saturating a CPU, so it makes a good trade-off, giving a bit of compression while maintaining maximum throughput. (Probably not that advantageous for MP3 data, but it doesn’t hurt.)
- If you can partition the files up into groups, you can run two or more pipes in parallel and really ensure you’re saturating your network bandwidth:
find <dir1> <dir2> -type f | cpio -o | gzip -1 | netcat <desthost> <portone>
find <dir3> <dir4> -type f | cpio -o | gzip -1 | netcat <desthost> <porttwo>
- Whatever way you transfer, I would probably run an rsync or unison afterwards to ensure you got everything.
- You could use cpio if you prefer.
- Even if you do end up using ssh, I would ensure it is not using any compression itself, and pipe through gzip -1 yourself instead to avoid CPU saturation. (Or at least set the CompressionLevel to 1.)
A simple scp with proper options will easily reach 9-10 MB/s over LAN:
scp -C -c arcfour256 ./local/files.mp3 remoteuser@remoteserver:/opt/remote
With those options, the throughput is likely to be 4x or 5x faster than with no options (the default).
If you have an FTP server on the source side, you can use ncftpget from the ncftp site. It works perfectly with small files as it uses tar internally.
One comparison, moving 1.9 GB of small files (33,926 files), shows this:
- Using scp takes 11m59s
- Using rsync takes 7m10s
- Using ncftpget takes 1m20s
You can also try using the BBCP command to do your transfer. It’s a buffered parallel ssh that really screams. We can usually get 90%+ line-rate provided we can keep the pipe fed.
$ bbcp -s 8 -w 64M -N io 'tar -cO srcdirectory' desthostname:'tar -x -C destdir'
Normally, we try real hard to avoid having to move stuff around. We use ZFS pools that we can always just “add” more disk space to. But sometimes… you just have to move stuff. If we have a “live” filesystem that may take hours (or days) to copy even when going full blast, we do the ol’ two-step zfs send routine:
- Make a ZFS snapshot, and transfer to the new pool on the new machine. Let it take as long as it takes.
- Make a second snapshot, and send it as an incremental. The incremental snapshot only includes the (much smaller) change-set since the first, so it goes through relatively quick.
- Once the incremental snapshot is completed, you can turn off the original and cut over to the new copy, and your “offline downtime” is kept to a minimum.
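The two-step routine sketched in commands (pool and dataset names here are made up for illustration; you need real ZFS pools on both machines, and the receive side may need -F if the target dataset has been touched):

```shell
# Step 1: full snapshot, sent while the filesystem stays live
zfs snapshot tank/media@migrate1
zfs send tank/media@migrate1 | ssh newhost zfs receive newtank/media

# Step 2: after quiescing writers, send only the changes since step 1
zfs snapshot tank/media@migrate2
zfs send -i @migrate1 tank/media@migrate2 | ssh newhost zfs receive newtank/media
```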
We send our zfs dumps over BBCP as well; it maximizes our network utilization and minimizes the transfer times.
BBCP is freely available (you can google it), and it’s a straightforward compile. Just copy it into /usr/local/bin on both source and destination machines and it’ll pretty much just work.
I guess my answer is a little late here, but I’ve had good experiences using mc (Midnight Commander) on one server to connect via SFTP to the other server.
The option to connect via FTP is in the “Left” and “Right” menus; enter the remote server’s address there. Then you can navigate and do file operations almost like on a local filesystem.
It has a built-in option to do the copying in the background, but I prefer using the screen command and detaching from the screen session while mc is copying (I think it runs faster then, too).
Here is a quick benchmark to compare some techniques,
- Source is a 4-core Intel(R) Xeon(R) CPU E5-1620 @ 3.60GHz with 250 Mbps bandwidth and a SATA drive
- Destination is a 6-core Intel(R) Xeon(R) CPU E-2136 @ 3.30GHz with 1 Gbps bandwidth and an SSD drive
Number of files: 9632, Total size: 814 MiB, Avg size: 84 KiB
- RSYNC : 1m40.570s
- RSYNC + COMPRESSION : 0m26.519s
- TAR + NETCAT : 1m58.763s
- TAR + COMPRESSION + NETCAT : 0m28.009s
Command for tar/netcat was :
Source : tar -cf - /sourcedir/ | nc -v 188.8.131.52 5000
Dest   : nc -v -l 5000 | tar -xf -
rsync, or you might wish to tar it so it’s all within one file and then scp it. If you lack the disk space, you can pipe the tar directly over ssh while it’s being made.
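A sketch of piping the tar over ssh as it’s being made; here a local `cat` stands in for the `ssh user@remote 'cat > files.tar.gz'` hop so the example is self-contained:

```shell
# Create the tarball and ship it through the pipe in one pass, so no
# local disk space is needed for the archive itself.
set -e
src=$(mktemp -d); out=$(mktemp -d)
echo "hi" > "$src/a.txt"
tar -C "$src" -czf - . | cat > "$out/files.tar.gz"  # in practice: | ssh host 'cat > files.tar.gz'
listing=$(tar -tzf "$out/files.tar.gz")             # verify the archive is readable
echo "$listing"
rm -rf "$src" "$out"
```

The same idea works in reverse with `ssh host 'tar -czf - /path' > files.tar.gz` to pull instead of push.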
If you’re sending over MP3’s and other compressed files, you won’t gain much from any solution that tries to further compress those files. The solution would be something that can create multiple connections between both servers and thus put more stress on the bandwidth between the two system. Once this maxes out, there’s not much that can be gained without improving your hardware. (Faster network cards between those servers, for example.)
I tried a couple of tools for copying a 1 GB file.
The result is below:
- HTTP was the fastest, with wget -c
- nc was second in line
- scp was the slowest, and failed a couple of times, with no way to resume
- rsync uses ssh as a backend, thus the same result
In conclusion, I would go for HTTP with wget -bqc and give it some time. Hope this helps.
I had to copy the BackupPC disk into another machine.
I used rsync.
The machine had 256 MB of memory.
The procedure I followed was this one:
- first I ran rsync without the -H flag (took 9 hours)
- when rsync finished, I synchronized the cpool directory and started with the pc directory; I cut the transfer.
- then I restarted rsync with the -H flag, and all the files hard-linked in the pc directory were correctly transferred (the procedure found all the real files in cpool and then linked to the pc directory) (took 3 hours).
In the end I could verify with df -m that no extra space was used.
This way I avoided the memory problem with rsync. The whole time I could monitor performance using top and atop, and in the end I transferred 165 GB of data.