Tuesday, March 10, 2015

Using cp to Copy a Lot of Files

Rasmus Borup Hansen (via Hacker News):

Having almost used up the capacity we decided to order another storage enclosure, copy the files from the old one to the new one, and then get the old one into a trustworthy state and use it to extend the total capacity. Normally I’d have copied/moved the files at block-level (eg. using dd or pvmove), but suspecting bad blocks, I went for a file-level copy because then I’d know which files contained the bad blocks. I browsed the net for other peoples’ experience with copying many files and quickly decided that cp would do the job nicely. Knowing that preserving the hardlinks would require bookkeeping of which files have already been copied I also ordered 8 GB more RAM for the server and configured more swap space.

[…]

After some days of copying the first real surprise came: I noticed that the copying had stopped, and cp did not make any system calls at all according to strace. Reading the source code revealed that cp keeps track of which files have been copied in a hash table that now and then has to be resized to avoid too many collisions. When the RAM has been used up, this becomes a slow operation.

Trusting that resizing the hash table would eventually finish, the cp command was allowed to continue, and after a while it started copying again. It stopped again and resized the hash table a couple of times, each taking more and more time. Finally, after 10 days of copying and hash table resizing, the new file system used as many blocks and inodes as the old one according to df, but to my surprise the cp command didn’t exit. Looking at the source again, I found that cp disassembles its hash table data structures nicely after copying (the forget_all call). Since the virtual size of the cp process was now more than 17 GB and the server only had 10 GB of RAM, it did a lot of swapping.

As far as I know, the Mac version of cp does not preserve hard links.
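
The bookkeeping Hansen describes can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not cp's actual implementation: the function name copy_tree_preserving_hardlinks is hypothetical, and a plain dict stands in for cp's hash table. It remembers the device and inode number of each multiply-linked source file, so that later names for the same file become hard links in the destination instead of fresh copies.

```python
import os
import shutil
import stat


def copy_tree_preserving_hardlinks(src_root, dst_root):
    """Copy src_root to dst_root, recreating hard links between regular files.

    Returns the number of entries left in the bookkeeping table.
    (Hypothetical helper for illustration; not how cp is written.)
    """
    # Maps (device, inode) of a source file to the first destination path
    # created for it. One entry is added per multiply-linked source file,
    # which is why memory use grows with the number of such files.
    seen = {}

    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)

        for name in dirnames + filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            st = os.lstat(src)  # lstat: do not follow symlinks

            if stat.S_ISLNK(st.st_mode):
                # Recreate the symlink itself rather than copying its target.
                os.symlink(os.readlink(src), dst)
            elif stat.S_ISREG(st.st_mode):
                key = (st.st_dev, st.st_ino)
                if st.st_nlink > 1 and key in seen:
                    # Already copied under another name: link, don't copy.
                    os.link(seen[key], dst)
                else:
                    shutil.copy2(src, dst)
                    if st.st_nlink > 1:
                        seen[key] = dst
            # Real directories are created when os.walk visits them;
            # devices, sockets, and FIFOs are skipped in this sketch.

    return len(seen)
```

Only files with a link count above one are recorded, so a tree without hard links costs almost nothing to track. A tree with many millions of multiply-linked files, like the one in the article, needs a table entry for each of them for the whole duration of the copy.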

3 Comments

On the Mac I typically use ditto or SuperDuper! for this kind of thing. On other platforms I use rsync. I never even thought about using cp, which on its surface does seem a little strange – but after reading the above, I guess it was a happy coincidence.

I stopped using cp a long time ago. It is a source of too many issues (symlinks followed, poor hard link and attribute support, …).

When cloning a directory, I'd rather use ditto or rsync, which are far better at it.

If I have to copy a lot of files I always use rsync. I only use cp when there are a small number of files.
