De-Duping files on BTRFS.
Brave souls can test
BTRFS for a couple of
Fedora releases.\
\
Removing duplicate/redundant files on filesystems is a common thing,
e.g. when creating regular backups or so. On ext4 this can be realized
using traditional hardlinks.\
Hardlinks all point to the same blocks on the logical drive below. So if
a write happens to one of the hardlinks, this also “appears” in all
other hardlinks (which point to he same - modified - block).\
This is no problem in a backup scenario, as you normally don’t modify
backuped files.\
\
In my case I wanted to remove redundant files that might get modified
and the changes shouldn’t be reflected in all other copies. So what I
want to achieve is to let several links (files) point to the same block
for reading, but if a write happens to one block this should be just
happen to the one file (link). So, copy the file on write. Wait, don’t
we know that as CoW? Yep.\
\
Luckily BTRFS allows cow files using the cp --reflink
command.\
The following snippet replaces all copies of a file with “light weight”
aka cow copies.\
\
#!/bin/bash
# Usage: dedup.sh PATH_TO_HIER_WITH_MANY_EXPECTED_DUPES
mkdir sums
find $@ -type f -print0 | while read -d $'\0' -r F
do
echo -n "$F : "
FHASH=$(sha256sum "$F" | cut -d" " -f1);
# If hashed, it's probably a dupe, compare bytewise
# and create a reflink (so cow)
if [[ -f "sums/$FHASH" ]] && cmp -s "sums/$FHASH" "$F";
then
echo "Dup." ;
rm "$F" ;
cp --reflink "sums/$FHASH" "$F" ;
# It's a new file, create a hash entry.
else
echo "New." ;
cp --reflink "$F" "sums/$FHASH" ;
fi
done
rm sums/*
rmdir sums
\ And in general, btrfs didn’t yet eat my data, it even survived two power losses …\ Update: Updated to handle files with special characters. This script also makes some assumptions, e.g. the files should not be modified while running this script.
::: {#footer} [ December 15th, 2011 8:54pm ]{#timestamp} [fs]{.tag} [fedora]{.tag} [file]{.tag} [duplicate]{.tag} [btrfs]{.tag} :::