De-Duping files on BTRFS.

Brave souls have been able to test BTRFS on Fedora for a couple of releases now.

Removing duplicate/redundant files from a filesystem is a common task, e.g. when creating regular backups. On ext4 this can be done with traditional hardlinks. All hardlinks point to the same blocks on the logical drive below, so if a write happens through one hardlink, it also "appears" in all the other hardlinks (which point to the same, now modified, blocks). This is no problem in a backup scenario, as you normally don't modify backed-up files.

In my case I wanted to remove redundant files that might get modified later, and the changes should not be reflected in all the other copies. So what I want to achieve is to let several links (files) point to the same blocks for reading, but if a write happens, it should only affect the one file that was written to. In other words: copy the file on write. Wait, don't we know that as CoW? Yep.

Luckily, BTRFS allows CoW copies via `cp --reflink`. The following snippet replaces all duplicates of a file with "lightweight" aka CoW copies.

#!/bin/bash
# Usage: dedup.sh PATH_TO_HIER_WITH_MANY_EXPECTED_DUPES
# The sums directory must live on the same BTRFS filesystem
# as the files, otherwise cp --reflink will fail.
mkdir sums
find "$@" -type f -print0 | while read -d $'\0' -r F
do
  echo -n "$F : "
  FHASH=$(sha256sum "$F" | cut -d" " -f1);
  # If the hash is already known, it's probably a dupe:
  # compare bytewise to be sure, then replace the file
  # with a reflink (CoW) copy.
  if [[ -f "sums/$FHASH" ]] && cmp -s "sums/$FHASH" "$F";
  then
    echo "Dup." ;
    rm "$F" ;
    cp --reflink "sums/$FHASH" "$F" ;

  # It's a new file, create a hash entry.
  else
    echo "New." ;
    cp --reflink "$F" "sums/$FHASH" ;
  fi
done
rm sums/*
rmdir sums
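
To see what a reflink copy actually buys you, here is a minimal sketch. It uses `--reflink=auto`, which silently falls back to a plain copy on filesystems without CoW support, so it runs anywhere; on BTRFS the copy is an instant, space-efficient clone:

```shell
#!/bin/bash
# Demo: a reflink copy matches the original until one side is
# written; writing to the clone does not touch the original.
set -e
D=$(mktemp -d)
echo "original content" > "$D/a"
cp --reflink=auto "$D/a" "$D/b"
cmp -s "$D/a" "$D/b" && echo "copies match"
echo "modified" > "$D/b"       # write only to the clone
cmp -s "$D/a" "$D/b" || echo "copies diverged"
rm -r "$D"
```

On BTRFS, `btrfs filesystem du` should report the clone's space as shared rather than exclusive, which is a quick way to confirm the reflink worked.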

And in general, BTRFS hasn't eaten my data yet; it even survived two power losses …

Update: The script now handles files with special characters in their names. It also makes some assumptions, e.g. that the files are not modified while the script is running.
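
For contrast with reflinks, the hardlink behavior described at the top (a write through one link shows up in all other links) is easy to demonstrate on any POSIX filesystem:

```shell
#!/bin/bash
# Demo: hardlinks share the same inode, so a write through one
# name is visible through every other name.
set -e
D=$(mktemp -d)
echo "v1" > "$D/a"
ln "$D/a" "$D/b"     # hardlink, not a copy
echo "v2" > "$D/a"   # write through the first name
cat "$D/b"           # prints "v2" - the change shows through both names
rm -r "$D"
```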

::: {#footer} [ December 15th, 2011 8:54pm ]{#timestamp} [fs]{.tag} [fedora]{.tag} [file]{.tag} [duplicate]{.tag} [btrfs]{.tag} :::