Digestif

An aid for creating hash digests of large files

Synopsis

Digestif lets you create fast checksums of large files by
skipping sections of the file. It was created with compressed media
files in mind, which generally have such a high information density
that we can get away with a checksum that doesn’t actually consider all
the bits. Someday I’d like to understand the likelihood-of-collision
implications for specific compression algorithms (mp3, h.264, xvid, et al.),
but right now I’m going to settle for guessing at where “good enough for me”
might lie.

One side effect of this approach is that the error-detecting property of
digests is, of course, lost. This is really more of an inescapable artifact
of the problem we’re trying to solve. To create a hash of a really large
file, the biggest bottleneck with modern computers is streaming
5-10 gigs off of the disk. The actual checksumming is cheap by comparison.
By looking at less data, we speed up the hash process immensely, and
in exchange we can no longer notice corruption in the bytes we skip.
Because the purpose I have in mind for this tool is identity checking, not
corruption detection, this issue is not a problem for me.
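
To make the trade-off concrete, here is a minimal Ruby sketch of the
read-a-little, skip-a-lot idea. It is not digestif's actual implementation:
the method name sparse_digest, the read_size and skip_size parameters, their
default values, and the choice to mix the file size into the hash are all
assumptions made for illustration.

```ruby
require "digest"

# Hypothetical helper (not digestif's real code): hash `read_size` bytes,
# seek past the next `skip_size` bytes, and repeat until end of file.
def sparse_digest(path, read_size: 512, skip_size: 1024 * 1024, algorithm: Digest::MD5)
  digest = algorithm.new
  File.open(path, "rb") do |file|
    # Assumption: include the file size so that files of different lengths
    # with identical sampled bytes still hash differently.
    digest.update(file.size.to_s)
    # IO#read with a length returns nil at (or past) end of file.
    while (chunk = file.read(read_size))
      digest.update(chunk)
      # Seeking past the end is allowed; the next read then returns nil.
      file.seek(skip_size, IO::SEEK_CUR)
    end
  end
  digest.hexdigest
end
```

With these defaults only about 512 bytes out of every megabyte are read,
which is where the speedup comes from. It is also why a change confined to a
skipped region leaves the digest untouched.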

Installation

gem install digestif

Usage

Just like md5 on the command line, but it only works on files, not on
streaming data (can’t seek a stream).

digestif some_large_file

Since this program is designed specifically to get around the cost of reading
large files, it didn’t make sense for me to invest in making streams work.
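
The "can't seek a stream" point is easy to check in Ruby. The sketch below is
only an illustration: seekable? is a hypothetical helper, not something
digestif provides. It relies on IO#seek raising Errno::ESPIPE for pipes, which
is the behaviour on POSIX systems.

```ruby
# Hypothetical helper, not part of digestif: reports whether an IO object
# supports seeking, which the skip-ahead approach depends on.
def seekable?(io)
  io.seek(0, IO::SEEK_CUR) # a zero-byte move; fails only if seeking is unsupported
  true
rescue Errno::ESPIPE
  false
end
```

A regular file opened with File.open passes this check, while the read end of
a pipe (for example, standard input fed by another command) does not.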

For a detailed look at the options, see

digestif --help

Motivation

I wrote digestif to solve a problem for a media catalogue I was working on.
I wanted a filename-independent way to evaluate whether or not a file was in
the catalogue yet, but the files were so large that streaming the whole file
off of the hard drive was too slow for the response time I was hoping for.
(Interested parties, I was getting 5 gigs hashed using md5 in about 2.4
minutes.)

Author

Copyright 2011 Andrew Roberts