TCLUG Archive

Mind Bender for CS Linux Geek



Problem
=======

You have a hierarchy of directories and files that you would
like to write to a CD for distribution.  However, there is a
problem: there is not enough room on the CD to do a straight
copy.

Fortunately, many of the files contain exactly the same data,
and there is enough room on the CD to hold one copy of each
unique file.   The iso9660 CD filesystem allows you to
take advantage of this fact, because it allows multiple
filenames that point to the same physical file.
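(The same trick works on an ordinary Linux filesystem via hard
links.  A minimal demonstration, with made-up filenames: after
"ln", both names refer to the same inode, so the data is stored
only once.)

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

# Create a file, then give it a second name with a hard link.
echo "hello" > original.txt
ln original.txt copy.txt

# Both names show the same inode number, and the link count is 2.
ls -i original.txt copy.txt
```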

(The author first saw this technique used by Sun to distribute
 a single installation disk for multiple architectures, and it
 is used today by Microsoft to distribute CDs with multiple
 versions and/or multiple language editions.)

To pull this off on, say, a Linux system, you copy all the files
to a subdirectory, linking identical files together.  You then
build a CD image file with "mkisofs" and mount the image file
with "mount -o loop" to verify it is correct.  Then you write
the image file to a CD-R with "cdrecord".
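Those steps might be scripted roughly as follows.  (A sketch
only: the directory name, mount point, and "dev=" numbers are
placeholders; you would get the real device numbers from
"cdrecord -scanbus", and mounting requires root.)

```shell
#!/bin/sh
# Rough sketch of the CD build, assuming ./cdtree already holds
# the hard-linked copy of the files.

# Build an ISO9660 image with Rock Ridge extensions; hard-linked
# names collapse to a single copy of the file data in the image.
mkisofs -R -o cd.img ./cdtree

# Mount the image on the loop device and look it over.
mkdir -p /mnt/cdtest
mount -o loop cd.img /mnt/cdtest
ls -lR /mnt/cdtest
umount /mnt/cdtest

# Burn it (placeholder device; see "cdrecord -scanbus").
cdrecord dev=0,0,0 speed=4 cd.img
```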

Most of this is pretty straightforward, except for one critical
step.  You must find all the files that have identical data.


Task
====

Assume all the original files are in a named subdirectory.
Using standard Linux tools, describe how to copy these files
to a second subdirectory, using hard links to make only one
copy of each file with identical contents.
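One way to attack this (sketched here under the assumption of
GNU tools and filenames without embedded whitespace, and not
necessarily the solution the author has in mind) is to hash
every file, sort so that identical hashes land next to each
other, and hard-link the duplicates:

```shell
#!/bin/sh
# dedup_copy SRC DST -- copy the tree under SRC to DST, making a
# hard link instead of a second copy whenever a file's data has
# already been seen.  Sketch only: it trusts md5 equality (a
# paranoid version would also "cmp" the two files) and assumes
# filenames without embedded whitespace.
dedup_copy() {
    src=$1 dst=$2

    # Recreate the directory structure.
    mkdir -p "$dst"
    (cd "$src" && find . -type d) | while read d; do
        mkdir -p "$dst/$d"
    done

    # Hash every file; sorting puts identical data side by side.
    (cd "$src" && find . -type f -exec md5sum {} +) | sort |
    while read hash file; do
        if [ "$hash" = "$prev" ]; then
            ln "$dst/$keep" "$dst/$file"   # duplicate: link it
        else
            cp "$src/$file" "$dst/$file"   # first copy of this data
            keep=$file
        fi
        prev=$hash
    done
}
```

Note that the sort makes this O(n log n) rather than strictly
linear, which is what makes the extra credit question below
interesting.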


Constraints
===========

(1) Your method should use standard UNIX tools, possibly
    with a small C program or two.

(2) The method should not make assumptions about the relative
    number of files, their size or content.

(3) The computation required should be an approximately linear
    function of the number of files and the total size of the
    input.

(4) The method should require very moderate memory and
    disk storage.

(5) The task should be fairly simple.  Given a good description
    of what to do, an average shell programmer should be able
    to implement the entire CD build script in no more than
    a day.


Extra Credit
============

Describe how to make the process a worst-case strictly linear
function of the number of files and the total size of the input
data.


Comment
=======

There are many other uses for this algorithm.  It can be used
to find unnecessary or confusing duplicate files, to detect
piracy or locate copies of stolen files, or to save space on a
web server with duplicate *.gif and *.jpg files.


References
==========

For other similar problems, see "The Art of Computer Programming",
Volume 3, "Sorting and Searching" by Don Knuth.



If there is any interest, I will post a solution in a few days.
_________________________________________________________________________
     __ 
    /  )                Gene H. Olson             Email: gene@thinSoft.com
   / __  _ __  _        ThinSoft Corporation      Voice: +1 612 414-4590
  (__/ _(/_//_(/_       Minneapolis, MN, USA      Fax:   +1 612 824-1818