Notes for Week 8

Most software is distributed via CD-ROM, which constitutes its own backup. The lifetime of a manufactured CD-ROM is almost certainly longer than the period of usefulness of its contents, and even CD-Rs and DVD-Rs are expected to last up to 20 years when kept away from extremes of temperature and sunlight. It therefore behooves the system administrator to segregate, whenever possible, the files that are created or modified after software installation is complete. Carefully backing up system configuration files and ensuring that all non-system files reside in /home (for example) makes the job of determining what needs to be backed up much easier.

All backups should be multi-generational: you should never have just one copy of the backup media (disk or tape). If a system failure should occur while writing to your only backup, the backup and the original are both potentially lost. If you back up to, for example, flash drives, keep some number (n) of them and number them; after using #1, use #2 the next time you back up, #3 the time after that, etc. After using #n, the next backup will be on #1, and so on. This not only protects previous backups while creating new ones, but also gives you a recent file history, so that problems can be tracked, or files deleted since the last backup rotation can be restored.
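
The rotation described above can be sketched in shell. This is only a minimal illustration: the function name next_slot and the state file that remembers the last medium used are hypothetical, not part of any standard tool.

```shell
#!/bin/sh
# Hypothetical helper: print the number of the medium to use for the next backup.
# $1 = state file recording the last slot used, $2 = n (number of media in rotation)
next_slot() {
    state_file=$1
    n=$2
    last=0
    # If a previous run recorded a slot, read it back.
    [ -f "$state_file" ] && last=$(cat "$state_file")
    # Slots run 1..n; after #n the rotation wraps around to #1.
    slot=$(( last % n + 1 ))
    echo "$slot" > "$state_file"
    echo "$slot"
}

# Demonstration in a scratch directory with n=3 media.
demo_dir=$(mktemp -d)
s1=$(next_slot "$demo_dir/last_slot" 3)
s2=$(next_slot "$demo_dir/last_slot" 3)
s3=$(next_slot "$demo_dir/last_slot" 3)
s4=$(next_slot "$demo_dir/last_slot" 3)
echo "$s1 $s2 $s3 $s4"   # prints "1 2 3 1"
rm -rf "$demo_dir"
```

In a real script the number printed would select which flash drive (or which directory on the backup device) receives the new tarball.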

The amount of disk space taken up by files modified or created since software installation may be larger than you want to back up every day. In that case, you can back up incrementally on a daily basis, backing up only those files that have been modified since the previous backup, and then do a complete backup on, for example, a weekly basis. When using incremental backups, your partial backups form one multi-generational set, and your complete backups form another. Typically, complete backups might be to tape, CD-R or DVD-R, while incremental backups might be to flash drive or tape. Floppy disks are not stable enough to be relied upon for backup purposes.
Be aware that flash drives have a limited number of write cycles to each location; they do go bad!

Backups can be made remotely via network connection, but keep in mind that network transmission is not foolproof: CRCs and checksums are good, but errors still go through undetected (roughly one in three million Ethernet packets will have an undetected error, given that one bit error is undetected in two to the N for CRC-N). Off-site storage of complete backups is a good idea, since the possibility of fire or natural disaster is always present, but the odds of multiple sites incurring simultaneous loss are extremely small.

The tar command is used to make backups (much like WINZIP in Windows); tar output is called a tarball (with typical suffix ".tar"), and if compressed, is called a "zipped tarball" (with typical suffix ".tar.gz" or ".tgz"). Several variants of the tar command are useful; note that all use the z option to compress the backup (the j option will do slightly better compression but takes considerably longer):
- tar -czf /media/usbstg/home.tgz /home/* (to back up the contents of /home to the device mounted at /media/usbstg)
  Note that you have to be careful using the wildcard asterisk. If Bob wanted to back up his home directory and specified "/home/bob/*", he would not get any of the so-called "hidden" files in the directory /home/bob whose names begin with a period (he would get the hidden files in any subdirectories of /home/bob). If he specified "/home/bob/", he would get everything.
- tar -czf backup.tgz /home/* (to back up the contents of /home to a file on disk, which can later be copied to the backup media; this gives an added measure of redundancy, if enough disk space is available)
- tar -czf files.tgz -T list_of_files.txt (to back up those files listed in the file list_of_files.txt)
- tar -czPf backup.tgz /home/* (-P retains leading "/" in the filename, so that restored files may overwrite the originals; note that to extract the files into their original locations, you will need -xzPf)
- tar -tzf backup.tgz (to list those files on the backup)
- tar -xvzf backup.tgz -T list_of_files.txt (to extract some files from the backup)
- tar -xzpf backup.tgz (extracts permissions as well as file contents)
- tar -xzmf backup.tgz (sets the modification time stamp of each extracted file to the time of extraction, rather than the time stored in the backup)
- tar -xjf backup.tar.bz2 (extracts a tarball compressed using bzip2)
The v(erbose) option above causes tar to list the files that are being (un)tarred.
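
The wildcard caveat noted in the first example can be checked safely in a scratch directory. The sketch below uses a made-up directory named bob in place of /home/bob and compares the contents of the two resulting tarballs.

```shell
#!/bin/sh
# Scratch directory standing in for /home (hypothetical contents).
demo=$(mktemp -d)
mkdir -p "$demo/bob/docs"
touch "$demo/bob/.profile" "$demo/bob/notes.txt" \
      "$demo/bob/docs/.hidden" "$demo/bob/docs/a.txt"
cd "$demo" || exit 1

# The shell expands bob/* before tar runs, and * skips names starting with a period.
tar -czf glob.tgz bob/*
# Naming the directory itself lets tar walk it, hidden files included.
tar -czf dir.tgz bob/

glob_list=$(tar -tzf glob.tgz | sort)
dir_list=$(tar -tzf dir.tgz | sort)
echo "With bob/* :"; echo "$glob_list"
echo "With bob/  :"; echo "$dir_list"

cd / && rm -rf "$demo"
```

The first listing lacks bob/.profile but still contains bob/docs/.hidden, because tar itself (not the shell) descends into the subdirectory.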
In the examples above, root can back up the user files in the /home/ directory to a file in /root, but how does an ordinary user back up their home directory to a file? They can use /tmp/: anyone can write to it, but anyone can also read from it. To get around the security implications of that, first touch the backup file in /tmp/, then chmod the file so that only you can read it, and be sure to erase it when you are done.
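Our distribution has umask set to 0007, so a user can place a file in /tmp without worrying about other users reading it (a umask of 0007 still leaves the file readable by its group, so this relies on each user having a private primary group).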
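
A possible way to create a private backup file in /tmp is sketched below. It swaps mktemp in for the touch step, since mktemp creates the file with an unpredictable name; the chmod is kept so the permissions do not depend on mktemp's defaults. The name prefix home-backup is made up, and a scratch directory stands in for the user's home directory.

```shell
#!/bin/sh
# Create an empty, uniquely named file in /tmp (the name prefix is hypothetical).
backup=$(mktemp /tmp/home-backup.XXXXXX) || exit 1
chmod 600 "$backup"          # owner may read and write; nobody else may do anything

# A scratch directory stands in for the user's home directory in this sketch.
src=$(mktemp -d)
echo "example" > "$src/file.txt"

# tar writes into the existing file, so the restrictive permissions are kept.
tar -czf "$backup" -C "$src" .

perms=$(ls -l "$backup" | cut -c1-10)
echo "$perms"                # prints "-rw-------"

# ... copy "$backup" to the backup media here ...

rm -f "$backup"              # erase the copy in /tmp when done
rm -rf "$src"
```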

You might also want to exclude some directories from your backup; for example, before the "/home/*" you might add
--exclude='home/*/.cache/*'
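
To show where the option goes in a complete command, here is a small sketch run in a scratch directory. The directory layout is invented for the demonstration, and relative paths (home/*) are used so that nothing outside the scratch directory is touched.

```shell
#!/bin/sh
# Scratch tree with one cache file and one file worth keeping (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/home/bob/.cache" "$demo/home/bob/docs"
touch "$demo/home/bob/.cache/junk" "$demo/home/bob/docs/keep.txt"
cd "$demo" || exit 1

# The --exclude option is placed before the list of things to back up.
tar -czf backup.tgz --exclude='home/*/.cache/*' home/*

listing=$(tar -tzf backup.tgz)
echo "$listing"

cd / && rm -rf "$demo"
```

The listing contains home/bob/docs/keep.txt but not home/bob/.cache/junk.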

The find command can be used to create a list of files for input to another program (e.g., tar). It has many options, but a few of the most useful are demonstrated here:
- find starting-directory -name pattern
  Here, "starting-directory" is the place in the directory tree at which to begin the search (the search will continue through the entire directory sub-tree from this vertex), and "pattern" is a file name pattern (either a full file or directory name, or one that uses wild cards, e.g. '*.mpg' or 'X*').
- find starting-directory -iname pattern
  In this example, the file name search will be case insensitive.
- find starting-directory -newer path
  Here "path" is a file or directory, and only those files are found which have been modified more recently than path.
  The newer option is very convenient in designing backup scripts, in conjunction with the touch command: if you touch /root/.backup at the end of each backup, find / -newer /root/.backup will find all files and directories modified since the last backup.

Any of the above find commands will find only files (and not directories) if "-type f" is used. This is helpful in backup scripts, since a directory which has only been modified with a single file deletion will be found without using "-type f", but such a directory obviously does not need backing up. On the other hand, a full backup will want to include directories, so that all existing directories are recreatable, even if they are empty.
The option "-type d" will find only directories.
Any of the find commands will produce output more like ls -l by using the option "-ls". This is useful if information about the found files is needed.
All of these options should come after the "starting-directory".
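
The time-stamp technique and the effect of "-type f" can be tried out without touching real data. In the sketch below a scratch directory and a stamp file inside it play the roles of the file system and /root/.backup; all file names are invented.

```shell
#!/bin/sh
# Scratch tree standing in for the files to be backed up (hypothetical names).
demo=$(mktemp -d)
stamp="$demo/.backup"            # plays the role of /root/.backup
echo "old" > "$demo/old.txt"
touch "$stamp"                   # pretend a backup has just finished
sleep 1                          # make sure later changes get a newer time stamp
echo "new" > "$demo/new.txt"

# Without -type f, the directory that gained a file is reported as well.
all_changed=$(find "$demo" -newer "$stamp" | sort)
# With -type f, only regular files modified after the stamp are listed.
files_changed=$(find "$demo" -newer "$stamp" -type f)
echo "$files_changed"

# After a successful backup, reset the stamp for the next run.
touch "$stamp"
after_reset=$(find "$demo" -newer "$stamp" -type f)

rm -rf "$demo"
```

Only new.txt is reported by the second find; after the stamp is touched again, nothing is reported until another file changes.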

A lone "-" given in place of a file name often means stdin, allowing commands to be piped:
find / -newer /root/.backup -type f | tar -czf backup.tgz -T -
Here, the files modified since the last backup will be placed in the gzipped tarball "backup.tgz". The "-T" option to tar tells it to get the list of files to place in the tarball from a file, and "-T -" specifies that stdin is the file to be used.
By piping stdout from the find command (which is of course the list of file names) into stdin for the tar command, we avoid the necessity of creating a temporary file on disk with the list of file names to be backed up. Note that stdin and stdout (as well as stderr) are streams: a stream of data can only be scanned once (the water in a stream passes by only once; it never returns as long as you ignore evaporation and rain). This means that some tar options won't work, since the list of file names cannot be rescanned (e.g., -W to verify the tarball).
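
One weakness of a newline-separated list is that a file name may itself contain a newline. Assuming GNU find and GNU tar, the same pipe can be written with NUL-separated names using -print0 and --null. The sketch below uses an invented scratch tree and counts the restored files to confirm that none were lost.

```shell
#!/bin/sh
# Scratch tree with awkward file names (all names are made up for the demo).
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir data restored
touch "data/plain.txt" "data/with space.txt"
touch "$(printf 'data/two\nlines.txt')"     # a name containing a newline

# find emits NUL-terminated names; --null tells tar to read -T input that way.
find data -type f -print0 | tar -czf backup.tgz --null -T -

# Restore elsewhere and count the files (counting NULs, not lines).
tar -xzf backup.tgz -C restored
restored_count=$(find restored -type f -print0 | tr -dc '\000' | wc -c)
echo "$restored_count"

cd / && rm -rf "$demo"
```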
Since tar stores path names it is sometimes hard to remember exactly what filename to use when extracting a specific file. You can use a pipe to make the job easier:
tar -tzf backup.tgz | grep -e 'filename' | tar -xvzf backup.tgz -T -
will list the table of contents of the tarball, grep the one with the filename you are looking for (but with the path information tar needs to find it) and extract that file.

It is often helpful to include the date in the name of a backup file. An example might be:
tar -czf backup.$(date +%y%m%d).tgz /home/*
The "$(command)" inserts the output of the command into the filename; the "+%y%m%d" parameter to the date command gives the date in the form "YYMMDD", which is convenient for allowing ls to show the backup files in order.

There are two commands which can be used in or out of scripts to compare files: diff (like FC in Windows), which compares two text files, and cmp (like COMP in Windows), which compares two binary files. These commands are extremely useful when checking to see what has changed between (for instance) a configuration file and its backup copy, or to see if two binaries (programs) are the same. For instance:
cmp backup.tgz /media/usbstg/backup.tgz
will compare the backup tarball on the hard drive with the one you just wrote to the backup flash drive.
Note that if you write a file to disk and then use cmp to see if it has been written correctly, you must first umount and re-mount the disk. This guarantees that you are checking the actual file on the disk, and not just the copy in cache.
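
In a script, the exit status of cmp is more convenient than its printed output. The following sketch uses scratch files in place of backup.tgz and the copy on /media/usbstg; the unmount and remount step is shown only as a comment because it requires root and real media.

```shell
#!/bin/sh
# Scratch files stand in for the tarball and its copy on the backup media.
demo=$(mktemp -d)
printf 'pretend this is a tarball\n' > "$demo/backup.tgz"
cp "$demo/backup.tgz" "$demo/copy.tgz"

# On real removable media, unmount and remount here so that cmp reads the
# copy back from the device and not from cache, for example (assuming an
# /etc/fstab entry for the mount point):
#   umount /media/usbstg && mount /media/usbstg

# cmp -s prints nothing; an exit status of 0 means the files are identical.
if cmp -s "$demo/backup.tgz" "$demo/copy.tgz"; then
    good_copy="verified"
else
    good_copy="FAILED"
fi

# Simulate a damaged copy by appending one byte, then compare again.
printf 'x' >> "$demo/copy.tgz"
if cmp -s "$demo/backup.tgz" "$demo/copy.tgz"; then
    bad_copy="verified"
else
    bad_copy="FAILED"
fi

echo "$good_copy / $bad_copy"   # prints "verified / FAILED"
rm -rf "$demo"
```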

You can use md5sum to detect when files have been modified (after transmission across a network, or if searching for possible intrusion):
md5sum filename > filename.md5sum
will compute the checksum and
md5sum -c filename.md5sum
will check the file against the previously computed checksum.
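
The two md5sum commands can be combined in a script by testing the exit status of "md5sum -c", which is zero only when every listed file still matches. A minimal sketch, using a throwaway file in a scratch directory:

```shell
#!/bin/sh
# Work on a throwaway file in a scratch directory.
demo=$(mktemp -d)
cd "$demo" || exit 1
echo "original contents" > filename
md5sum filename > filename.md5sum

# The file is untouched, so the check succeeds.
if md5sum -c filename.md5sum >/dev/null 2>&1; then
    first_check="unchanged"
else
    first_check="MODIFIED"
fi

# Alter the file, then check again: the stored checksum no longer matches.
echo "one extra line" >> filename
if md5sum -c filename.md5sum >/dev/null 2>&1; then
    second_check="unchanged"
else
    second_check="MODIFIED"
fi

echo "$first_check, then $second_check"   # prints "unchanged, then MODIFIED"
cd / && rm -rf "$demo"
```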

EXERCISES for Week 8:
1. Write and debug a shell script to back up your home directory to the media of your choice. Be sure to handle all mounting and unmounting. The script should also check the backup tarball against the original to make sure that it was backed up correctly.
  Be sure to test it!
  Your backup script should be in "/root/bin/backup".
  If you want to get fancy, try this:
  if [ "$1" = "full" ]; then
  (place your full backup code here)
  else
  (place your incremental backup code here)
  fi
  "$1" is the first positional parameter to a script, so "backup" would do an incremental backup while "backup full" would do a full backup.
2. Find all of the files in /var which have been changed since you installed your system. Place the output in the file "/root/installed.files".
3. Once a hacker breaks into your system, he or she immediately replaces several key programs with modified versions which will hide the hacker's tracks. These include login, ls and ps. Compute checksums for these files and verify them. Why would you do this for a root filesystem which is mounted read-only?
  Place the output of the md5sum commands in /root/ (with filenames ending in ".md5").

Please send comments or suggestions to the author.