Project 3 - File System
Due: Wednesday, December 4, 2002, 11:59 p.m.
For under $1000, you can now purchase more disk space than was
available on the largest supercomputing systems of only ten years ago
(systems that used to cost upwards of $10 million). This
trend shows no signs of stopping---within a few years, home computers
will likely have a terabyte or more of disk storage. Needless to
say, all of this storage capacity opens up new possibilities for file
system research. After all, aside from storing lots of MP3s, there
must be some other things that you could do with all of that disk
space.
One irony of large disks is that as capacity increases, data management
becomes more difficult. Over time, the file system may store hundreds
of thousands of files (case in point: Dave's home directory currently
contains 242315 files, not counting this one). Moreover, there is
often little incentive to delete anything. After all, disks are so
cheap that if you run out of space, it is often easier to just buy
another disk than spend several weeks trying to clean up the
mess.
Needless to say, the data management issue has generated much thought
in the file system community. For example, most systems provide
simple tools for searching the file system. Some experimental file
systems have even provided versioning--that is, they store copies of every
file that has ever existed and all versions of a file that might have
been modified. These systems are interesting because you can often "roll back" the
file system to an arbitrary date (e.g., by modifying the "cd" command with some kind of
magic time warp feature).
In this project, you are going to design an unusual file
system that tries to tame some of the data management complexity. Of course,
it will eat up some disk space as well.
The "Links" File System
In a normal file system, files contain both data and metadata.
Metadata is the information that would be stored in the file inode and
displayed in a directory listing: for example, the name of the file,
the size, permissions, ownership, modification times, and so forth.
One notable omission from the metadata is information about file
usage. For example, what program was used to generate a
particular file? Or what programs have used a file as
input?
The lack of this information is subtle, but consider the utility of
such information if it were available:
- You find a suspicious file and you're concerned about system security--perhaps
your system got hacked. What program created this file? What programs are using
this file?
- You're cleaning up the system and you want to remove a file. However, you're
not quite sure what the file is for. Can you produce a list of all programs that
have used the file?
- For some odd reason, you're poking around in some old class
directory at a project you did five years ago. You come
across an interesting plot of some data. However, you can't quite
remember what data was used to generate the plot. Can you query the file
system and ask it "what program was used to make this plot?" Furthermore,
can you ask the file system to produce the input files that were used?
- You find a critical floating point error in a shared library. Can you identify all
data files on the system that were generated by programs that used that library?
- You're teaching an OS class and you're concerned about the "sharing" of code.
Given a student kernel, can you construct a dependency graph that shows the
relationship of the kernel to all other files on the file system?
Your task in this project is to design a file system (or file system modification)
that allows a user to make these types of queries about file usage. In a sense,
the file system will maintain a collection of links that define the relationship
between files and programs that created them.
To do this, you need to figure out how to merge file metadata with
process accounting data in some kind of coherent manner. Specifically,
for each process that runs on the machine, the system needs to create a record
containing at least the following information:
- What program ran (i.e., which executable file was used)?
- What files were used as input?
- What files were generated as output?
Using this information, it is possible to construct a large graph that
encodes dependencies between files and programs. This information, in turn,
can be used to answer questions about the origin and use of specific files.
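As a rough illustration of the record and graph described above, here is a small Python sketch. The names `ProcessRecord` and `build_graph`, the field layout, and the example paths are all hypothetical; this is one possible in-memory representation under those assumptions, not a required design.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProcessRecord:
    """One record per process run (hypothetical layout)."""
    program: str                                  # executable file that ran
    inputs: list = field(default_factory=list)    # files used as input
    outputs: list = field(default_factory=list)   # files generated as output


def build_graph(records):
    """Return (readers, creators): each maps a file to a set of programs."""
    readers = defaultdict(set)    # file -> programs that read it
    creators = defaultdict(set)   # file -> programs that wrote it
    for rec in records:
        for path in rec.inputs:
            readers[path].add(rec.program)
        for path in rec.outputs:
            creators[path].add(rec.program)
    return readers, creators


# Made-up example: a plotting program reads data and writes a plot,
# and a viewer later reads the plot.
records = [
    ProcessRecord("/usr/bin/gnuplot", ["/home/u/data.txt"], ["/home/u/plot.ps"]),
    ProcessRecord("/usr/bin/gv", ["/home/u/plot.ps"], []),
]
readers, creators = build_graph(records)
```

With these two indexes, "what program created this file?" and "what programs have used this file?" each become a single dictionary lookup. A real design would also have to decide where such indexes live on disk and how they are kept up to date.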
Your Task
Your task is to think about how you might modify the operating system and/or the
file system to provide this extra functionality. To do this, you need to
address the following questions:
Part I : Data Collection
- How would you go about collecting file information from a process? In other words,
how would you collect information about opened files, read files, modified files, etc?
- What kind of information would be collected and how much space would it take?
- What is the performance impact of collecting this information?
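One possible starting point for the collection questions above is to observe each open() call and classify the file by its access mode. The Python sketch below assumes a hypothetical `OpenLog` object that some hook feeds with (path, flags) pairs; only the `os.O_*` constants are real. Note that it records how a file was opened, which is not the same as whether the file was actually read or modified.

```python
import os

# Mask covering the access-mode bits of the open() flags.
ACCESS_MASK = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


def classify_open(flags):
    """Map open() flags to the roles the file may play for the process."""
    mode = flags & ACCESS_MASK
    roles = set()
    if mode in (os.O_RDONLY, os.O_RDWR):
        roles.add("input")
    if mode in (os.O_WRONLY, os.O_RDWR):
        roles.add("output")
    return roles


class OpenLog:
    """Hypothetical per-process log, filled in as open() calls are observed."""

    def __init__(self, program):
        self.program = program
        self.inputs = set()    # paths opened for reading
        self.outputs = set()   # paths opened for writing

    def record_open(self, path, flags):
        roles = classify_open(flags)
        if "input" in roles:
            self.inputs.add(path)
        if "output" in roles:
            self.outputs.add(path)
```

Using sets means repeated opens of the same path cost no extra space, which bears on the space question. The performance cost in this sketch is one mask operation and one set insertion per open() call, although a kernel implementation would face different trade-offs.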
Part II : Data Representation
- How would you incorporate process data with file metadata? Where would the
extra information be stored? How would it be stored?
- Would you modify inodes to include extra information? If so, what would
you add? If not, how would you store extra information?
- Are there any ways to reduce the amount of stored information?
Part III: Queries
- Given a file, how would you display a list of all programs that read
that file? How long would it take?
- Given a file, how would you list the program that created it? How
long would it take?
- How would the file system be used to answer the example queries in the previous
section?
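Some of the example queries, such as finding every data file affected by a faulty shared library, require following dependencies through several programs. The sketch below shows one way to do that with a breadth-first search, assuming the process records are available as (program, inputs, outputs) tuples. The function name `derived_files`, the tuple layout, and the paths are hypothetical.

```python
from collections import deque


def derived_files(records, source):
    """Return every file that depends, directly or transitively, on `source`.

    `records` is an iterable of (program, inputs, outputs) tuples. The
    program's own executable is treated as one more input of the run.
    """
    # Index: file -> outputs of every run that used that file.
    produced_from = {}
    for program, inputs, outputs in records:
        for used in [program, *inputs]:
            produced_from.setdefault(used, set()).update(outputs)

    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for out in produced_from.get(current, ()):
            if out not in seen:   # also guards against cycles
                seen.add(out)
                queue.append(out)
    return seen


# Made-up records: a simulation uses a library, a plotter uses its output.
records = [
    ("/usr/bin/sim", ["/lib/libm.so", "in.dat"], ["out.dat"]),
    ("/usr/bin/plot", ["out.dat"], ["plot.ps"]),
    ("/bin/cat", ["notes.txt"], ["copy.txt"]),
]
affected = derived_files(records, "/lib/libm.so")
```

In this example `affected` contains `out.dat` and `plot.ps` but not `copy.txt`. The running time is proportional to the number of records scanned plus the number of dependency edges followed, which is one way to start thinking about the "how long would it take?" questions.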
Part IV: User Interface
Think of a nice user interface for performing file queries. What would be the
easiest way for a user to use the file system?
Part V: Corner Cases
- Are there any tasks where recording this extra information would be problematic?
For example, it might generate too much data, or the recorded information might be irrelevant.
- A lot of programs generate short-lived temporary files. How would the file
system handle temporaries?
- More generally, how would the file system handle file removal? How is the metadata
updated?
- How would the file system deal with mundane tasks like backups, copies, moves, file renaming, etc?
- Are there any other corner cases?
Part VI: Random Thoughts
- Would building this file system be feasible? If so, why? If not, why not?
- Do you think a file system like this would be useful?
Grading
There is no right solution to this project. Your grade is determined by
the level of detail provided in answers to the above questions.