Project 3 - File System

Due: Wednesday, December 4, 2002. 11:59 p.m.

For under $1000, you can now purchase more disk space than was available on the largest supercomputing systems of only ten years ago (systems that used to cost upwards of $10 million). This trend shows no signs of stopping---within a few years, home computers will likely have a terabyte or more of disk storage. All of this storage capacity opens up new possibilities for file system research. After all, aside from storing lots of MP3s, there must be some other things that you could do with all of that disk space.

One irony of large disks is that as capacity increases, data management becomes more difficult. Over time, the file system may store hundreds of thousands of files (case in point: Dave's home directory currently contains 242315 files, not counting this one). Moreover, there is often little incentive to delete anything. After all, disks are so cheap that if you run out of space, it is often easier to just buy another disk than to spend several weeks trying to clean up the mess.

The data management problem has generated much thought in the file system community. For example, most systems provide simple tools for searching the file system. Some experimental file systems have even provided versioning--that is, they keep a copy of every file that has ever existed, along with every version of each file that has been modified. These systems are interesting because you can often "roll back" the file system to an arbitrary date (e.g., modify the "cd" command with some kind of magic time warp feature).

In this project, you are going to design an unusual file system that tries to tame some of the data management complexity. Of course, it will eat up some disk space as well.

The "Links" File System

In a normal file system, files contain both data and metadata. Metadata is the information that would be stored in the file's inode (or, for the name, in its directory entry) and displayed in a directory listing: for example, the name of the file, the size, permissions, ownership, modification times, and so forth. One notable omission from the metadata is information about file usage. For example, what program was used to generate a particular file? Or what programs have used a file as input?

The lack of this information is subtle, but consider the utility of such information if it were available:

You find a suspicious file and you're concerned about system security--perhaps your system got hacked. What program created this file? What programs are using this file?
You're cleaning up the system and you want to remove a file. However, you're not quite sure what the file is for. Can you produce a list of all programs that have used the file?
For some odd reason, you're poking around in some old class directory at a project you did five years ago. You come across an interesting plot of some data. However, you can't quite remember what data was used to generate the plot. Can you query the file system and ask it "what program was used to make this plot?" Furthermore, can you ask the file system to produce the input files that were used?
You find a critical floating point error in a shared library. Can you identify all data files on the system that were generated by programs that used that library?
You're teaching an OS class and you're concerned about the "sharing" of code. Given a student kernel, can you construct a dependency graph that shows the relationship of the kernel to all other files on the file system?

Your task in this project is to design a file system (or file system modification) that allows a user to make these types of queries about file usage. In a sense, the file system will maintain a collection of links that define the relationship between files and the programs that created or used them. To do this, you need to figure out how to merge file metadata with process accounting data in some kind of coherent manner. Specifically, for each process that runs on the machine, you need to create a record that contains at least the following information:

What program ran? (i.e., which executable file was used).
What files were used as input?
What files were generated as output?
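
As a rough illustration only, the three items above could be held in a record like the one below. This is a minimal sketch in Python, not a required design: the names `ProcessRecord`, `executable`, `inputs`, and `outputs` are hypothetical, and so are the example paths. A real design would also have to decide how to identify files (for example, by inode number rather than by path name).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessRecord:
    """One record per process run (hypothetical layout, for illustration)."""
    executable: str                    # which program ran
    inputs: frozenset = frozenset()    # files used as input
    outputs: frozenset = frozenset()   # files generated as output


# Hypothetical example: a plotting tool reads one data file and writes one plot.
record = ProcessRecord(
    executable="/usr/bin/gnuplot",
    inputs=frozenset({"/home/user/run1.dat"}),
    outputs=frozenset({"/home/user/plot.ps"}),
)
```

How much more than this minimum to record (timestamps, user IDs, command-line arguments, and so on) is one of the design decisions you need to justify.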

Using this information, it is possible to construct a large graph that encodes dependencies between files and programs. This information, in turn, can be used to answer questions about the origin and use of specific files.
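
To make the graph idea concrete, here is a small sketch in Python that assumes the records are already available in memory as (executable, inputs, outputs) tuples. The record contents and the helper names `build_index`, `creators`, and `ancestors` are invented for illustration. How the records are actually collected, stored, and indexed on disk is exactly what the project asks you to work out.

```python
from collections import defaultdict

# Hypothetical process records: (executable, input files, output files).
RECORDS = [
    ("/usr/bin/simulate", ["params.txt"], ["run1.dat"]),
    ("/usr/bin/gnuplot", ["run1.dat", "plot.gp"], ["plot.ps"]),
]


def build_index(records):
    """Map each output file to the (executable, inputs) runs that produced it."""
    producers = defaultdict(list)
    for exe, inputs, outputs in records:
        for out in outputs:
            producers[out].append((exe, list(inputs)))
    return producers


def creators(producers, path):
    """Return the programs that wrote the given file."""
    return sorted({exe for exe, _ in producers.get(path, [])})


def ancestors(producers, path):
    """Return every file that directly or indirectly fed into the given file."""
    seen = set()
    stack = [path]
    while stack:
        current = stack.pop()
        for _, inputs in producers.get(current, []):
            for src in inputs:
                if src not in seen:  # guards against cycles in the graph
                    seen.add(src)
                    stack.append(src)
    return seen


index = build_index(RECORDS)
print(creators(index, "plot.ps"))
print(sorted(ancestors(index, "plot.ps")))
```

In this toy data set, asking for the ancestors of `plot.ps` walks back through `run1.dat` to `params.txt`, which is the kind of "what data was used to make this plot?" query described earlier. An in-memory dictionary does not address persistence, scale, or update cost.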

Your Task

Your task is to think about how you might modify the operating system and/or the file system to provide this extra functionality. To do this, you need to address the following questions:

Part I : Data Collection

How would you go about collecting file information from a process? In other words, how would you collect information about opened files, read files, modified files, etc.?
What kind of information would be collected and how much space would it take?
What is the performance impact of collecting this information?

Part II : Data Representation

How would you incorporate process data with file metadata? Where would the extra information be stored? How would it be stored?
Would you modify inodes to include extra information? If so, what would you add? If not, how would you store extra information?
Are there any ways to reduce the amount of stored information?

Part III: Queries

Given a file, how would you display a list of all programs that read that file? How long would it take?
Given a file, how would you list the program that created it? How long would it take?
How would the file system be used to answer the example queries in the previous section?

Part IV: User Interface

Think of a nice user interface for performing file queries. What would be the easiest way for a user to use the file system?

Part V: Corner Cases

Are there any tasks where recording this extra information would be problematic? For example, tasks where recording generates too much data or where the recorded information is irrelevant.
A lot of programs generate short-lived temporary files. How would the file system handle temporaries?
More generally, how would the file system handle file removal? How is the metadata updated?
How would the file system deal with mundane tasks like backups, copies, moves, file renaming, etc.?
Are there any other corner cases?

Part VI: Random Thoughts

Would building this file system be feasible? If so, why? If not, why not?
Do you think a file system like this would be useful?

Grading

There is no right solution to this project. Your grade is determined by the level of detail provided in answers to the above questions.