Project Description
BTRFS is a new, actively developed file system with various advanced features. I wish to implement a content-based-storage mode for the BTRFS file system. This project is also listed in the TODO list on the BTRFS ideas page.
In some applications, such as Internet content caches, the data is read-only more often than not. For such cases, lookup time is the most important metric, and storing data in a conventional file-path-based manner is inefficient. In content-based-storage mode, data is stored on disk and located solely by the "hash" of its content. Lookup is also hash based, so it does not need to walk a directory path and can be very quick. Another advantage of hash-based storage is that identical data maps to the same hash and is stored only once, which avoids data duplication.
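The idea can be illustrated with a minimal user-space sketch. It is not Btrfs code: the table size, the blob limit, all cas_* names, and the use of FNV-1a as the hash are assumptions made for illustration only. It shows the two properties described above: content is found by its hash alone, and storing the same content twice does not create a second copy.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CAS_SLOTS   64   /* hypothetical fixed table size */
#define CAS_MAX_LEN 32   /* hypothetical maximum blob length */

struct cas_entry {
    int           used;
    uint64_t      hash;
    size_t        len;
    unsigned char data[CAS_MAX_LEN];
};

static struct cas_entry cas_table[CAS_SLOTS];

/* FNV-1a: only a stand-in for whatever hash a real design would choose. */
static uint64_t cas_hash(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 14695981039346656037ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/*
 * Store content under its hash (linear probing, no deletion).
 * Returns 1 if newly stored, 0 if identical content was already
 * present, -1 if the blob is too large or the table is full.
 */
static int cas_put(const void *buf, size_t len, uint64_t *hash_out)
{
    assert(hash_out != NULL);
    if (len > CAS_MAX_LEN)
        return -1;

    uint64_t h = cas_hash(buf, len);

    for (size_t i = 0; i < CAS_SLOTS; i++) {
        struct cas_entry *e = &cas_table[(h + i) % CAS_SLOTS];

        if (!e->used) {
            e->used = 1;
            e->hash = h;
            e->len = len;
            memcpy(e->data, buf, len);
            *hash_out = h;
            return 1;
        }
        if (e->hash == h && e->len == len && memcmp(e->data, buf, len) == 0) {
            *hash_out = h;
            return 0;   /* duplicate: nothing new is stored */
        }
    }
    return -1;
}

/* Look content up by hash alone; returns NULL if it is not stored. */
static const struct cas_entry *cas_get(uint64_t hash)
{
    for (size_t i = 0; i < CAS_SLOTS; i++) {
        const struct cas_entry *e = &cas_table[(hash + i) % CAS_SLOTS];

        if (!e->used)
            return NULL;
        if (e->hash == hash)
            return e;
    }
    return NULL;
}
```

Calling cas_put() twice with the same bytes returns 1 and then 0 with the same hash, and cas_get() with that hash returns the single stored copy. A real implementation would also have to decide how to handle hash collisions, which this sketch ignores in cas_get().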
My research at CMU aims at building content caches for routers (https://github.com/harshadjs/xia-content-cache), which demands a file system that allows such a storage mode. I think working on this project over the summer would serve both the BTRFS community and the research at CMU.
Biography and Technical Background
I am a graduate student in Computer Science at Carnegie Mellon University with a research interest primarily in Computer Networks. I use Linux daily and am passionate about open-source software development.
In my undergraduate years, I worked on an open-source Linux kernel project, "Snapshots for Ext4 filesystem". Patches were sent to the Ext4 community for review. I received a mention for my contribution to the project at http://lwn.net/Articles/442078/ .
We were interested in extending the Ext4 snapshots project, and so I participated in Google Summer of Code 2011. My proposal for "Snapshot revert feature for Ext4" was accepted by The Fedora Project and I successfully completed the project. I look forward to continuing that interest and my association with the Fedora project by submitting the proposal "Content-storage mode for BTRFS" for the year 2015.
I worked at a Wi-Fi technology startup, "AirTight Networks", for 3 years (2011-2014) as part of the Linux device drivers team.
I then joined Carnegie Mellon University in May 2014, where my main area of studies is Computer Networks.
You can expect a very high level of fluency with C and Kernel programming from me. This is something that I love to do.
Goals
- 75% Goal
- Create a new "Content" tree. This tree should store hashes of all the extents in the file system.
- Create a "File Hash" tree. This tree will store the mapping from the hash of a file to its inode.
- Provide an option to enable / disable content-storage-mode at mount-time or mkfs-time (TBD).
- Implement all the reference counting mechanisms for extents in this content-tree.
- 100% Goal
- Intercept writes and check if the data that is being written is already in the content tree.
- Intercept reads
- Given the hash of a file, look up the file's inode in the "File Hash" tree.
- Enhance debugging methods available in btrfs (I am not sure which ones are available) to support debugging content-trees.
- 125% Goal
- Provide various mount-time configuration options, such as:
- Remove or Don't remove extents if reference count becomes 0. (Especially useful for our routing application.)
- Verify or Trust the checksum of extents.
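The goals above can be sketched in user space as follows. This is only a toy model under stated assumptions, not Btrfs code: every struct, field, and function name here (content_item, content_tree, file_hash_item, ct_*, fh_lookup, keep_unreferenced) is hypothetical, arrays stand in for the on-disk trees, and the extent hash is passed in directly, so hash collisions and the "verify or trust" option are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CT_MAX 16                 /* hypothetical capacity of the toy tree */

struct content_item {             /* hypothetical "Content" tree item */
    bool     live;
    uint64_t hash;                /* hash of the extent's data */
    uint64_t bytenr;              /* where the extent lives on disk */
    uint32_t refs;                /* number of files referencing it */
};

struct content_tree {
    struct content_item items[CT_MAX];
    bool     keep_unreferenced;   /* hypothetical mount option */
    uint64_t next_bytenr;         /* toy allocator cursor */
};

struct file_hash_item {           /* hypothetical "File Hash" tree item */
    uint64_t file_hash;
    uint64_t ino;
};

static struct content_item *ct_lookup(struct content_tree *t, uint64_t hash)
{
    assert(t != NULL);
    for (size_t i = 0; i < CT_MAX; i++)
        if (t->items[i].live && t->items[i].hash == hash)
            return &t->items[i];
    return NULL;
}

/*
 * Intercepted write: if the hash is already in the content tree, take
 * another reference and reuse the extent; otherwise allocate a new one.
 * Returns the extent's bytenr, or UINT64_MAX if the toy tree is full.
 */
static uint64_t ct_write(struct content_tree *t, uint64_t hash)
{
    struct content_item *it = ct_lookup(t, hash);

    if (it) {
        it->refs++;
        return it->bytenr;
    }
    for (size_t i = 0; i < CT_MAX; i++) {
        it = &t->items[i];
        if (!it->live) {
            it->live = true;
            it->hash = hash;
            it->refs = 1;
            it->bytenr = t->next_bytenr;
            t->next_bytenr += 4096;
            return it->bytenr;
        }
    }
    return UINT64_MAX;
}

/* Drop one reference. Returns true only if the extent was removed. */
static bool ct_put(struct content_tree *t, uint64_t hash)
{
    struct content_item *it = ct_lookup(t, hash);

    if (!it || it->refs == 0)
        return false;
    if (--it->refs > 0)
        return false;
    if (t->keep_unreferenced)
        return false;             /* stays in the tree with refs == 0 */
    it->live = false;
    return true;
}

/* Read path: map a file hash to an inode number; 0 means "not found". */
static uint64_t fh_lookup(const struct file_hash_item *items, size_t n,
                          uint64_t file_hash)
{
    for (size_t i = 0; i < n; i++)
        if (items[i].file_hash == file_hash)
            return items[i].ino;
    return 0;
}
```

With keep_unreferenced set, an extent whose count drops to 0 remains in the tree, so a later write of the same content finds it again instead of allocating a new extent. That is the behaviour the routing application would want from the "Don't remove" setting.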
Milestones of the Project
- M1: Understand the design and code of Btrfs, focusing especially on how the current extent trees, subvolume trees, and snapshot trees are set up initially. Study the on-disk data structures. Most likely, we will need to add some bits to the super-block, for example "content-storage-mode-on/off".
- M2: Understand and identify the code areas wherein the hooks are to be applied. Need to find hooks for:
- Intercepting writes
- Reading extents
- Debugging interfaces
- M3: Write a detailed design draft that covers the overall goal, the required on-disk changes, and the functions to be modified. Share the draft with the BTRFS community and get their views.
- M4: Implementation and testing of the code: 75%
- M5: Implementation and testing of the code: 100%
- M6: Implementation and testing of the code: 125% (If time permits)
- M7: Write documentation of the final product
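As an illustration of the super-block bit mentioned in M1, here is a minimal sketch of how an on/off flag could be tested and set. Btrfs already records optional on-disk features as bits in super-block flag fields such as incompat_flags; everything else here is an assumption: the toy struct, the helper names, and the bit position are hypothetical, and the sketch ignores the little-endian on-disk encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit; the position is chosen arbitrarily. */
#define TOY_FEATURE_CONTENT_STORAGE (1ULL << 40)

/* Toy stand-in for the relevant part of the super-block. */
struct toy_super_block {
    uint64_t incompat_flags;
};

/* Report whether content-storage mode is marked as enabled. */
static bool content_storage_enabled(const struct toy_super_block *sb)
{
    assert(sb != NULL);
    return (sb->incompat_flags & TOY_FEATURE_CONTENT_STORAGE) != 0;
}

/* Set or clear the bit without disturbing any other feature bits. */
static void content_storage_set(struct toy_super_block *sb, bool on)
{
    assert(sb != NULL);
    if (on)
        sb->incompat_flags |= TOY_FEATURE_CONTENT_STORAGE;
    else
        sb->incompat_flags &= ~TOY_FEATURE_CONTENT_STORAGE;
}
```

Whether the mode belongs in an incompatible-feature bit, a compatible one, or a mount option is exactly the kind of question the M3 design draft would need to settle with the community.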
Plan of action
- By the end of week 1: M1, M2
- By the end of week 2: M3
- (Midterm) By the end of week 5: M4
- By the end of week 7: M5
- By the end of week 9: M6
- (End) By the end of week 10: M7
Why choose me?
- Past successful GSoC student (2011).
- Past experience of working with the open source community.
- Strong understanding of file systems, C programming language, the UNIX philosophy, Linux.
- Passionate about contributing to Linux.
Time commitment
Apart from this project, I have research commitments at CMU, so I expect to spend at least 30 hours per week on this project. My final exams end on 13th May 2015, and I hope to start right after that. I will be visiting my hometown (Pune, India) around the end of May or the first week of June. That is the only period when my availability may be reduced. For the rest of the summer, I will be fully focused on the project.