Data is being generated at an enormous rate as a result of online activity and the use of computing resources. Distributed systems are an efficient mechanism for accessing and handling such large, widely spread volumes of data. One widely used distributed filesystem is the Hadoop Distributed File System (HDFS). HDFS takes a cluster approach to storing huge amounts of data; it is scalable and runs on low-cost commodity hardware. Hadoop uses the MapReduce framework to analyze these large data sets and carry out computations on them in parallel. Hadoop follows a master/slave architecture that decouples system metadata from application data: metadata is stored on a dedicated server, the NameNode, and application data is stored on DataNodes.

In this thesis, the Hadoop architecture and the behaviour of the filesystem and of MapReduce were studied in detail. The study concluded that MapReduce processing is slow, which was confirmed by initial analysis and experiments performed on the default Hadoop configuration. It is known that accessing data from a cache is much faster than accessing it from disk. Collaborative caching is a mechanism in which the caches distributed over clients, dedicated servers, or storage devices form a single cache to serve requests. This mechanism improves performance, reduces access latency, and increases throughput. Coupling it with prefetching improves performance further.

To improve the performance of MapReduce, the thesis proposes a new design for HDFS that introduces caching of references and collaborative caching with prefetching, coupled with Modified-ARC cache replacement. Each DataNode would have a dedicated Cache Manager that maintains information about its local cache and the remote caches and applies the cache replacement algorithm. Initial analysis led to the conclusion that caching references also helps improve performance. Modified-ARC organizes the cache into recent items, frequent items, and a history of evicted items; this is a better cache replacement policy and improves the execution time and performance of MapReduce. The results were evaluated by comparing them with those obtained from the default configuration in pseudo-distributed and fully distributed modes.
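
The cache organization and lookup order described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `TwoListCache` and `read_block` are hypothetical, the replacement policy is a simplified recent/frequent scheme with an eviction history rather than Modified-ARC itself (it omits ARC's adaptive sizing), and the remote caches are modelled as in-process objects instead of other DataNodes.

```python
from collections import OrderedDict


class TwoListCache:
    """Simplified recent/frequent cache with a history of evicted keys.

    Hypothetical illustration only; not the thesis's Modified-ARC.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.recent = OrderedDict()    # blocks accessed once, oldest first
        self.frequent = OrderedDict()  # blocks accessed again, oldest first
        self.history = OrderedDict()   # keys of evicted blocks, no data kept

    def get(self, key):
        if key in self.recent:
            # A second access promotes the block to the frequent list.
            self.frequent[key] = self.recent.pop(key)
            return self.frequent[key]
        if key in self.frequent:
            self.frequent.move_to_end(key)
            return self.frequent[key]
        return None

    def put(self, key, value):
        if key in self.recent:
            self.recent[key] = value
            return
        if key in self.frequent:
            self.frequent[key] = value
            return
        if len(self.recent) + len(self.frequent) >= self.capacity:
            self._evict()
        if key in self.history:
            # Evicted earlier and requested again: treat it as frequent.
            del self.history[key]
            self.frequent[key] = value
        else:
            self.recent[key] = value

    def _evict(self):
        # Evict from the recent list first, then from the frequent list.
        victims = self.recent if self.recent else self.frequent
        key, _ = victims.popitem(last=False)
        self.history[key] = None
        while len(self.history) > self.capacity:
            self.history.popitem(last=False)


def read_block(key, local, remotes, read_from_disk):
    """Try the local cache, then peer caches, then fall back to disk."""
    value = local.get(key)
    if value is not None:
        return value, "local"
    for peer in remotes:
        value = peer.get(key)
        if value is not None:
            local.put(key, value)
            return value, "remote"
    value = read_from_disk(key)
    local.put(key, value)
    return value, "disk"
```

In this sketch a block that was evicted and is then requested again goes straight to the frequent list, which is the role the eviction history plays; how the thesis uses that history may differ.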

Library of Congress Subject Headings

Apache Hadoop; Cache memory; File organization (Computer science); Electronic data processing--Distributed processing

Publication Date


Document Type


Department, Program, or Center

Computer Science (GCCIS)


Kwon, Minseok

Advisor/Committee Member

Heliotis, James


Note: Imported from RIT’s Digital Media Library, running on DSpace, to RIT Scholar Works. A physical copy is available from RIT's Wallace Library at: TK7895.M4 S47 2012


RIT – Main Campus