GFS and HDFS

About distributed file systems

Introduction

Both GFS and HDFS are distributed file systems. GFS stands for Google File System and was designed by Google, while HDFS is an open-source distributed file system modeled on GFS. The motivation for distributed file systems was the astronomical growth of data. Google designed GFS because the index of its search engine had grown enormously. The creator of HDFS, who is also the creator of the Lucene search engine, encountered similar problems. However, designing a distributed file system is not as simple as designing a file system for a single machine. There are several challenges to overcome:

  • fault tolerance
  • high performance
  • network communication
  • replicas
  • consistency

The architecture of GFS consists of a master server and chunkservers. The master server is responsible for managing the namespace, access control, and the mapping from files to chunks, whereas the chunkservers are responsible for storing the actual data.
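To make the master's bookkeeping concrete, here is a minimal Python sketch of the two mappings it keeps: file name to chunk handles, and chunk handle to chunkserver locations. The class and method names (`MasterMetadata`, `add_chunk`, `lookup`) are made up for illustration and are not GFS's real interface.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkInfo:
    handle: int                                      # unique id of the chunk
    version: int = 1                                 # used to spot stale replicas
    locations: list = field(default_factory=list)    # chunkservers holding a replica


@dataclass
class MasterMetadata:
    # file name -> ordered list of chunk handles (position = chunk index)
    files: dict = field(default_factory=dict)
    # chunk handle -> ChunkInfo
    chunks: dict = field(default_factory=dict)
    next_handle: int = 0

    def add_chunk(self, path, servers):
        """Append a new chunk to `path`, replicated on `servers`."""
        handle = self.next_handle
        self.next_handle += 1
        self.chunks[handle] = ChunkInfo(handle, locations=list(servers))
        self.files.setdefault(path, []).append(handle)
        return handle

    def lookup(self, path, chunk_index):
        """Return (chunk handle, replica locations) for one chunk of a file."""
        handle = self.files[path][chunk_index]
        return handle, self.chunks[handle].locations
```

Note that the master only holds this metadata. The chunk contents never pass through it.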

[Figure: GFS]

Let's walk through the process of storing a file in GFS. First, the client sends the file name and chunk index of the data it wants to store to the master server. The master then decides which chunkservers will hold the chunk and where it will be located. Once that is settled, the master sends the necessary information back to the client, typically the chunk handle and the chunk locations. From then on, the client talks to the chunkservers directly.
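The two-step pattern described above (ask the master where a chunk lives, then talk to a chunkserver directly) can be sketched roughly as follows. This is a toy in-memory model shown for a read, not real GFS code. The `Master`, `ChunkServer` and `read` names are hypothetical. The 64 MB chunk size comes from the GFS paper, and the sketch assumes the requested range fits inside one chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # chunk size reported in the GFS paper


class Master:
    """Holds only metadata: (file name, chunk index) -> (handle, locations)."""

    def __init__(self):
        self.table = {}

    def locate(self, filename, chunk_index):
        return self.table[(filename, chunk_index)]


class ChunkServer:
    """Holds the actual bytes, keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]


def read(master, servers, filename, offset, length, chunk_size=CHUNK_SIZE):
    # 1. The client turns the byte offset into a chunk index.
    chunk_index = offset // chunk_size
    # 2. It asks the master for the chunk handle and replica locations.
    handle, locations = master.locate(filename, chunk_index)
    # 3. It then reads from a chunkserver directly, bypassing the master.
    server = servers[locations[0]]
    return server.read(handle, offset % chunk_size, length)
```

With a tiny `chunk_size=4` for experimenting, byte offset 5 falls in chunk index 1 at position 1 within that chunk.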

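Failures are handled as follows:

  • master server failure -> the master recovers its state from the operation log
  • chunkserver failure -> the master detects it through missing heartbeat signals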
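Both mechanisms can be illustrated with a small Python sketch. It assumes a simplified operation log that records only `create` and `delete`, and a heartbeat monitor with a fixed timeout. `OperationLog`, `HeartbeatMonitor` and their methods are illustrative names, not GFS's real interface.

```python
class OperationLog:
    """Append-only record of metadata changes, replayed after a master crash."""

    def __init__(self):
        self.records = []

    def append(self, op, path):
        self.records.append((op, path))

    def replay(self):
        # Rebuild the namespace by applying every logged operation in order.
        namespace = set()
        for op, path in self.records:
            if op == "create":
                namespace.add(path)
            elif op == "delete":
                namespace.discard(path)
        return namespace


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each chunkserver."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def beat(self, server, now):
        self.last_seen[server] = now

    def dead_servers(self, now):
        # A server is presumed dead if it has been silent longer than the timeout.
        return sorted(s for s, t in self.last_seen.items()
                      if now - t > self.timeout)
```

Timestamps are passed in explicitly so that the behaviour is easy to test. A real system would use a clock and would also re-replicate the chunks that a dead server held.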

A write proceeds as follows:

  1. The master finds the chunkservers holding the most up-to-date copy of the chunk by checking the version number.
  2. It picks one of them as the primary; the rest are secondaries.
  3. It increments the version number.
  4. The master tells the client which chunkservers are the primary and the secondaries.
  5. When all secondaries say yes to the primary, the primary says yes to the client; otherwise, it says no.
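The write steps above can be simulated in a few lines of Python. This is a simplified, single-process sketch: `Replica`, `choose_primary` and `write` are hypothetical names, a replica failure is modelled with a `healthy` flag, and the version bump is applied only to the up-to-date replicas.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    version: int
    healthy: bool = True
    data: list = field(default_factory=list)

    def apply(self, record):
        """Apply a write; return True on success (a 'yes'), False otherwise."""
        if not self.healthy:
            return False
        self.data.append(record)
        return True


def choose_primary(replicas):
    # Steps 1-3: keep only replicas with the latest version, pick the first
    # as primary, and increment the version number on the current replicas.
    latest = max(r.version for r in replicas)
    current = [r for r in replicas if r.version == latest]
    for r in current:
        r.version = latest + 1
    return current[0], current[1:]


def write(replicas, record):
    # Step 4: the client learns who the primary and the secondaries are.
    primary, secondaries = choose_primary(replicas)
    if not primary.apply(record):
        return False
    # Step 5: the primary reports success only if every secondary said yes.
    acks = [s.apply(record) for s in secondaries]
    return all(acks)
```

A stale replica (one with an older version number) is skipped entirely, so it neither receives the write nor gets its version bumped.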

GFS has been successful in many Google applications that rely on the underlying distributed file system. However, there are still several bottlenecks: there is only one master, which has to handle thousands of requests, and the master needs a large amount of memory to store the chunk handle information.
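For a rough feel of how the master's memory use scales, here is a back-of-the-envelope sketch. The figures are assumptions taken from the GFS paper (64 MB chunks and under 64 bytes of metadata per chunk), not from this post, and the function name is made up. It ignores per-file namespace metadata.

```python
def master_metadata_bytes(total_data_bytes,
                          chunk_size=64 * 2**20,   # 64 MB chunks (GFS paper)
                          bytes_per_chunk=64):     # upper bound per chunk (GFS paper)
    """Estimate the memory the master needs for chunk metadata."""
    # Ceiling division: a partly filled chunk still needs a full entry.
    chunks = -(-total_data_bytes // chunk_size)
    return chunks * bytes_per_chunk
```

Under these assumptions the estimate grows linearly with the number of chunks, so the amount of data a single master can track is ultimately bounded by its memory.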

Terry Pan
Student of Data Science

My research interests include Machine Learning, Data Science, Information Security and Software Engineering. I like to think like an engineer to tackle real-world problems.
