Distributed file system
Introduction
Both GFS and HDFS are distributed file systems. GFS stands for Google File System and was designed by Google, while HDFS is an open-source distributed file system modeled on GFS. The motivation for distributed file systems was the astronomical growth of data. Google designed GFS because the index of its search engine had grown enormously. The creator of HDFS, who also created the Lucene search engine, encountered similar problems. However, designing a distributed file system is not as simple as designing a single-machine file system. There are several challenges to overcome:
- fault tolerance
- high performance
- network communication
- replicas
- consistency
The architecture of GFS consists of a master server and chunk servers. The master server is responsible for managing the namespace, access control, and the mapping from files to chunks, whereas a chunk server is responsible for storing the actual data.
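The split between metadata and data can be made concrete with a small sketch. This is only an illustration: the class names `MasterMetadata` and `ChunkStore`, and the use of plain in-memory dictionaries, are assumptions made for this example and are not taken from GFS itself.

```python
from dataclasses import dataclass, field


@dataclass
class MasterMetadata:
    """Hypothetical master state: metadata only, never file contents."""
    namespace: set = field(default_factory=set)      # known file paths
    acl: dict = field(default_factory=dict)          # path -> set of allowed users
    file_chunks: dict = field(default_factory=dict)  # path -> ordered chunk handles

    def create(self, path, owner):
        # Register the file in all three metadata tables.
        self.namespace.add(path)
        self.acl[path] = {owner}
        self.file_chunks[path] = []

    def can_access(self, path, user):
        return user in self.acl.get(path, set())


@dataclass
class ChunkStore:
    """Hypothetical chunk server state: chunk handle -> raw bytes."""
    chunks: dict = field(default_factory=dict)

    def put(self, handle, data):
        self.chunks[handle] = data

    def get(self, handle):
        return self.chunks[handle]
```

The point of the sketch is that the master only ever stores names, permissions, and handles, so the file bytes never pass through it.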
Let's walk through the process of creating a file in GFS. First, the client sends information about the file or object it wants to store, namely the file name and chunk index, to the master server. The master server then decides which chunk servers will store the file and where the chunk will be located. After this negotiation, the master server sends the necessary information back to the client, typically the chunk handle and chunk locations. Afterwards, the client can talk to the chunk servers directly.
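The exchange between client and master can be sketched roughly as follows. This is a minimal in-memory model, not the real GFS protocol: the names `Master`, `locate`, and `chunk_index_for` are hypothetical, the round-robin placement is an assumption chosen for simplicity, and the 64 MB chunk size and three replicas are the defaults described in the GFS paper rather than anything stated above.

```python
from dataclasses import dataclass

CHUNK_SIZE = 64 * 1024 * 1024  # assumed: 64 MB chunks, as in the GFS paper


@dataclass
class ChunkInfo:
    handle: int       # identifier of the chunk
    locations: list   # names of the chunk servers holding replicas


class Master:
    """Hypothetical master: maps (file name, chunk index) to chunk info."""

    def __init__(self, chunkservers, replicas=3):
        self.chunkservers = list(chunkservers)
        self.replicas = replicas
        self.chunks = {}       # (file_name, chunk_index) -> ChunkInfo
        self.next_handle = 0

    def locate(self, file_name, chunk_index):
        key = (file_name, chunk_index)
        if key not in self.chunks:
            # Simplified placement: rotate through the servers round-robin.
            start = self.next_handle % len(self.chunkservers)
            rotated = self.chunkservers[start:] + self.chunkservers[:start]
            self.chunks[key] = ChunkInfo(self.next_handle, rotated[:self.replicas])
            self.next_handle += 1
        return self.chunks[key]


def chunk_index_for(offset):
    """Client side: translate a byte offset into a chunk index."""
    return offset // CHUNK_SIZE
```

A client would call `master.locate(name, chunk_index_for(offset))` once, then use the returned handle and locations to contact the chunk servers without involving the master again.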
Failure handling:
- master server failure -> recover by replaying the operation log
- chunk server failure -> detected through missing heartbeat signals
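Both recovery mechanisms can be illustrated with a short sketch. Everything here is an assumption made for the example: the record format, the operation names `create` and `add_chunk`, the classes `OperationLog` and `HeartbeatMonitor`, and the use of an in-memory list in place of durable storage.

```python
import json


class OperationLog:
    """Hypothetical append-only log of metadata mutations."""

    def __init__(self):
        self.records = []  # stands in for a file on durable storage

    def append(self, op, **args):
        self.records.append(json.dumps({"op": op, **args}))


def replay(log):
    """Rebuild the file -> chunk handles mapping after a master restart."""
    namespace = {}
    for line in log.records:
        rec = json.loads(line)
        if rec["op"] == "create":
            namespace[rec["file"]] = []
        elif rec["op"] == "add_chunk":
            namespace[rec["file"]].append(rec["handle"])
    return namespace


class HeartbeatMonitor:
    """Hypothetical tracker of the last heartbeat seen from each chunk server."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # server name -> time of last heartbeat

    def beat(self, server, now):
        self.last_seen[server] = now

    def dead_servers(self, now):
        # A server is presumed dead once it has been silent longer than timeout.
        return sorted(s for s, t in self.last_seen.items()
                      if now - t > self.timeout)
```

Time is passed in explicitly as `now` so the sketch stays deterministic; a real monitor would read a clock instead.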
Write:
1. The master finds the most up-to-date chunk servers by checking the version number.
2. It picks one as the primary; the rest become secondaries.
3. It increments the version number.
4. The master tells the client which chunk servers are the primary and the secondaries.
5. The client sends the write to the primary, which forwards it to the secondaries. When all secondaries say yes to the primary, the primary says yes to the client; otherwise, it says no.
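The write steps above can be sketched as a small simulation. It is a simplification under stated assumptions: `ChunkServer`, `choose_replicas`, and `write` are hypothetical names, the primary is simply the first up-to-date server, and leases, data pipelining, and retries are left out entirely.

```python
class ChunkServer:
    """Hypothetical replica holder with a chunk version number."""

    def __init__(self, name, version=0, healthy=True):
        self.name = name
        self.version = version
        self.healthy = healthy
        self.data = []

    def apply(self, record):
        # Returns True ("yes") if the write was applied, False ("no") otherwise.
        if not self.healthy:
            return False
        self.data.append(record)
        return True


def choose_replicas(servers):
    """Steps 1-3: keep only up-to-date servers, pick a primary, bump the version."""
    latest = max(s.version for s in servers)
    fresh = [s for s in servers if s.version == latest]  # stale replicas excluded
    primary, secondaries = fresh[0], fresh[1:]
    for s in fresh:
        s.version = latest + 1
    return primary, secondaries


def write(primary, secondaries, record):
    """Step 5: the primary answers yes only if every secondary answered yes."""
    if not primary.apply(record):
        return False
    results = [s.apply(record) for s in secondaries]  # forward to all secondaries
    return all(results)
```

Note that a failed write in this sketch leaves the replicas in different states (the primary has the record, a failed secondary does not), which is exactly the kind of inconsistency the client is expected to handle by retrying.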
GFS has been a success in many Google applications that rely on the underlying distributed system. However, there are still several bottlenecks: there is only one master, which has to handle thousands of requests, and the master server needs a large amount of memory to store the chunk handle information.