READ Free Dumps For Cloudera- CCD-410
Question ID 12499 | A client application creates an HDFS file named foo.txt with a replication factor of 3. Identify
which best describes the file access rules in HDFS if the file has a single block that is
stored on data nodes A, B and C?
|
Option A | The file will be marked as corrupted if data node B fails during the creation of the file.
|
Option B | Each data node locks the local file to prohibit concurrent readers and writers of the file.
|
Option C | Each data node stores a copy of the file in the local file system with the same name as the HDFS file.
|
Option D | The file can be accessed if at least one of the data nodes storing the file is available.
|
Correct Answer | D |
Explanation Explanation: HDFS keeps three copies of a block on three different datanodes to protect against true data corruption. HDFS also tries to distribute these three replicas on more than one rack to protect against data availability issues. The fact that HDFS actively monitors any failed datanode(s) and upon failure detection immediately schedules re-replication of blocks (if needed) implies that three copies of data on three different nodes is sufficient to avoid corrupted files. Note: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses rack-aware replica placement policy. In default configuration there are total 3 copies of a datablock on HDFS, 2 copies are stored on datanodes on same rack and 3rd copy on a different rack. Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers , How the HDFS Blocks are replicated?
Question ID 12500 | The Hadoop framework provides a mechanism for coping with machine issues such as
faulty configuration or impending hardware failure. MapReduce detects that one or a
number of machines are performing poorly and starts more copies of a map or reduce task.
All the tasks run simultaneously and the task finish first are used. This is called:
|
Option A | Combine
|
Option B | IdentityMapper
|
Option C | IdentityReducer
|
Option D | Default Partitioner
|
Option E | Speculative Execution
|
Correct Answer | E |
Explanation Explanation: Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes. By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first. Reference: Apache Hadoop, Module 4: MapReduce Note: * Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first one to finish wins, and the other copies are killed. Failed tasks are tasks that error out. * There are a few reasons Hadoop can kill tasks by his own decisions: a) Task does not report progress during timeout (default is 10 minutes) b) FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue (CapacityScheduler). c) Speculative execution causes results of task not to be needed since it has completed on other place. Reference: Difference