Hadoop Admin Interview
Questions and Answers
1) How will you decide whether you need to use the Capacity
Scheduler or the Fair Scheduler?
Fair Scheduling is the process in which resources are assigned to
jobs such that all jobs get an equal share of resources over time. The Fair
Scheduler can be used under the following circumstances -
i) If you want the jobs to make equal progress instead of
following the FIFO order, then you must use Fair Scheduling.
ii) If you have slow connectivity and data locality plays a vital
role and makes a significant difference to the job runtime then you must use
Fair Scheduling.
iii) Use Fair Scheduling if there is a lot of variability in the
utilization between pools.
The Capacity Scheduler runs the Hadoop MapReduce cluster as a
shared, multi-tenant cluster to maximize the utilization and throughput of the
Hadoop cluster. The Capacity Scheduler can be used under the following
circumstances -
i) If the jobs require scheduler determinism, then the Capacity
Scheduler can be useful.
ii) The Capacity Scheduler's memory-based scheduling method is useful if the jobs
have varying memory requirements.
iii) If you want to enforce resource allocation because you
know the cluster utilization and workload very well, then use the Capacity
Scheduler.
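As a quick illustration (assuming a Hadoop 1.x cluster where $HADOOP_HOME points at the install), the scheduler in use is selected by the mapred.jobtracker.taskScheduler property in conf/mapred-site.xml; the stock class names are org.apache.hadoop.mapred.FairScheduler and org.apache.hadoop.mapred.CapacityTaskScheduler. A simple way to see which one a cluster is configured for:
grep -A 1 "mapred.jobtracker.taskScheduler" $HADOOP_HOME/conf/mapred-site.xml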
2) What are the daemons required to run a Hadoop cluster?
NameNode, DataNode, TaskTracker and JobTracker
3) How will you restart a NameNode?
The easiest way of doing this is to run the stop-all.sh shell script to stop
all the running daemons. Once this is done, bring the NameNode back up by
running start-all.sh.
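As a minimal sketch (assuming a Hadoop 1.x install with $HADOOP_HOME/bin on the PATH), the NameNode alone can also be restarted without bouncing every daemon:
hadoop-daemon.sh stop namenode
hadoop-daemon.sh start namenode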
4) Explain about the different schedulers available in Hadoop.
· FIFO Scheduler – This scheduler does not consider the
heterogeneity in the system but orders the jobs based on their arrival times in
a queue.
· COSHH – This scheduler considers the workload, cluster and user
heterogeneity for scheduling decisions.
· Fair Sharing – This Hadoop scheduler defines a pool for each user.
The pool contains a number of map and reduce slots on a resource, and each user
can use their own pool to execute jobs.
5) List few Hadoop shell commands that are used to perform a copy
operation.
· hadoop fs -put
· hadoop fs -copyToLocal
· hadoop fs -copyFromLocal
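Illustrative usages of these commands (the paths are hypothetical):
hadoop fs -put /local/data/sales.csv /user/hadoop/sales.csv      # local file system to HDFS
hadoop fs -copyFromLocal /local/data/sales.csv /user/hadoop/     # local file system to HDFS
hadoop fs -copyToLocal /user/hadoop/sales.csv /local/backup/     # HDFS to local file system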
6) What is jps command used for?
The jps command is used to verify whether the daemons that run the
Hadoop cluster are running or not. The output of the jps command shows the status
of the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker.
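A sample jps run on a small Hadoop 1.x node might look like the following (the process IDs are illustrative):
jps
4801 NameNode
4912 SecondaryNameNode
5023 DataNode
5134 JobTracker
5245 TaskTracker
5356 Jps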
7) What are the important hardware considerations when deploying
Hadoop in production environment?
· Memory – The system's memory requirements will vary between the worker
services and management services based on the application.
· Operating System – A 64-bit operating system avoids restrictions on the
amount of memory that can be used on worker nodes.
· Storage – It is preferable to design a Hadoop platform that moves the
compute activity to the data to achieve scalability and high performance.
· Capacity – Large Form Factor (3.5”) disks cost less and allow more storage
when compared to Small Form Factor disks.
· Network – Two TOR (top-of-rack) switches per rack provide better redundancy.
· Computational Capacity – This can be determined by the total number of
MapReduce slots available across all the nodes within a Hadoop cluster.
8) How many NameNodes can you run on a single Hadoop cluster?
Only one.
9) What happens when the NameNode on the Hadoop cluster goes down?
The file system goes offline whenever the NameNode is down.
10) What is the conf/hadoop-env.sh file and which variable in
the file should be set for Hadoop to work?
This file provides the environment for Hadoop to run and contains
the following variables - HADOOP_CLASSPATH, JAVA_HOME and HADOOP_LOG_DIR. The
JAVA_HOME variable must be set for Hadoop to run.
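A minimal sketch of the relevant line in conf/hadoop-env.sh (the JDK path below is only an example and varies per system):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64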
11) Apart from using the jps command, is there any other way to
check whether the NameNode is working or not?
Use the command: /etc/init.d/hadoop-0.20-namenode status
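Two other hedged checks that work on a typical Hadoop 1.x cluster (the hostname is illustrative): the dfsadmin report only succeeds when the NameNode is up, and the NameNode web UI listens on port 50070 by default.
hadoop dfsadmin -report
curl -s http://namenode-host:50070/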
12) In a MapReduce system the HDFS block size is 64 MB, and there are
3 files of size 127 MB, 64 KB and 65 MB processed with FileInputFormat. Under this
scenario, how many input splits are likely to be made by the Hadoop framework?
2 splits each for the 127 MB and 65 MB files and 1 split for the 64 KB file,
since FileInputFormat splits the input at block boundaries: the 127 MB file spans
two blocks (64 MB + 63 MB), the 65 MB file spans two blocks (64 MB + 1 MB), and
the 64 KB file fits within a single block.
13) Which command is used to verify if the HDFS is corrupt or not?
The hadoop fsck (File System Check) command is used to check for
missing or corrupt blocks.
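A typical invocation that checks the whole namespace from the root and prints the affected files, blocks and their locations:
hadoop fsck / -files -blocks -locations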
14) List some use cases of the Hadoop Ecosystem
Text Mining, Graph
Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.
15) How can you kill a Hadoop job?
hadoop job -kill jobID
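Illustrative usage (the job ID below is hypothetical; real IDs can be obtained with hadoop job -list, as in the next question):
hadoop job -kill job_201901011200_0042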
16) I want to see all the jobs running in a Hadoop cluster. How
can you do this?
The command hadoop job -list gives the list of jobs running in a Hadoop cluster.
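Sample invocation; the output lists the currently running jobs along with their JobIds:
hadoop job -list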
17) Is it possible to copy files across multiple clusters?
If yes, how can you accomplish this?
Yes, it is possible to copy files across
multiple Hadoop clusters, and this can be achieved using distributed copy. The
DistCp command is used for intra- or inter-cluster copying.
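A hedged example of an inter-cluster copy with DistCp (the NameNode URIs and paths are illustrative):
hadoop distcp hdfs://nn1.example.com:8020/user/data hdfs://nn2.example.com:8020/user/data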
18) Which is the best operating system to run Hadoop?
Linux, with Ubuntu being a popular choice, is the most preferred
operating system to run Hadoop. Although Windows can also be used to run
Hadoop, it leads to several problems and is not recommended.
19) What are the network requirements to run Hadoop?
· SSH is required to launch server processes on the slave nodes.
· A passwordless SSH connection is required between the master, the
secondary machine and all the slaves; a minimal setup is sketched below.
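A minimal passwordless SSH setup from the master to one slave (the user and hostname are illustrative):
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id hadoop@slave1.example.com
ssh hadoop@slave1.example.com hostname      # should log in without prompting for a password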
20) The mapred.output.compress property is set to true to make sure
that all output files are compressed for efficient space usage on the Hadoop
cluster. If a cluster user does not require compressed data for a particular
job, what would you suggest that he do?
If the user does not want to compress the data for a particular job,
then he should create his own configuration file, set the mapred.output.compress
property to false in it, and load this configuration file as a resource into
the job.
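Assuming the job driver uses ToolRunner/GenericOptionsParser, the property can also be overridden for a single job on the command line (the jar, class and paths below are hypothetical):
hadoop jar my-job.jar com.example.MyDriver -D mapred.output.compress=false /input /output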
21) What is the best practice to deploy a secondary NameNode?
It is always better to deploy the secondary NameNode on a separate
standalone machine. When the secondary NameNode is deployed on a separate
machine, it does not interfere with the operations of the primary NameNode.
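In Hadoop 1.x the conf/masters file lists the host(s) on which start-dfs.sh launches the secondary NameNode, so a sketch of pointing it at a dedicated machine (hostname illustrative) is:
echo "snn.example.com" >> $HADOOP_HOME/conf/masters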
22) How often should the NameNode be reformatted?
The NameNode should never be reformatted. Doing so will result in
complete data loss. NameNode is formatted only once at the beginning after
which it creates the directory structure for file system metadata and namespace
ID for the entire file system.
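For reference, the one-time format performed during the initial cluster setup looks like this; it must never be re-run on a cluster holding data, since it wipes the file system metadata:
hadoop namenode -format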
23) If Hadoop spawns 100 tasks for a job and one of the tasks fails,
what does Hadoop do?
The task will be started again on a new TaskTracker, and if it
fails more than 4 times, which is the default setting (the default value can be
changed), the job will be killed.
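The retry limit is controlled by the mapred.map.max.attempts and mapred.reduce.max.attempts properties (default 4). Assuming the driver uses ToolRunner, a sketch of raising it for one job (jar, class and paths hypothetical):
hadoop jar my-job.jar com.example.MyDriver -D mapred.map.max.attempts=6 /input /output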
24) How can you add and remove nodes from the Hadoop cluster?
· To add new nodes to the HDFS cluster, the hostnames should be added to the
slaves file and then the DataNode and TaskTracker daemons should be started on
the new node.
· To remove or decommission nodes from the HDFS cluster, the hostnames should
be removed from the slaves file, added to the excludes file, and -refreshNodes
should be executed, as sketched below.
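A hedged decommissioning sketch: list the host in the excludes file referenced by dfs.hosts.exclude and ask the NameNode to re-read it (the path and hostname are illustrative):
echo "dn3.example.com" >> $HADOOP_HOME/conf/dfs.exclude
hadoop dfsadmin -refreshNodes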
25) You increase the replication level but notice that the data is
under replicated. What could have gone wrong?
Nothing has necessarily gone wrong. If there is a huge volume of data,
replication takes time based on the data size, because the cluster has to copy
the data, and it might take a few hours.
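Re-replication progress can be watched with fsck, whose summary reports the count of under-replicated blocks:
hadoop fsck / | grep -i "Under-replicated"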
26) Explain about the different configuration files and where
are they located.
The configuration files are located in the “conf” sub-directory.
Hadoop has 3 different configuration files - hdfs-site.xml, core-site.xml and
mapred-site.xml.
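A quick way to confirm that the three files are present in the conf directory (assuming $HADOOP_HOME points at the install):
ls $HADOOP_HOME/conf/ | grep -E "core-site.xml|hdfs-site.xml|mapred-site.xml"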