Hadoop Cluster
Interview Questions
1. Which are the modes in which Hadoop can
run?
Hadoop can run in three modes:
- Standalone (local) mode: This is the default mode of Hadoop. It uses the local file system for input and output operations and does not support the use of HDFS. This mode is mainly used for debugging purposes, and it requires no custom configuration in the mapred-site.xml, core-site.xml and hdfs-site.xml files.
- Pseudo-distributed mode: In this case, you need to configure all the three files mentioned above. All daemons run on a single node, so both the Master and Slave roles are on the same machine.
- Fully distributed mode: This is the production phase of Hadoop where data
is distributed across several nodes on a Hadoop cluster. Separate nodes
are allotted as Master and Slaves.
2. What are the features of Standalone
(local) mode?
In stand-alone mode, there are no daemons; everything runs in a single JVM. It has no DFS and it uses the local file system. Stand-alone mode is suitable only for running MapReduce programs during development and testing, and it is one of the least used environments.
3. What are the features of Pseudo mode?
Pseudo-distributed mode is used for both development and testing. In this mode, all the daemons run on the same machine.
4. What are the features of Fully Distributed
mode?
This is an important question as Fully Distributed mode is used in the production environment, where we have ‘n’ number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines: the NameNode runs on one host, while DataNodes run on the other hosts. A NodeManager is installed on every DataNode host and is responsible for executing tasks on that node. All these NodeManagers are managed by the ResourceManager, which receives the processing requests and then passes parts of each request to the corresponding NodeManagers.
5. What is configured in /etc/hosts and what is its role in setting up a Hadoop cluster?
This is a technical question which tests your basic concepts. The /etc/hosts file maps hostnames to their IP addresses. In a Hadoop cluster, we store all the hostnames (master and slaves) with their IP addresses in /etc/hosts so that we can use hostnames instead of IP addresses.
6. What are the default port numbers of
NameNode, ResourceManager & MapReduce JobHistory Server?
You are expected to remember basic server port numbers if you are working with Hadoop. The default port numbers for the corresponding daemons (Hadoop 2.x web UIs) are as follows:
NameNode – 50070
ResourceManager – 8088
MapReduce JobHistory Server – 19888
7. What are the main Hadoop configuration
files?
♣ Tip: Generally, approach this question by telling
the 4 main configuration files in Hadoop and giving their brief descriptions to
show your expertise.
- core-site.xml: core-site.xml
informs Hadoop daemon where NameNode runs on the cluster. It contains configuration
settings of Hadoop core such as I/O settings that are common to HDFS &
MapReduce.
- hdfs-site.xml: hdfs-site.xml
contains configuration settings of HDFS daemons (i.e. NameNode, DataNode,
Secondary NameNode). It also includes the replication factor and block
size of HDFS.
- mapred-site.xml:
mapred-site.xml contains configuration settings of the MapReduce framework,
such as the number of JVMs that can run in parallel, the memory available to
the mappers and reducers, the CPU cores available for a process, etc.
- yarn-site.xml: yarn-site.xml
contains configuration settings of ResourceManager and NodeManager like
application memory management size, the operation needed on program &
algorithm, etc.
These files are located in the etc/hadoop/ directory
inside the Hadoop installation directory.
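To add a practical note, the way Hadoop picks these files up can be shown with a small Java sketch (a minimal illustration, not tied to any particular cluster): creating a Configuration object loads core-site.xml from the classpath automatically, and individual properties can then be read back.

import org.apache.hadoop.conf.Configuration;

public class ConfigProbe {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml found on the classpath
        Configuration conf = new Configuration();
        // hdfs-site.xml, mapred-site.xml and yarn-site.xml are loaded in the same way
        // by the respective daemons and client libraries
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));
    }
}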
8. How does the Hadoop CLASSPATH play a vital role
in starting or stopping Hadoop daemons?
♣ Tip: To check your knowledge on Hadoop the
interviewer may ask you this question.
CLASSPATH includes all the directories containing jar
files required to start/stop Hadoop daemons. The CLASSPATH is set inside /etc/hadoop/hadoop-env.sh file.
9. What is a spill factor with respect to the
RAM?
♣ Tip: This is a theoretical question, but if you
add a practical taste to it, you might get a preference.
The map output is stored in an in-memory buffer; when
this buffer is almost full, then spilling phase starts in order to move data to
a temp folder.
Map output is first written to the buffer, whose size is
decided by the mapreduce.task.io.sort.mb property. By default, it
is 100 MB.
When the buffer reaches a certain threshold, it starts
spilling the buffer data to disk. This threshold is specified by the
mapreduce.map.sort.spill.percent property.
10. What is the command to extract a compressed
file in tar.gz format?
This is an easy question: the tar -xvzf
/file_location/filename.tar.gz command will extract the tar.gz
compressed file.
11. How will you check whether Java and Hadoop are
installed on your system?
By using the following commands, we can check whether Java
and Hadoop are installed and whether their paths are set inside the .bashrc file:
For checking Java – java -version
For checking Hadoop – hadoop version
12. What is the default replication factor
and how will you change it?
The default replication factor is 3.
♣ Tip: Default Replication Factor could be changed in three ways. Answering all the three
ways will show your expertise.
- By adding this property to hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>5</value>
  <description>Block Replication</description>
</property>
- Or you can change the replication factor on a per-file
basis using the following command:
hadoop fs -setrep -w 3 /file_location
- Or you can change the replication factor for all the files
in a directory using the following command:
hadoop fs -setrep -R -w 3 /directory_location
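For completeness, the same per-file replication change can also be made programmatically through the HDFS Java API; the sketch below is a minimal illustration and the path /file_location is just a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Equivalent of "hadoop fs -setrep 3 /file_location" for a single file
        // (setReplication does not wait for the re-replication to complete)
        boolean changed = fs.setReplication(new Path("/file_location"), (short) 3);
        System.out.println("Replication factor changed: " + changed);
        fs.close();
    }
}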
13. What is the full form of fsck?
The full form of fsck
is File System Check. HDFS supports the fsck (filesystem check) command to check for various
inconsistencies. It is designed for reporting the problems with the files in
HDFS, for example, missing blocks of a file or under-replicated blocks.
14. Which are the main hdfs-site.xml
properties?
The three main hdfs-site.xml properties are:
- dfs.name.dir gives you the location where the NameNode stores its
metadata (FsImage and edit logs), whether on the local disk or
on a remote directory.
- dfs.data.dir gives you the location of the DataNodes, where the
data is going to be stored.
- fs.checkpoint.dir is the directory on the file system where the
Secondary NameNode stores temporary copies of the edit logs and the FsImage
that are to be merged, serving as a checkpoint backup.
15.
What happens if you get a ‘connection refused java exception’ when you type
hadoop fsck /?
If you get a ‘connection refused java exception’ when you
type hadoop fsck, it could mean that the NameNode is not working on your VM.
16. How can we view the compressed files via
HDFS command?
We can view compressed files in HDFS using hadoop
fs -text /filename command.
17. What is the command to move into safe mode
and exit safe mode?
♣ Tip: Approach this question by first explaining safe mode and then moving on to the
commands.
Safe Mode in Hadoop is a maintenance state of the
NameNode during which NameNode doesn’t allow any changes to the file system.
During Safe Mode, HDFS cluster is read-only and doesn’t replicate or delete
blocks.
- To know the status of safe mode,
you can use the command: hdfs dfsadmin -safemode get
- To enter safe mode: hdfs
dfsadmin -safemode enter
- To exit safe mode: hdfs
dfsadmin -safemode leave
18. What does the ‘jps’ command do?
The jps command is used to check whether the Hadoop daemons,
like NameNode, DataNode, ResourceManager, NodeManager etc., are running on
the machine.
19. How can I restart Namenode?
This question has two answers, answering both will give
you a plus point. We can restart NameNode by following methods:
- You can stop the NameNode
individually using the ./sbin/hadoop-daemon.sh stop
namenode command and then start the NameNode again using ./sbin/hadoop-daemon.sh
start namenode.
- Use ./sbin/stop-all.sh and
then ./sbin/start-all.sh, which will
stop all the daemons first and then start all the daemons.
20. How can we check whether NameNode is
working or not?
To check whether NameNode is working or not, use
the jps command, this will show all the running Hadoop
daemons and there you can check whether NameNode daemon is running or not.
21. How can we look at the Namenode in the
web browser UI?
If you want to look for NameNode in the browser, the port
number for NameNode web browser UI is 50070. We can check in
web browser using http://master:50070/dfshealth.jsp.
22. What are the different commands used to
startup and shutdown Hadoop daemons?
♣ Tip: Explain all the three ways of stopping
and starting Hadoop daemons, this will show your expertise.
- ./sbin/start-all.sh to start all the Hadoop daemons and ./sbin/stop-all.sh to
stop all the Hadoop daemons.
- Then you can start all the DFS
daemons together using ./sbin/start-dfs.sh, the
YARN daemons together using ./sbin/start-yarn.sh and the
MR Job History Server using ./sbin/mr-jobhistory-daemon.sh
start historyserver. To stop these daemons, similarly use ./sbin/stop-dfs.sh, ./sbin/stop-yarn.sh and ./sbin/mr-jobhistory-daemon.sh
stop historyserver.
- The last way is to start all the
daemons individually and stop them individually:
./sbin/hadoop-daemon.sh start namenode
./sbin/hadoop-daemon.sh start datanode
./sbin/yarn-daemon.sh start resourcemanager
./sbin/yarn-daemon.sh start nodemanager
./sbin/mr-jobhistory-daemon.sh start historyserver
and stop them similarly.
23. What does the slaves file consist of?
The slaves file consists of a list of hosts, one per
line, identifying the DataNode machines on which the NodeManager servers also run.
24. What does the masters file consist of?
The masters file contains the Secondary NameNode server
location.
25.
What does hadoop-env.sh do?
hadoop-env.sh provides the environment
for Hadoop to run. For example, JAVA_HOME, CLASSPATH etc. are set over here.
26.
Where is hadoop-env.sh file present?
As we discussed earlier, all the configuration
files reside in the /etc/hadoop directory, and the hadoop-env.sh file is present there as well.
27.
In HADOOP_PID_DIR, what does PID stand for? What does it do?
PID stands for ‘Process ID’. This directory stores the
process IDs of the Hadoop daemons (servers) that are running.
28.
What does hadoop-metrics.properties file do?
♣ Tip: As this file
is configured manually only in special cases, so answering this question will
impress the interviewer indicating your expertise about configuration files.
hadoop-metrics.properties is
used for ‘Performance Reporting‘ purposes. It controls
the reporting for Hadoop. The API is abstract so that it can be implemented on
top of a variety of metrics client libraries. The choice of client library is a
configuration option, and different modules within the same application can use
different metrics implementation libraries. This file is stored inside /etc/hadoop.
29.
What are the network requirements for Hadoop?
You should answer this question as follows: the Hadoop core uses
Secure Shell (SSH) for communication with slaves and to launch the server processes on the
slave nodes. It requires a password-less SSH
connection between the master, all the slaves and the secondary machines, so
that it does not have to ask for authentication every time, as master and slaves
require rigorous communication.
30.
Why do we need a password-less SSH in Fully Distributed environment?
We need a password-less SSH
in a Fully-Distributed environment because when the cluster is live and running
in Fully Distributed environment, the communication is too frequent. The
DataNode and the NodeManager should be able to send messages quickly to master
server.
31.
Does this lead to security issues?
No, not at all. Hadoop cluster is an isolated cluster and
generally, it has nothing to do with the internet. It has a different kind of a
configuration. We needn’t worry about that kind of a security breach, for
instance, someone hacking through the internet, and so on. Hadoop has a very
secured way to connect to other machines to fetch and to process data.
32.
On which port does SSH work?
SSH works on Port No. 22, though it can be
configured. 22 is
the default Port number.
33.
Can you tell us more about SSH?
SSH is nothing but a secure shell communication, it is a
kind of a protocol that works on a Port No. 22, and when you do an SSH, what
you really require is a password, to connect to the other machine. SSH is not
only between masters and slaves, but can be between two hosts.
34.
What happens to a NameNode, when ResourceManager is down?
When a ResourceManager is down, it will not be functional
(for submitting jobs) but NameNode will be present. So, the cluster is
accessible if NameNode is working, even if the ResourceManager is not working.
35.
How can we format HDFS?
♣ Tip: Attempt this
question by starting with the command to format the HDFS and then explain what
this command does.
Hadoop distributed file system(HDFS) can be formatted
using bin/hadoop namenode -format command.
This command formats the HDFS via NameNode. This command is only executed for the first time. Formatting the
file system means initializing the directory specified by the
dfs.name.dir variable. If you run this command on existing
filesystem, you will lose all your data stored on your NameNode. Formatting a
Namenode will not format the DataNode. It will format the FsImage and edit logs
data stored on the NameNode and will lose the data about the location of blocks
stored in HDFS.
Never format an up-and-running Hadoop file system; you will
lose all the data stored in HDFS.
36.
Can we use Windows for Hadoop?
Red Hat Linux and Ubuntu are the best
Operating Systems for Hadoop. Windows is not used frequently for installing
Hadoop as there are many support problems attached with Windows. Thus, Windows
is not a preferred environment for Hadoop.
Hadoop MapReduce Interview Questions
MapReduce is a programming framework that allows us to
perform distributed and parallel processing on large data sets in a distributed
environment.
- MapReduce 2.0 or YARN
Architecture:
- The MapReduce framework also follows
a Master/Slave topology, where the master node (ResourceManager) manages
and tracks the various MapReduce jobs being executed on the slave nodes (Node
Managers).
- Resource Manager consists of two
main components:
- Application Manager: It accepts job-submissions, negotiates the
container for ApplicationMaster and handles failures while executing
MapReduce jobs.
- Scheduler: Scheduler allocates resources that are required
by various MapReduce applications running on the Hadoop cluster.
- How MapReduce job works:
- As the name MapReduce suggests,
reducer phase takes place after mapper phase has been completed.
- So, the first is the map job,
where a block of data is read and processed to produce key-value pairs as
intermediate outputs.
- The reducer receives the
key-value pair from multiple map jobs.
- Then, the reducer aggregates
those intermediate data tuples (intermediate key-value pair) into a
smaller set of tuples or key-value pairs which is the final output.
“A mind troubled by doubt cannot focus on the
course to victory.”
–
Arthur Golden
1. What do you mean by data locality?
Data locality means moving the computation unit to the data
rather than moving the data to the computation unit. The MapReduce framework achieves data locality
by processing data locally, i.e. the data is processed by the NodeManager
on the very node where its blocks are present.
2. Is it mandatory to set input and
output type/format in MapReduce?
No, it is not mandatory to set the input and output
type/format in MapReduce. By default, the cluster takes the input and the
output type as ‘text’.
3. Can we rename the output file?
Yes,
we can rename the output file by using the multiple outputs facility
(the MultipleOutputs class), which lets a job write output files with custom names.
4. What do you mean by shuffling and sorting
in MapReduce?
Shuffling and sorting take place after the completion of
the map task, where the input to
every reducer is sorted according to the keys. Basically, the process by which
the system sorts the key-value output of the map tasks and transfers it to the
reducer is called the shuffle.
5. Explain the process of spilling in
MapReduce?
The output of a map task is written into a circular
memory buffer (RAM). The default size of the buffer is set to 100 MB, which
can be tuned by using the mapreduce.task.io.sort.mb property. Now, spilling is the
process of copying the data from the memory buffer to disk when the content of the
buffer reaches a certain threshold size. By default, a background thread starts
spilling the contents from memory to disk after 80% of the buffer size is
filled. Therefore, for a 100 MB buffer, spilling will start after the
content of the buffer reaches a size of 80 MB.
Note: One can change this spilling threshold using
mapreduce.map.sort.spill.percent which is set to 0.8 or 80% by default.
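To add a practical taste, both properties can be tuned per job. The following is a minimal, hedged Java sketch of a driver raising the sort buffer and the spill threshold; the values shown are illustrative, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Grow the in-memory sort buffer from the 100 MB default to 256 MB
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Start spilling at 90% of the buffer instead of the default 80%
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        Job job = Job.getInstance(conf, "spill-tuning-example");
        // ... set mapper, reducer, input and output as usual before submitting
    }
}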
6. What is a distributed cache in MapReduce
Framework?
Distributed Cache can be explained as a facility
provided by the MapReduce framework to cache files needed by applications. Once
you have cached a file for your job, the Hadoop framework will make it available on
each and every node where your map/reduce tasks are running. Therefore,
one can access the cache file as a local file in your Mapper or Reducer job.
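A rough sketch of how this is wired up with the newer MapReduce API is shown below; the cached file path and the idea of loading a lookup file in setup() are assumptions made purely for illustration.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The cached file is available locally on every node running this task
            URI[] cacheFiles = context.getCacheFiles();
            // open cacheFiles[0] (symlinked as "lookup.txt") and load it into memory here
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // "#lookup.txt" creates a symlink with that name in the task's working directory
        job.addCacheFile(new URI("/apps/shared/lookup.txt#lookup.txt"));
        job.setMapperClass(LookupMapper.class);
        // ... remaining job configuration (input/output paths, reducer, etc.)
    }
}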
7. What is a combiner and where you should
use it?
Combiner is like a mini reducer function that allows us
to perform a local aggregation of map output before it is transferred to
reducer phase. Basically, it is used to optimize the network bandwidth usage
during a MapReduce task by cutting down the amount of data that is transferred
from a mapper to the reducer.
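A minimal word-count style sketch shows where a combiner plugs in; because a sum is associative and commutative, the same reducer class can safely double as the combiner (the class names here are illustrative).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerExample {
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setReducerClass(IntSumReducer.class);
        // Local, map-side aggregation: far less data crosses the network to the reducers
        job.setCombinerClass(IntSumReducer.class);
        // ... mapper, formats and paths as usual
    }
}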
8. Why is the output of map tasks stored (spilled)
to the local disk and not in HDFS?
The output of a map task is the intermediate key-value
pairs, which are then processed by the reducer to produce the final aggregated
result. Once a MapReduce job is completed, there is no need for the intermediate
output produced by the map tasks. Therefore, storing this intermediate output in
HDFS and replicating it would create unnecessary overhead.
9. What happens when the node running the map
task fails before the map output has been sent to the reducer?
In this case, the map task will be assigned to a new node and
the whole task will be run again to re-create the map output.
10. Define Speculative Execution
If a node appears to be executing a task slower than
expected, the master node can redundantly execute another instance of the same
task on another node. Then, the task which finishes first will be accepted,
whereas the other task will be killed. This process is called speculative
execution.
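Speculative execution is controlled per job through two boolean properties (the Hadoop 2 names are used below); a small hedged sketch of turning it off, for example for tasks with non-idempotent side effects:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable speculative execution for map and reduce tasks (both default to true)
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "no-speculation-example");
        // ... rest of the job setup
    }
}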
11. What is the role of a MapReduce
Partitioner?
A partitioner divides the intermediate key-value pairs
produced by the map tasks into partitions. The total number of partitions is equal to
the number of reducers, and each partition is processed by the corresponding
reducer. The partitioning is done using a hash function based on a single key
or a group of keys. The default partitioner available in Hadoop is the HashPartitioner.
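A hedged sketch of a custom partitioner is given below; the first-letter keying scheme is invented purely for illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Keys starting with the same letter land in the same partition, hence the same reducer
        String k = key.toString();
        char first = k.isEmpty() ? '_' : Character.toLowerCase(k.charAt(0));
        return (first & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setPartitionerClass(FirstLetterPartitioner.class);
        job.setNumReduceTasks(4);  // the number of partitions equals the number of reducers
        // ... remaining job configuration
    }
}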
12. How can we ensure that the values
for a particular key go to the same reducer?
By using a custom partitioner, we can control which reducer a
particular key (and all of its values) goes to for processing.
13. What is the difference between Input
Split and HDFS block?
HDFS block defines how the data is physically divided in
HDFS whereas input split defines the logical boundary of the records required
for processing it.
14. What do you mean by InputFormat?
InputFormat describes the input specification for a
MapReduce job. The MapReduce framework relies on the InputFormat of the job to:
- Validate the input-specification
of the job.
- Split-up the input file(s) into
logical InputSplit instances, each of which is then assigned to an
individual Mapper.
- Provide the RecordReader
implementation used to read records from the logical InputSplit for
processing by the Mapper.
15. What is the purpose of TextInputFormat?
TextInputFormat is the default input format in
the MapReduce framework. In TextInputFormat, each line of an input file becomes a record,
with keys of type LongWritable (the byte offset of the beginning of the line in the file) and
values of type Text (the content of the line).
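This is why a typical mapper written against TextInputFormat declares LongWritable/Text as its input types; a minimal sketch (the emitted key-value pair is purely illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key   = byte offset of the line in the file (LongWritable)
// Input value = content of the line (Text)
public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit each line together with its length in bytes
        context.write(line, new IntWritable(line.getLength()));
    }
}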
16. What is the role of RecordReader in
Hadoop MapReduce?
InputSplit defines a slice of work, but does not describe
how to access it. The “RecordReader” class loads the data from its source and
converts it into (key, value) pairs suitable for reading by the “Mapper” task.
The “RecordReader” instance is defined by the “Input Format”.
17. What are the various configuration
parameters required to run a MapReduce job?
The main configuration parameters which users need to specify in
the “MapReduce” framework are listed below (a minimal driver sketch follows the list):
- Job’s input locations in the
distributed file system
- Job’s output location in the
distributed file system
- Input format of data
- Output format of data
- Class containing the map function
- Class containing the reduce
function
- JAR file containing the mapper,
reducer and driver classes
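A minimal driver sketch that wires all of these parameters together is shown below; it uses the identity Mapper and Reducer as stand-ins so that every class referenced ships with Hadoop, and args[0]/args[1] are assumed to be the input and output paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MinimalDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "minimal-driver");
        job.setJarByClass(MinimalDriver.class);            // JAR containing mapper, reducer and driver

        job.setMapperClass(Mapper.class);                  // class containing the map function (identity here)
        job.setReducerClass(Reducer.class);                // class containing the reduce function (identity here)

        job.setInputFormatClass(TextInputFormat.class);    // input format of data
        job.setOutputFormatClass(TextOutputFormat.class);  // output format of data
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // job's input location in the DFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // job's output location in the DFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}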
18. When should you use
SequenceFileInputFormat?
SequenceFileInputFormat is an input format for reading
sequence files. A sequence file is a specific compressed binary file format which is
optimized for passing data between the output of one “MapReduce” job and
the input of some other “MapReduce” job.
Sequence files can be generated as the output of other
MapReduce tasks and are an efficient intermediate representation for data that
is passing from one MapReduce job to another.
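A hedged Java sketch of the idea, chaining two jobs through an intermediate sequence file (the path /tmp/intermediate is just a placeholder, and the mapper/reducer setup is omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceFileChain {
    public static void main(String[] args) throws Exception {
        // First job writes its output as a compact binary sequence file
        Job first = Job.getInstance();
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        first.setOutputKeyClass(Text.class);
        first.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(first, new Path("/tmp/intermediate"));

        // Second job consumes that sequence file directly as its input
        Job second = Job.getInstance();
        second.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(second, new Path("/tmp/intermediate"));
        // ... mappers, reducers and the final output path as usual
    }
}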
19. What is an identity Mapper and Identity
Reducer?
The Identity Mapper is the default mapper provided by the
Hadoop framework. It runs when no mapper class has been defined in the
MapReduce program; it simply passes the input key-value pairs on to the
reducer phase.
Like the Identity Mapper, the Identity Reducer is the
default reducer class provided by Hadoop, which is automatically executed
if no reducer class has been defined. It also performs no computation or
processing; it simply writes the input key-value pairs into the
specified output directory.
20. What is a map side join?
Map side join is a process where two data sets are joined
by the mapper.
21. What are the advantages of using map side
join in MapReduce?
The advantages of using map side join in MapReduce are as
follows:
- Map-side join helps in minimizing
the cost that is incurred for sorting and merging in the shuffle and
reduce stages.
- Map-side join also helps in
improving the performance of the task by decreasing the time to finish the
task.
22. What is reduce side join in MapReduce?
As the name suggests, in the reduce side join, the
reducer is responsible for performing the join operation. It is comparatively
simple and easier to implement than the map side join as the sorting and
shuffling phase sends the values having identical keys to the same reducer and
therefore, by default, the data is organized for us.
♣Tip: I would suggest you to go through a dedicated blog
on reduce side join in MapReduce where the whole process of
reduce side join is explained in detail with an example.
23. What do you know about NLineInputFormat?
NLineInputFormat splits ‘n’ lines of input as one split.
24. Is it legal to set the number of reducer
task to zero? Where the output will be stored in this case?
Yes, it is legal to set the number of reduce tasks
to zero if there is no need for a reducer. In this case, the
output of the map tasks is stored directly in HDFS, in the location
specified via setOutputPath(Path).
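A minimal map-only sketch, using the identity mapper as a stand-in and args[0]/args[1] as placeholder input and output paths:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(Mapper.class);  // identity mapper as a stand-in
        job.setNumReduceTasks(0);          // zero reducers: no shuffle, no reduce phase
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // With zero reducers, the map output lands directly at this HDFS path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}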
25. Is it necessary to write a MapReduce job
in Java?
No, MapReduce framework supports multiple languages
like Python, Ruby etc.
26. How do you stop a running job gracefully?
One
can gracefully stop a MapReduce job by using the command: hadoop job -kill
JOBID
27. How will you submit extra files or data (
like jars, static files, etc. ) for a MapReduce job during runtime?
The distributed cache is used to distribute large
read-only files that are needed by map/reduce jobs to the cluster. The
framework will copy the necessary files from a URL on to the slave node before
any tasks for the job are executed on that node. The files are only copied once
per job and so should not be modified by the application.
28. How does InputSplit in MapReduce
determine the record boundaries correctly?
RecordReader is responsible for providing the information
regarding record boundaries in an input split.
29. How do reducers communicate with each
other?
This is a tricky question. The “MapReduce” programming
model does not allow “reducers” to communicate with each other. “Reducers” run
in isolation.
I hope you find this blog on Hadoop MapReduce Interview
Questions to be informative and helpful. You are welcome to mention your
doubts and feedback in the comment section given below. In this blog, I have
covered the interview questions for MapReduce only. To save your time in
visiting several sites for interview questions related to each Hadoop
component, we have prepared a series of interview question blogs that covers
all the components present in Hadoop framework. Kindly, refer to the links
given below to explore all the Hadoop related interview question and strengthen
your fundamentals:
Apache Pig Interview Questions
Looking out for Apache Pig Interview Questions that
are frequently asked by employers? Here is the fifth blog of Hadoop
Interview Questions series, which covers Apache PIG interview questions. I hope
you must not have missed the earlier blogs of our Hadoop Interview Question series.
After going through the Pig interview questions, you will
get an in-depth knowledge of questions that are frequently asked by employers
in Hadoop interviews.
In case you have attended Pig interviews previously, we
encourage you to add your questions in the comments tab. We will be happy to
answer them, and spread the word to the community of fellow job seekers.
Important points to remember about Apache
Pig:
♦ Apache Pig is a platform, used to analyze
large data sets representing them as data flows. It is designed to provide an
abstraction over MapReduce, reducing the complexities of writing a MapReduce
task using Java programming. We can perform data manipulation operations very
easily in Hadoop using Apache Pig.
♦ Apache Pig has two main components – the Pig Latin language and the Pig
Run-time Environment, in which Pig Latin programs are executed.
♦Apache Pig follows ETL (Extract
Transform Load) process. It can handle inconsistent schema (in case of
unstructured data).
♦ Apache Pig automatically optimizes the tasks
before execution, i.e. automatic optimization. Apache Pig handles all kinds of data.
♦ Pig allows programmers to write custom
functions for functionality that is unavailable among Pig's built-in operators. User Defined Functions (UDFs) can be
written in different languages like Java, Python, Ruby, etc. and embedded in the
Pig script.
♦ Pig Latin provides various built-in
operators like join, sort, filter, etc. to read, write, and process large data
sets.
♣ Tip: Before going through this Apache Pig
interview questions, I would suggest you to go through Apache Pig Tutorial to revise your Pig concepts.
Now moving on, let us look at the Apache Pig interview
questions.
1. Highlight the key differences between
MapReduce and Apache Pig.
♣ Tip: In this question, you should explain
what were the problems with MapReduce which led to the development of Apache
Pig by Yahoo.
The following are the key differences between Apache Pig
and MapReduce due to which Apache Pig came into picture:
- Apache Pig is a high-level data
flow platform, whereas MapReduce is a low-level data processing paradigm.
- Without writing complex Java
implementations in MapReduce, programmers can achieve the same
implementations very easily using Pig Latin.
- Apache Pig provides nested data
types like tuples, bags, and maps that are missing from MapReduce.
- Pig provides many built-in
operators to support data operations like joins, filters, ordering,
sorting etc. Whereas to perform the same function in MapReduce is a
humongous task.
2. What are the use cases of Apache Pig?
Apache Pig is used for analyzing and performing tasks
involving ad-hoc processing. Apache Pig is used for:
- Research on large raw data sets
like data processing for search platforms. For example, Yahoo uses Apache
Pig to analyse data gathered from Yahoo search engines and Yahoo News Feeds.
- Processing huge data sets like Web
logs, streaming online data, etc.
- In customer behavior prediction
models like e-commerce websites.
3. What is the difference between logical and
physical plans?
♣ Tip: Approach this question by explaining
when the logical and physical plans are created.
Pig undergoes some steps when a Pig Latin Script is
converted into MapReduce jobs by the compiler. Logical and Physical plans are
created during the execution of a pig script.
After performing the basic parsing and semantic checking,
the parser produces a logical plan and no data processing takes place during
the creation of a logical plan. The logical plan describes the logical
operators that have to be executed by Pig during execution. For each line in
the Pig script, syntax check is performed for operators and a logical plan is
created. If an error is encountered, an exception is thrown and the program
execution ends.
A logical plan contains a collection of operators in the
script, but does not contain the edges between the operators.
After the logical plan is generated, the script execution
moves to the physical plan where there is a description about the physical
operators, Apache Pig will use, to execute the Pig script. A physical plan is
like a series of MapReduce jobs, but the physical plan does not have any
reference on how it will be executed in MapReduce.
4. How Pig programming gets converted into
MapReduce jobs?
Pig is a high-level platform that makes many Hadoop data
analysis issues easier to execute. Pig Latin is a data
flow language, and a program written in it needs an execution engine to execute the query. So, when a
program is written in Pig Latin, the Pig compiler converts the program into
MapReduce jobs.
5. What are the components of Pig
Execution Environment?
The components of Apache Pig Execution Environment are:
- Pig Scripts: Pig scripts are submitted to the Apache Pig execution
environment which can be written in Pig Latin using built-in operators and
UDFs can be embedded in it.
- Parser: The Parser does the type checking and checks the
syntax of the script. The parser outputs a DAG (directed acyclic graph).
DAG represents the Pig Latin statements and logical operators.
- Optimizer: The Optimizer performs the optimization
activities like split, merge, transform, reorder operators, etc. The
optimizer provides the automatic optimization feature to Apache Pig. The
optimizer basically aims to reduce the amount of data in the pipeline.
- Compiler: The Apache Pig compiler converts the optimized code
into MapReduce jobs automatically.
- Execution Engine: Finally, the MapReduce jobs are submitted to the
execution engine. Then, the MapReduce jobs are executed and the required
result is produced.
6. What are the different ways of executing
Pig script?
There are three ways to execute the Pig script:
- Grunt Shell: This is Pig’s interactive shell provided to execute
all Pig Scripts.
- Script File: Write all the Pig commands in a script file and
execute the Pig script file. This is executed by the Pig Server.
- Embedded Script: If some functions are unavailable in built-in
operators, we can programmatically create User Defined Functions (UDF) to
bring that functionality using other languages like Java, Python, Ruby,
etc. and embed it in the Pig Latin Script file. Then, execute that script
file.
7. What are the data types of Pig Latin?
Pig Latin can handle both atomic data types like int,
float, long, double etc. and complex data types like tuple, bag and map.
Atomic or scalar data types are the basic data types
which are used in all the languages like string, int, float, long, double,
char[], byte[]. These are also called the primitive data types.
The complex data types supported by Pig Latin are:
- Tuple: Tuple is an ordered set of fields which may contain
different data types for each field.
- Bag: A bag is a collection of a set of tuples and these
tuples are a subset of rows or entire rows of a table.
- Map: A map is a set of key-value pairs used to represent data
elements. The key must be of type chararray and should be unique, like a column
name, so it can be indexed and the value associated with it can be accessed on
the basis of the key. The value can be of any data type.
♣ Tip: Complex Data Types of Pig Latin are
very important to understand, so you can go through Apache Pig Tutorial blog and understand them in-depth.
8. What is a bag in Pig Latin?
A bag is one of the data models present in Pig. It is an
unordered collection of tuples with possible duplicates. Bags are used to store
collections of tuples while grouping. The size of bag is the size of the local
disk, this means that the size of the bag is limited. When the bag is full,
then Pig will spill this bag into local disk and keep only some parts of the
bag in memory. There is no necessity that the complete bag should fit into
memory. We represent bags with “{}”.
♣ Tip:You can also explain the two types of
bag in Pig Latin i.e. outer bag and inner bag, which may impress your
employers.
9. What do you understand by an inner bag and
outer bag in Pig?
Outer bag or relation is nothing but a bag of tuples.
Here relations are similar as relations in relational databases. For example:
{(Linkin Park, California), (Metallica, Los Angeles),
(Mega Death, Los Angeles)}
An inner bag contains a bag inside a tuple. For Example:
(Los Angeles, {(Metallica, Los Angeles), (Mega
Death, Los Angeles)})
(California, {(Linkin Park, California)})
10. How Apache Pig deals with the schema and
schema-less data?
♣ Tip: Apache Pig deals with both schema and
schema-less data. Thus, this is an important question to focus on.
The Apache Pig handles both, schema as well as
schema-less data.
- If the schema only includes the
field name, the data type of field is considered as a byte array.
- If you assign a name to the field
you can access the field by both, the field name and the positional
notation, whereas if field name is missing we can only access it by the
positional notation i.e. $ followed by the index number.
- If you perform any operation which
is a combination of relations (like JOIN, COGROUP, etc.) and if any of the
relation is missing schema, the resulting relation will have null schema.
- If the schema is null, Pig will
consider it as a byte array and the real data type of field will be
determined dynamically.
11. How do users interact with the shell in
Apache Pig?
Using Grunt i.e. Apache Pig’s interactive shell, users
can interact with HDFS or the local file system.
To start Grunt, users should use the pig -x
local command. This command will open the Grunt shell. To exit
from the Grunt shell, press CTRL+D or just type exit.
12. What is UDF?
If some functions are unavailable in built-in
operators, we can programmatically create User Defined Functions (UDF) to bring
that functionality using other languages like Java, Python, Ruby, etc. and
embed it in the Pig Latin Script file.
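A minimal Java UDF sketch is shown below; the class name and the jar name mentioned afterwards are hypothetical.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF that upper-cases its single chararray argument
public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;  // propagate nulls rather than failing the task
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once packaged into a jar (say myudfs.jar), it would typically be used from a Pig script with REGISTER myudfs.jar; followed by something like B = FOREACH A GENERATE UPPER(name); where the relation and field names are, again, assumptions.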
♣ Tip: To understand how to create and work with
UDF, go through this blog – creating UDF in Apache Pig.
♣ Tip: Important points about UDF to focus on:
- LoadFunc abstract class has three
main methods for loading data and for most use cases it would suffice to
extend it.
- LoadPush has methods to push
operations from Pig runtime into loader implementations.
- setUdfContextSignature() method
will be called by Pig both in the front end and back end to pass a unique
signature to the Loader.
- The load/store UDFs control how
data goes into Pig and comes out of Pig.
- getNext() is called
by the Pig runtime to get the next tuple in the data.
- The loader should use the
setLocation() method to communicate the load information to the underlying
InputFormat.
- Through the prepareToRead() method, the
RecordReader associated with the InputFormat provided by the LoadFunc is
passed to the LoadFunc. The RecordReader can then be used by the
implementation in getNext() to return a tuple representing a record of
data back to Pig.
- pushProjection() method tells
LoadFunc which fields are required in the Pig script. Pig will use the
column index requiredField.index to communicate with the LoadFunc about
the fields required by the Pig script.
- LoadCaster has methods to convert
byte arrays to specific types.
- A loader implementation should
implement LoadCaster if casts (implicit or explicit) from DataByteArray
fields to other types need to be supported.
13. List the diagnostic operators in Pig.
Pig supports a number of diagnostic operators that you
can use to debug Pig scripts.
- DUMP: Displays the contents of a relation to the screen.
- DESCRIBE: Return the schema of a relation.
- EXPLAIN: Display the logical, physical, and MapReduce
execution plans.
- ILLUSTRATE: Gives the step-by-step execution of a sequence of
statements.
♣ Tip: Go through this blog on diagnostic operators, to understand them and see their
implementations.
14. Does ‘ILLUSTRATE’ run a MapReduce job?
No, illustrate will not pull any MapReduce, it will pull
the internal data. On the console, illustrate will not do any job. It just
shows the output of each stage and not the final output.
ILLUSTRATE operator is used to review how data is
transformed through a sequence of Pig Latin statements. ILLUSTRATE command is
your best friend when it comes to debugging a script. This command alone might
be a good reason for choosing Pig over something else.
Syntax: illustrate relation_name;
15. What does illustrate do in Apache
Pig?
Executing Pig scripts on large data sets, usually takes a
long time. To tackle this, developers run Pig scripts on sample data, but there
is possibility that the sample data selected, might not execute your Pig script
properly. For instance, if the script has a join operator there should be at
least a few records in the sample data that have the same key, otherwise the
join operation will not return any results.
To tackle these kind of issues, illustrate is used.
Illustrate takes a sample of the data and whenever it comes across operators
like join or filter that remove data, it ensures that only some records pass
through and some do not, by making modifications to the records such that they
meet the condition. Illustrate just shows the output of each stage but does not
run any MapReduce task.
16. List the relational operators in Pig.
All Pig Latin statements operate on relations (and
operators are called relational operators). Different relational operators in
Pig Latin are:
- COGROUP: Joins two or more tables and then perform GROUP
operation on the joined table result.
- CROSS: CROSS operator is used to compute the cross product
(Cartesian product) of two or more relations.
- DISTINCT: Removes duplicate tuples in a relation.
- FILTER: Select a set of tuples from a relation based on a
condition.
- FOREACH: Iterate the tuples of a relation, generating a data
transformation.
- GROUP: Group the data in one or more relations.
- JOIN: Join two or more relations (inner or outer join).
- LIMIT: Limit the number of output tuples.
- LOAD: Load data from the file system.
- ORDER: Sort a relation based on one or more fields.
- SPLIT: Partition a relation into two or more relations.
- STORE: Store data in the file system.
- UNION: Merge the content of two relations. To perform a
UNION operation on two relations, their columns and domains must be
identical.
♣ Tip: Go through this blog on relational operators, to understand them and see their implementations.
17. Is the keyword ‘DEFINE’ like a function
name?
Yes, the keyword ‘DEFINE’ is like a function name.
DEFINE statement is used to assign a name (alias) to a
UDF function or to a streaming command.
- The function has a long package
name that you don’t want to include in a script, especially if you call
the function several times in that script. The constructor for the
function takes string parameters. If you need to use different constructor
parameters for different calls to the function you will need to create
multiple defines – one for each parameter set.
- The streaming command
specification is complex. The streaming command specification requires
additional parameters (input, output, and so on). So, assigning an alias
makes it easier to access.
18. What is the function of co-group in Pig?
COGROUP takes members of different relations, binds them
by similar fields, and creates a bag that contains a single instance of both
relations where those relations have common fields. Co-group operation joins
the data set by grouping one particular data set only.
It groups the elements by their common field and then
returns a set of records containing two separate bags. The first bag consists
of the first data set record with the common data set and the second bag consists
of the second data set records with the common data set.
19. Can we say co-group is a group of more
than 1 data set?
Co-group is a group of data sets. More than one data set,
co-group will group all the data sets and join them based on the common field.
Hence, we can say that co-group is a group of more than one data set and join
of that data set as well.
20. The difference between GROUP and COGROUP operators in
Pig?
Group and Cogroup operators are identical. For
readability, GROUP is used in statements involving one relation and COGROUP is
used in statements involving two or more relations. Group operator collects all
records with the same key. Cogroup is a combination of group and join, it is a
generalization of a group instead of collecting records of one input depends on
a key, it collects records of n inputs based on a key. At a time, we can
Cogroup up to 127 relations.
21. You have a file personal_data.txt in the
HDFS directory with 100 records. You want to see only the first 5 records from
the personal_data.txt file. How will you do this?
For getting only 5 records from 100 records we use limit
operator.
First load the data in Pig:
personal_data = LOAD '/personal_data.txt' USING
PigStorage(',') AS (parameter1, Parameter2, …);
Then Limit the data to 5 records:
limit_data = LIMIT personal_data 5;
22. What is a MapFile?
MapFile is a class which serves file-based map from keys
to values.
A map is a directory containing two files, the data file,
containing all keys and values in the map, and a smaller index file, containing
a fraction of the keys. The fraction is determined by MapFile.Writer.getIndexInterval().
The index file is read entirely into memory. Thus, key
implementations should try to keep themselves small. Map files are created by
adding entries in-order.
23. What is BloomMapFile used for?
The BloomMapFile is a class that extends MapFile, so its
functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to
provide a quick membership test for the keys. It is used in the HBase table format.
24. What are the different execution modes
available in Pig?
The execution modes in Apache Pig are:
- MapReduce Mode: This is the default mode, which requires access to a
Hadoop cluster and HDFS installation. Since, this is a default mode, it is
not necessary to specify -x flag (you can execute pig OR pig -x
mapreduce). The input and output in this mode are present on HDFS.
- Local Mode: With access to a single machine, all files are
installed and run using a local host and file system. Here the local mode
is specified using ‘-x flag’ (pig -x local). The input and output in this
mode are present on local file system.
25. Is Pig script case sensitive?
♣ Tip: Explain the both aspects of Apache Pig
i.e. case-sensitive as well as case-insensitive aspect.
Pig script is both case sensitive and case insensitive.
User defined functions, field names, and relations are
case sensitive, i.e. EMPLOYEE is not the same as employee, and M=LOAD ‘data’ is not the
same as M=LOAD ‘Data’.
Whereas Pig Latin keywords are case insensitive, i.e.
LOAD is the same as load.
So it is difficult to call Apache Pig strictly case
sensitive or case insensitive: identifiers (UDFs, relations and field names) are case
sensitive, while keywords are not.
26. What does Flatten do in Pig?
Sometimes there is data in a tuple or a bag and if we
want to remove the level of nesting from that data, then Flatten modifier in
Pig can be used. Flatten un-nests bags and tuples. For tuples, the Flatten
operator will substitute the fields of a tuple in place of a tuple, whereas
un-nesting bags is a little complex because it requires creating new tuples.
27. What is Pig Statistics? What are all
stats classes in the Java API package available?
Pig Statistics is a framework for collecting and storing
script-level statistics for Pig Latin. Characteristics of Pig Latin scripts and
the resulting MapReduce jobs are collected while the script is executed. These
statistics are then available for Pig users and tools using Pig (such as Oozie)
to retrieve after the job is completed.
The stats classes are in the package org.apache.pig.tools.pigstats:
- PigStats
- JobStats
- OutputStats
- InputStats.
28. What are the limitations of the Pig?
Limitations of the Apache Pig are:
- As the Pig platform is designed
for ETL-type use cases, it is not a good choice for real-time scenarios.
- Apache Pig is not a good choice
for pinpointing a single record in huge data sets.
- Apache Pig is built on top of
MapReduce, which is batch processing oriented.
Hadoop Interview Questions
– Apache Hive
Apache
Hive – A Brief Introduction:
Apache Hive is a data warehouse system built on top of
Hadoop and is used for analyzing structured and semi-structured data. It
provides a mechanism to project structure onto the data and perform
queries written in HQL (Hive Query Language) that are similar to SQL
statements. Internally, these queries or HQL gets converted to map reduce jobs
by the Hive compiler.
Apache
Hive Job Trends:
Today, many companies consider Apache Hive a de facto standard
for performing analytics on large data sets. Also, since it supports SQL-like query
statements, it is very much popular among people who are from a non –
programming background and wish to take advantage of Hadoop
MapReduce framework.
I would suggest you to go through a dedicated blog
on Apache
Hive Tutorial to revise your concepts before
proceeding in this Apache Hive Interview Questions blog.
Apache
Hive Interview Questions
Here is the comprehensive list of the most frequently
asked Apache Hive Interview Questions that have been framed after deep research
and discussion with the industry experts.
1.
What kind of applications is supported by Apache Hive?
Hive supports all those client applications that are
written in Java, PHP, Python, C++ or Ruby by exposing its Thrift server.
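For instance, a Java client can talk to HiveServer2 (which exposes the Thrift service) through the standard Hive JDBC driver. The sketch below is a minimal illustration; the host, port, database and credentials are placeholders to adjust for your cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
        // Placeholder connection string: jdbc:hive2://<host>:<port>/<database>
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}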
2.
Define the difference between Hive and HBase?
The key differences between Apache Hive and HBase are as
follows:
- Hive is a data warehousing infrastructure, whereas HBase is a
NoSQL database on top of Hadoop.
- Apache Hive queries are executed as MapReduce jobs internally, whereas HBase
operations run in real time on its database rather than as MapReduce jobs.
3. Where does the data of a Hive table gets stored?
By default, the Hive table is stored in an HDFS directory
– /user/hive/warehouse. One can change it by specifying the desired directory
in hive.metastore.warehouse.dir configuration parameter present in the
hive-site.xml.
4.
What is a metastore in Hive?
The metastore in Hive stores the metadata
information using an RDBMS and an open-source ORM (Object Relational Model) layer
called DataNucleus, which converts the object representation into a relational
schema and vice versa.
5.
Why Hive does not store metadata information in HDFS?
Hive stores metadata information in the
metastore using RDBMS instead of HDFS. The reason for choosing RDBMS is to
achieve low latency as HDFS read/write operations are time consuming
processes.
6.
What is the difference between local and remote metastore?
Local Metastore:
In local metastore configuration, the metastore service
runs in the same JVM in which the Hive service is running and connects to a
database running in a separate JVM, either on the same machine or on a remote
machine.
Remote Metastore:
In the remote metastore configuration, the metastore
service runs on its own separate JVM and not in the Hive service JVM.
Other processes communicate with the metastore server using Thrift Network
APIs. You can have one or more metastore servers in this case to provide more
availability.
7.
What is the default database provided by Apache Hive for metastore?
By default, Hive provides an embedded Derby database
instance backed by the local disk for the metastore. This is called the
embedded metastore configuration.
8.
Scenario:
Suppose I have installed Apache Hive on top of my Hadoop
cluster using default metastore configuration. Then, what will happen if we
have multiple clients trying to access Hive at the same time?
The default metastore configuration allows only one
Hive session to be opened at a time for accessing the metastore. Therefore, if
multiple clients try to access the metastore at the same time, they will get an
error. One has to use a standalone metastore, i.e. Local or remote metastore
configuration in Apache Hive for allowing access to multiple clients
concurrently.
Following are the steps to configure MySQL database
as the local metastore in Apache Hive:
- One should make the following changes in hive-site.xml:
- javax.jdo.option.ConnectionURL property should be set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true.
- javax.jdo.option.ConnectionDriverName property should be set to com.mysql.jdbc.Driver.
- One should also set the username and password as:
- javax.jdo.option.ConnectionUserName is set to desired
username.
- javax.jdo.option.ConnectionPassword is set to the
desired password.
- The JDBC driver JAR file for MySQL must be on the Hive’s
classpath, i.e. The jar file should be copied into the Hive’s lib
directory.
- Now, after restarting the Hive shell, it will
automatically connect to the MySQL database which is running as a
standalone metastore.
9.
What is the difference between external table and managed table?
Here is the key difference between an external table and
managed table:
- In case of managed table, If one drops a managed
table, the metadata information along with the table data is deleted from
the Hive warehouse directory.
- On the contrary, in case of an external table, Hive just
deletes the metadata information regarding the table and leaves the table
data present in HDFS untouched.
Note: I would suggest you to go through
the blog on Hive
Tutorial to learn more about Managed Table and External Table in Hive.
10.
Is it possible to change the default location of a managed table?
Yes, it is possible to change the default location of a
managed table. It can be achieved by using the clause – LOCATION
‘<hdfs_path>’.
11.
When should we use SORT BY instead of ORDER BY ?
We should use SORT BY instead of ORDER BY when we have to
sort huge datasets because SORT BY clause sorts the data using multiple
reducers whereas ORDER BY sorts all of the data together using a single
reducer. Therefore, using ORDER BY against a large number of inputs will
take a lot of time to execute.
12.
What is a partition in Hive?
Hive organizes tables into partitions for grouping
similar type of data together based on a column or partition key. Each Table
can have one or more partition keys to identify a particular
partition. Physically, a partition is nothing but a sub-directory in the
table directory.
13.
Why do we perform partitioning in Hive?
Partitioning provides granularity in a Hive table and
therefore, reduces the query latency by scanning only relevant partitioned data instead of the
whole data set.
For example, we can partition a transaction
log of an e – commerce website based on month like Jan, February, etc. So, any
analytics regarding a particular month, say Jan, will have to scan the Jan
partition (sub – directory) only instead of the whole table data.
14.
What is dynamic partitioning and when is it used?
In dynamic partitioning, the values
for the partition columns are known only at runtime, i.e. they become known during the loading
of the data into a Hive table.
One may use dynamic partition in following two cases:
- Loading data from an existing non-partitioned table to
improve the sampling and therefore, decrease the query latency.
- When one does not know all the values of the partitions
before hand and therefore, finding these partition values manually from a
huge data sets is a tedious task.
15.
Scenario:
Suppose, I create a table that contains details of all
the transactions done by the customers of year 2016: CREATE
TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country
STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
Now, after inserting 50,000 tuples in this table, I want
to know the total revenue generated for each month. But, Hive is taking too
much time in processing this query. How
will you solve this problem and list the steps that I will be taking in order
to do so?
We can solve this problem of query latency by
partitioning the table according to each month. So, for each month we will be
scanning only the partitioned data instead of whole data sets.
As we know, we can’t partition an existing
non-partitioned table directly. So, we will be taking following steps to solve
the very problem:
1. Create a partitioned table, say partitioned_transaction:
CREATE
TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING)
PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
2. Enable dynamic partitioning in Hive:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
3. Transfer the data from the non-partitioned table into the newly created partitioned table:
INSERT OVERWRITE TABLE partitioned_transaction PARTITION
(month) SELECT cust_id, amount, country, month FROM transaction_details;
Now, we can perform the query using each partition and
therefore, decrease the query time.
16.
How can you add a new partition for the month December in the above partitioned
table?
For adding a new partition in the above table
partitioned_transaction, we will issue the command give below:
ALTER TABLE partitioned_transaction ADD PARTITION
(month='Dec') LOCATION '/partitioned_transaction';
Note: I suggest you to go through
the dedicated blog on Hive Commands where all the commands present in
Apache Hive have been explained with an example.
17.
What is the default maximum dynamic partition that can be created by a
mapper/reducer? How can you change it?
By default the number of maximum partition that can be
created by a mapper or reducer is set to 100. One can change it by issuing the
following command:
SET hive.exec.max.dynamic.partitions.pernode =
<value>
Note: You
can set the total number of dynamic partitions that can be created by one
statement by using: SET hive.exec.max.dynamic.partitions = <value>
18.
Scenario:
I am inserting data into a table based on partitions
dynamically. But, I received an error – FAILED ERROR IN SEMANTIC ANALYSIS:
Dynamic partition strict mode requires at least one static partition column. How
will you remove this error?
To remove this error one has to execute following
commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Things to Remember:
- By default, hive.exec.dynamic.partition configuration
property is set to False in case you are using Hive whose version is prior
to 0.9.0.
- hive.exec.dynamic.partition.mode is set to strict by
default. Only in non – strict mode Hive allows all partitions to be
dynamic.
19.
Why do we need buckets?
There are two main reasons for performing bucketing to a
partition:
- A map
side join requires
the data belonging to a unique join key to be present in the same
partition. But what about those cases where your partition key differs
from that of join key? Therefore, in these cases you can perform a
map side join by bucketing the table using the join key.
- Bucketing makes the sampling process more efficient and
therefore, allows us to decrease the query time.
20.
How Hive distributes the rows into buckets?
Hive determines the bucket number for a row using the formula:
hash_function(bucketing_column) modulo (num_of_buckets). Here, hash_function
depends on the column data type. For the integer data type, the hash_function
is simply: hash_function(int_type_column) = value of int_type_column.
For example, with cust_id as the bucketing column and 4 buckets, a row with
cust_id = 17 goes to bucket 17 mod 4 = 1.
21.
What will happen if you have not issued the command ‘SET
hive.enforce.bucketing=true;’ before bucketing a table in Apache Hive 0.x or
1.x?
The command ‘SET hive.enforce.bucketing=true;’ allows one to have the correct
number of reducers while using the ‘CLUSTER BY’ clause for bucketing a column.
If it is not set, one may find that the number of files generated in the table
directory is not equal to the number of buckets. As an alternative, one may
also set the number of reducers equal to the number of buckets by using set
mapred.reduce.tasks = num_buckets.
22.
What is indexing and why do we need it?
One of the Hive query optimization methods is the Hive index. A Hive index is
used to speed up access to a column or set of columns in a Hive database,
because with an index the database system does not need to read all the rows
in the table to find the data that has been selected.
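A minimal sketch of creating a compact index on the country column of the transaction_details table used earlier (the index name is hypothetical, and this syntax applies to Hive versions before 3.0, where index support was removed):
CREATE INDEX idx_country ON TABLE transaction_details (country)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX idx_country ON transaction_details REBUILD;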
23.
Scenario:
Suppose, I have a CSV file – ‘sample.csv’ present in
‘/temp’ directory with the following entries:
id first_name last_name email gender ip_address
1 Hugh Jackman hughjackman@cam.ac.uk Male 136.90.241.52
2 David Lawrence dlawrence1@gmail.com Male
101.177.15.130
3 Andy Hall andyhall2@yahoo.com Female 114.123.153.64
4 Samuel Jackson samjackson231@sun.com Male 89.60.227.31
5 Emily Rose rose.emily4@surveymonkey.com Female
119.92.21.19
How will you consume this CSV file into the Hive warehouse using a built-in
SerDe?
SerDe stands for serializer/deserializer. A SerDe allows
us to convert the unstructured bytes into a record that we can process using
Hive. SerDes are implemented using Java. Hive comes with several built-in
SerDes and many other third-party SerDes are also available.
Hive provides a specific SerDe for working with CSV
files. We can use this SerDe for the sample.csv by issuing following commands:
CREATE EXTERNAL TABLE sample
(id int, first_name string,
last_name string, email string,
gender string, ip_address string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.OpenCSVSerde’
STORED AS TEXTFILE
LOCATION ‘/temp’;
Now, we can perform any query on the table ‘sample’:
SELECT first_name FROM sample WHERE gender = ‘Male’;
24.
Scenario:
Suppose, I have a lot of small CSV files present in
/input directory in HDFS and I want to create a single Hive table corresponding
to these files. The data in these files are in the format: {id, name, e-mail,
country}. Now, as we know, Hadoop performance degrades when we use lots of
small files.
So,
how will you solve this problem where we want to create a single Hive table for
lots of small files without degrading the performance of the system?
One can use the SequenceFile format which will group
these small files together to form a single sequence file. The steps that
will be followed in doing so are as follows:
- Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ STORED AS TEXTFILE;
- Load the data into temp_table:
LOAD
DATA INPATH ‘/input’ INTO TABLE temp_table;
- Create a table that will store data in SequenceFile
format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ STORED AS SEQUENCEFILE;
- Transfer the data from the temporary table into the
sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;
Hence, a single SequenceFile is generated which contains
the data present in all of the input files and therefore, the problem of
having lots of small files is finally eliminated.
Apache
HBase Interview Questions
Important
points to remember about Apache HBase:
- Apache HBase is a NoSQL column oriented database which
is used to store the sparse data sets. It runs on top of the Hadoop
distributed file system (HDFS) and it can store any kind of data.
- Clients can access HBase data through either a native
Java API, or through a Thrift or REST gateway, making it accessible from
any language.
♣ Tip: Before going through these Apache HBase interview questions, I would
suggest you go through the Apache HBase Tutorial and HBase Architecture blogs
to revise your HBase concepts.
Now moving on, let us look at the Apache HBase interview
questions.
1.
What are the key components of HBase?
The key components of HBase are Zookeeper, RegionServer
and HBase Master.
- Region
Server: A table can be divided into several regions. A
group of regions is served to the clients by a Region Server.
- HMaster: It coordinates and manages the Region Servers (similar to how the
NameNode manages DataNodes in HDFS).
- ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed
environment. It helps in maintaining server state inside the cluster by
communicating through sessions.
2.
When would you use HBase?
HBase is used in cases where we need random read and write operations, and it
can perform a large number of operations per second on large data sets. HBase
gives strong data consistency. It can handle very large tables with billions
of rows and millions of columns on top of a commodity hardware cluster.
3.
What is the use of get() method?
get() method is used to read the data from the table.
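A minimal sketch using the HBase Java client; the table name ‘users’ matches the example used later in this section, while the ‘personal’ column family and ‘name’ qualifier are assumed for illustration (imports are omitted for brevity, as in the other snippets in this post):
Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("users"))) {
    Get get = new Get(Bytes.toBytes("row1"));                        // row key to read
    get.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name")); // restrict to one column
    Result result = table.get(get);
    System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));
}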
4.
Define the difference between Hive and HBase?
Apache Hive is a data warehousing infrastructure built on
top of Hadoop. It helps in querying data stored in HDFS for analysis using Hive
Query Language (HQL), which is a SQL-like language, that gets translated into
MapReduce jobs. Hive performs batch processing on Hadoop.
Apache HBase is NoSQL key/value store which runs on top
of HDFS. Unlike Hive, HBase operations run in real-time on its database rather
than MapReduce jobs. HBase partitions the tables, and the tables are further
split into column families.
Hive and HBase are two different Hadoop based
technologies – Hive is an SQL-like engine that runs MapReduce jobs, and HBase
is a NoSQL key/value database of Hadoop. We can use them together. Hive can be
used for analytical queries while HBase for real-time querying. Data can even
be read and written from HBase to Hive and vice-versa.
5.
Explain the data model of HBase.
HBase consists of:
- Set of tables.
- Each table contains column families and rows.
- Row key acts as a Primary key in HBase.
- Any access to HBase tables uses this Primary Key.
- Each column qualifier present in HBase denotes
attributes corresponding to the object which resides in the cell.
6.
Define column families?
Column Family is a collection of columns, whereas row is
a collection of column families.
7.
Define standalone mode in HBase?
It is a default mode of HBase. In standalone mode, HBase
does not use HDFS—it uses the local filesystem instead—and it runs all HBase
daemons and a local ZooKeeper in the same JVM process.
8.
What are decorating filters?
It is useful to modify, or extend, the behavior of a filter to gain additional
control over the returned data. Filters of this type are known as decorating
filters; examples include SkipFilter and WhileMatchFilter.
9.
What is RegionServer?
A table can be divided into several regions. A group of
regions is served to the clients by a Region Server.
10.
What are the data manipulation commands of HBase?
Data Manipulation commands of HBase are:
- put – Puts a cell value at a
specified column in a specified row in a particular table.
- get – Fetches the contents of
a row or a cell.
- delete – Deletes a cell value in a
table.
- deleteall – Deletes all the cells in a
given row.
- scan – Scans and returns the table
data.
- count – Counts and returns the number
of rows in a table.
- truncate – Disables, drops, and recreates
a specified table.
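A quick HBase shell sketch of these commands, assuming a hypothetical ‘users’ table with a ‘personal’ column family:
put 'users', 'row1', 'personal:name', 'John'
get 'users', 'row1'
scan 'users'
count 'users'
delete 'users', 'row1', 'personal:name'
deleteall 'users', 'row1'
truncate 'users'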
11.
Which code is used to open a connection in HBase?
The following code is used to open an HBase connection; here, users is my
HBase table:
Configuration myConf = HBaseConfiguration.create();
HTable table = new HTable(myConf, "users");
12.
What is the use of truncate command?
It is used to disable, drop and recreate the specified
tables.
♣ Tip: To delete a table, first disable it, then delete it.
13.
Explain deletion in HBase? Mention what are the three types of tombstone
markers in HBase?
When you delete a cell in HBase, the data is not actually deleted; instead, a
tombstone marker is set, making the deleted cells invisible. Deleted cells are
actually removed during major compaction.
There are three types of tombstone markers:
- Version delete marker: marks a single version of a column for deletion.
- Column delete marker: marks all the versions of a column for deletion.
- Family delete marker: marks all the columns of a column family for deletion.
14.
Explain how HBase actually deletes a row.
In HBase, whatever you write is first stored in memory (the MemStore) and then
flushed to disk, and these on-disk files are immutable apart from compaction.
A normal delete simply writes a tombstone marker; the marked data is physically
removed during compaction, and it is the major compaction process that removes
the tombstone markers, while minor compactions do not.
Also, if you delete data and then add more data with an earlier timestamp than
the tombstone’s timestamp, subsequent get() calls may be masked by the
delete/tombstone marker, and hence you will not receive the inserted value
until after the major compaction.
15.
Explain what happens if you alter the block size of a column family on an
already occupied database?
When you alter the block size of a column family, the new data takes the new
block size while the old data remains within the old block size. During
compaction, the old data will take on the new block size. New files, as they
are flushed, have the new block size, whereas existing data will continue to
be read correctly. After the next major compaction, all data should have been
rewritten to the new block size.
16.
HBase blocksize is configured on which level?
The blocksize is configured per column family and the
default value is 64 KB. This value can be changed as per requirements.
17.
Which command is used to run HBase Shell?
./bin/hbase shell command is used to run the HBase
shell. Execute this command in HBase directory.
18.
Which command is used to show the current HBase user?
whoami command is used to show HBase user.
19.
What is the full form of MSLAB?
MSLAB stands for MemStore-Local Allocation Buffer. Whenever a request thread
needs to insert data into a MemStore, it does not allocate space for that data
from the heap at large, but rather from a memory arena dedicated to the target
region.
20.
Define LZO?
Lempel-Ziv-Oberhumer (LZO) is a lossless data compression
algorithm that focuses on decompression speed.
21.
What is HBase Fsck?
HBase comes with a tool called hbck which is implemented
by the HBaseFsck class. HBaseFsck (hbck) is a tool for checking for region
consistency and table integrity problems and repairing a corrupted HBase. It
works in two basic modes – a read-only inconsistency identifying mode and a
multi-phase read-write repair mode.
22.
What is REST?
REST stands for Representational State Transfer, which defines the semantics
so that the protocol can be used in a generic way to address remote resources.
It also provides support for different message formats, offering many choices
for a client application to communicate with the server.
23.
What is Thrift?
Apache Thrift is a framework for cross-language services and RPC. Its code
base is written in C++, but it provides schema compilers for many programming
languages, including Java, C++, Perl, PHP, Python, Ruby, and more, which is
why HBase can offer a Thrift gateway to non-Java clients.
24.
What is Nagios?
Nagios is a very commonly used support tool for gaining
qualitative data regarding cluster status. It polls current metrics on a
regular basis and compares them with given thresholds.
25.
What is the use of ZooKeeper?
The ZooKeeper is used to maintain the configuration
information and communication between region servers and clients. It also
provides distributed synchronization. It helps in maintaining server state
inside the cluster by communicating through sessions.
Every Region Server, along with the HMaster server, sends a continuous
heartbeat at regular intervals to ZooKeeper, which checks which servers are
alive and available. It also provides server-failure notifications so that
recovery measures can be executed.
26.
Define catalog tables in HBase?
Catalog tables are used to maintain the metadata
information.
27.
Define compaction in HBase?
HBase combines HFiles to reduce storage and to reduce the number of disk
seeks needed for a read. This process is called compaction. Compaction chooses
some HFiles from a region and combines them. There are two types of compaction
(a shell sketch follows the list):
- Minor compaction: HBase automatically picks some of the smaller HFiles and
rewrites them into fewer, bigger HFiles.
- Major compaction: HBase merges all the HFiles of a region (per column
family) into a single new HFile.
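Both compaction types can also be triggered manually from the HBase shell; ‘users’ is again a hypothetical table name:
compact 'users'        # minor compaction
major_compact 'users'  # major compaction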
28.
What is the use of HColumnDescriptor class?
HColumnDescriptor stores the information about a column
family like compression settings, number of versions etc. It is used as input
when creating a table or adding a column.
29.
Which filter accepts the pagesize as the parameter in hBase?
PageFilter accepts the page size as its parameter. It is an implementation of
the Filter interface that limits results to a specific page size. It
terminates scanning once the number of filter-passed rows is greater than the
given page size.
Syntax: PageFilter (<page_size>)
30.
How will you design or modify schema in HBase programmatically?
HBase schemas can be created or updated using the Apache
HBase Shell or by using Admin in the Java API.
Creating table schema:
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config); // execute commands through admin

// Instantiating the table descriptor class
HTableDescriptor t1 = new HTableDescriptor(TableName.valueOf("employee"));

// Adding column families to t1
t1.addFamily(new HColumnDescriptor("professional"));
t1.addFamily(new HColumnDescriptor("personal"));

// Create the table through admin
admin.createTable(t1);
♣ Tip: Tables must be disabled
when making ColumnFamily modifications.
For
modification:
String table = "myTable";
admin.disableTable(table);
admin.modifyColumn(table, cf2); // modifying existing ColumnFamily
admin.enableTable(table);
31. What filters are available in Apache HBase?
The filters that are supported by HBase are:
- ColumnPrefixFilter:
takes a single argument, a column prefix. It returns only those key-values
present in a column that starts with the specified column prefix.
- TimestampsFilter:
takes a list of timestamps. It returns those key-values whose timestamps
match any of the specified timestamps.
- PageFilter: takes one argument, a page size. It returns page-size number of
rows from the table.
- MultipleColumnPrefixFilter:
takes a list of column prefixes. It returns key-values that are present in
a column that starts with any of the specified column prefixes.
- ColumnPaginationFilter:
takes two arguments, a limit and an offset. It returns limit number of
columns after offset number of columns. It does this for all the rows.
- SingleColumnValueFilter:
takes a column family, a qualifier, a comparison operator and a
comparator. If the specified column is not found, all the columns of that
row will be emitted. If the column is found and the comparison with the
comparator returns true, all the columns of the row will be emitted.
- RowFilter:
takes a comparison operator and a comparator. It compares each row key
with the comparator using the comparison operator and if the comparison
returns true, it returns all the key-values in that row.
- QualifierFilter:
takes a comparison operator and a comparator. It compares each qualifier
name with the comparator using the comparison operator and if the
comparison returns true, it returns all the key-values in that column.
- ColumnRangeFilter:
takes either minColumn, maxColumn, or both. Returns only those keys with
columns that are between minColumn and maxColumn. It also takes two
boolean variables to indicate whether to include the minColumn and
maxColumn or not. If you don’t want to set the minColumn or the maxColumn,
you can pass in an empty argument.
- ValueFilter:
takes a comparison operator and a comparator. It compares each value with
the comparator using the compare operator and if the comparison returns
true, it returns that key-value.
- PrefixFilter:
takes a single argument, a prefix of a row key. It returns only those
key-values present in a row that start with the specified row prefix.
- SingleColumnValueExcludeFilter:
takes the same arguments and behaves same as SingleColumnValueFilter.
However, if the column is found and the condition passes, all the columns
of the row will be omitted except for the tested column value.
- ColumnCountGetFilter:
takes one argument, a limit. It returns the first limit number of columns
in the table.
- InclusiveStopFilter:
takes one argument, a row key on which to stop scanning. It returns all
key-values present in rows up to and including the specified row.
- DependentColumnFilter: takes two required arguments, a family and a
qualifier. It tries to locate this column in each row and returns all
key-values in that row that have the same timestamp.
- FirstKeyOnlyFilter:
takes no arguments. Returns the key portion of the first key-value pair.
- KeyOnlyFilter:
takes no arguments. Returns the key portion of each key-value pair.
- FamilyFilter:
takes a comparison operator and comparator. It compares each family name
with the comparator using the comparison operator and if the comparison
returns true, it returns all the key-values in that family.
- CustomFilter: you can create a custom filter by extending the
Filter/FilterBase class.
32.
How do we back up a HBase cluster?
There are two broad strategies for performing HBase backups: backing up with
a full cluster shutdown, and backing up on a live cluster. Each approach has
its benefits and limitations.
Full Shutdown Backup
Some environments can tolerate a periodic full shutdown
of their HBase cluster, for example, if it is being used as a back-end process
and not serving front-end webpages.
- Stop
HBase: Stop the HBase services first.
- Distcp: distcp can be used to copy the contents of the HBase directory in
HDFS either to another directory on the same cluster, or to a different
cluster.
- Restore: the backup of the HBase directory is copied from HDFS onto the
‘real’ HBase directory via distcp. The act of copying these files creates new
HDFS metadata, which is why a restore of the NameNode edits from the time of
the HBase backup isn’t required for this kind of restore: it is a restore (via
distcp) of a specific HDFS directory (i.e., the HBase part), not of the entire
HDFS file system.
Live Cluster Backup
Environments that cannot handle downtime use a live cluster backup.
- CopyTable:
Copy table utility could either be used to copy data from one table to
another on the same cluster, or to copy data to another table on another
cluster.
- Export:
Export approach dumps the content of a table to HDFS on the same cluster.
33.
How does HBase handle write failures?
Failures are common in large distributed systems, and HBase is no exception.
If the server hosting a MemStore crashes before the MemStore has been flushed,
the data that was in memory but not yet persisted is lost. HBase safeguards
against this by writing to the WAL before the write completes. Every server
that is part of the HBase cluster keeps a WAL to record changes as they
happen. The WAL is a file on the underlying file system. A write isn’t
considered successful until the new WAL entry is successfully written. This
guarantee makes HBase as durable as the file system backing it. Most of the
time, HBase is backed by the Hadoop Distributed File System (HDFS). If HBase
goes down, the data that was not yet flushed from the MemStore to the HFile
can be recovered by replaying the WAL.
34.
While reading data from HBase, from which three places data will be reconciled
before returning the value?
The read process will go through the following process
sequentially:
- For reading the data, the scanner first looks for the
Row cell in Block cache. Here all the recently read key value pairs are
stored.
- If the scanner fails to find the required result, it moves to the MemStore,
which, as we know, is the write cache. There, it searches for the most
recently written data that has not yet been flushed to an HFile.
- At last, it will use Bloom filters and the block cache to load the data from
the HFile.
35.
Can you explain data versioning?
In addition to being a schema-less database, HBase is
also versioned.
Every time you perform an operation on a cell, HBase implicitly stores a new
version. Creating, modifying and deleting a cell are all treated identically;
they all create new versions. When a cell exceeds the maximum number of
versions, the extra records are dropped during the major compaction.
Instead of deleting an entire cell, you can operate on a specific version
within that cell. Values within a cell are versioned and are identified by
their timestamp. If a version is not specified, the current timestamp is used
to retrieve the version. The default number of cell versions is three.
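A small HBase shell sketch around versioning, again using the hypothetical ‘users’ table and ‘personal’ column family:
alter 'users', {NAME => 'personal', VERSIONS => 5}                # keep up to 5 versions per cell
get 'users', 'row1', {COLUMN => 'personal:name', VERSIONS => 3}   # read the 3 most recent versions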
36.
What is a Bloom filter and how does it help in searching rows?
HBase supports Bloom filters to improve the overall throughput of the
cluster. An HBase Bloom filter is a space-efficient mechanism to test whether
an HFile contains a specific row or row-column cell.
Without a Bloom filter, the only way to decide whether a row key is present in
an HFile is to check the HFile’s block index, which stores the start row key
of each block in the HFile. Many rows can fall between two start keys, so
HBase has to load the block and scan the block’s keys to figure out whether
that row key actually exists.
Top
50 Hadoop Interview Questions for 2017
1. Explain “Big Data” and what are five V’s of Big Data?
“Big data” is the term for a collection of large and complex data sets that
are difficult to process using relational database management tools or
traditional data processing applications. Big data is difficult to capture,
curate, store, search, share, transfer, analyze, and visualize. Big Data has
emerged as an opportunity for companies: they can now successfully derive
value from their data and gain a distinct advantage over their competitors
through enhanced business decision-making capabilities.
♣ Tip: It will be a good idea to
talk about the 5Vs in such questions, whether it is asked specifically or not!
Volume:
The volume represents the amount of data which is growing at an exponential
rate i.e. in Petabytes and Exabytes.
Velocity:
Velocity refers to the rate at which data is growing, which is very fast.
Today, yesterday’s data are considered as old data. Nowadays, social media is a
major contributor in the velocity of growing data.
Variety: Variety refers to the heterogeneity of data types. In other words,
the data gathered comes in a variety of formats like videos, audios, CSV, etc.
These various formats represent the variety of data.
Veracity: Veracity refers to data in doubt, or uncertainty about the data
available, due to data inconsistency and incompleteness. The data available
can sometimes be messy and difficult to trust. With many forms of big data,
quality and accuracy are difficult to control, and the volume is often the
reason behind the lack of quality and accuracy in the data.
Value: It is all well and good to have access to big data, but unless we can
turn it into value it is useless. By turning it into value I mean: is it
adding to the benefits of the organization? Is the organization working on Big
Data achieving a high ROI (Return On Investment)? Unless it adds to their
profits, working on Big Data is useless.
As
we know Big Data is growing at an accelerating rate, so the factors associated
with it are also evolving. To go through them and understand it in detail, I
recommend you to go through Big Data
Tutorial blog.
2. What is Hadoop and what are its components?
When
“Big Data” emerged as a problem, Apache Hadoop evolved as a solution to it.
Apache Hadoop is a framework which provides us various services or tools to
store and process Big Data. It helps in analyzing Big Data and making business
decisions out of it, which can’t be done efficiently and effectively using
traditional systems.
♣ Tip: Now, while explaining
Hadoop, you should also explain the main components of Hadoop, i.e.:
Storage
unit– HDFS (NameNode, DataNode)
Processing
framework– YARN (ResourceManager, NodeManager)
3. What are HDFS and YARN?
HDFS (Hadoop Distributed File System)
is the storage unit of Hadoop. It is responsible for storing different kinds of
data as blocks in a distributed environment. It follows master and slave
topology.
♣ Tip: It is recommended to
explain the HDFS components too i.e.
NameNode: NameNode is the master node in
the distributed environment and it maintains the metadata information for the
blocks of data stored in HDFS like block location, replication factors etc.
DataNode: DataNodes are the slave nodes,
which are responsible for storing data in the HDFS. NameNode manages all the
DataNodes.
YARN (Yet Another Resource Negotiator)
is the processing framework in Hadoop, which manages resources and provides an
execution environment to the processes.
♣ Tip: Similarly,
as we did in HDFS, we should also explain the two components of YARN:
ResourceManager: It receives the processing
requests, and then passes the parts of requests to corresponding NodeManagers
accordingly, where the actual processing takes place. It allocates resources to
applications based on the needs.
NodeManager: NodeManager is installed on every
DataNode and it is responsible for execution of the task on every single
DataNode.
If
you want to learn in detail about HDFS & YARN go through Hadoop Tutorial blog.
4. Tell me
about the various Hadoop daemons and their roles in a Hadoop cluster.
Generally, approach this question by first explaining the HDFS daemons, i.e.
NameNode, DataNode and Secondary NameNode, then moving on to the YARN daemons,
i.e. ResourceManager and NodeManager, and lastly explaining the
JobHistoryServer.
NameNode: It is the master node which is
responsible for storing the metadata of all the files and directories. It has
information about blocks, that make a file, and where those blocks are located
in the cluster.
Datanode: It is the slave node that
contains the actual data.
Secondary
NameNode: It
periodically merges the changes (edit log) with the FsImage (Filesystem Image),
present in the NameNode. It stores the modified FsImage into persistent
storage, which can be used in case of failure of NameNode.
ResourceManager: It is the central authority that
manages resources and schedule applications running on top of YARN.
NodeManager: It runs on slave machines, and is
responsible for launching the application’s containers (where applications
execute their part), monitoring their resource usage (CPU, memory, disk,
network) and reporting these to the ResourceManager.
JobHistoryServer: It maintains information about
MapReduce jobs after the Application Master terminates.
Hadoop
HDFS
5. Compare HDFS
with Network Attached Storage (NAS).
In
this question, first explain NAS and HDFS, and then compare their features as
follows:
- Network-attached
storage (NAS) is a file-level computer data storage server connected to a
computer network providing data access to a heterogeneous group of
clients. NAS can either be a hardware or software which provides services
for storing and accessing files. Whereas Hadoop Distributed File System
(HDFS) is a distributed filesystem to store data using commodity hardware.
- In HDFS
Data Blocks are distributed across all the machines in a cluster. Whereas
in NAS data is stored on a dedicated hardware.
- HDFS is
designed to work with MapReduce paradigm, where computation is moved to
the data. NAS is not suitable for MapReduce since data is stored
separately from the computations.
- HDFS
uses commodity hardware which is cost effective, whereas a NAS is a
high-end storage devices which includes high cost.
6. What are
the basic differences between relational database and HDFS?
Here are the key differences between a relational database (RDBMS) and Hadoop:
- Data Types: An RDBMS relies on structured data and the schema of the data is
always known. Hadoop can store any kind of data, be it structured,
unstructured or semi-structured.
- Processing: An RDBMS provides limited or no processing capabilities. Hadoop
allows us to process the data distributed across the cluster in a parallel
fashion.
- Schema on Read vs. Write: An RDBMS is based on ‘schema on write’, where
schema validation is done before loading the data. Hadoop, on the contrary,
follows the schema-on-read policy.
- Read/Write Speed: In an RDBMS, reads are fast because the schema of the data
is already known. In HDFS, writes are fast because no schema validation
happens during an HDFS write.
- Cost: An RDBMS is licensed software, so you have to pay for it. Hadoop is an
open source framework, so you don’t need to pay for the software.
- Best Fit Use Case: An RDBMS is used for OLTP (Online Transactional
Processing) systems, whereas Hadoop is used for data discovery, data analytics
or OLAP systems.
7. List the difference between Hadoop 1 and Hadoop 2.
This
is an important question and while answering this question, we have to mainly
focus on two points i.e. Passive NameNode and YARN architecture.
In
Hadoop 1.x, “NameNode” is the single point of failure. In Hadoop 2.x, we have
Active and Passive “NameNodes”. If the active “NameNode” fails, the passive
“NameNode” takes charge. Because of this, high availability can be achieved in
Hadoop 2.x.
Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you
can now run multiple applications in Hadoop, all sharing a common pool of
resources. MRV2 is a particular type of distributed application that runs the
MapReduce framework on top of YARN. Similarly, other tools can also perform
data processing via YARN, which was a problem in Hadoop 1.x.
8. What are active and passive “NameNodes”?
In
HA (High Availability) architecture, we have two NameNodes – Active “NameNode”
and Passive “NameNode”.
- Active
“NameNode” is the “NameNode” which works and runs in the cluster.
- Passive
“NameNode” is a standby “NameNode”, which has similar data as active
“NameNode”.
When
the active “NameNode” fails, the passive “NameNode” replaces the active
“NameNode” in the cluster. Hence, the cluster is never without a “NameNode” and
so it never fails.
9. Why does
one remove or add nodes in a Hadoop cluster frequently?
One of the most attractive features of the Hadoop framework is its
utilization of commodity hardware. However, this leads to frequent “DataNode”
crashes in a Hadoop cluster. Another striking feature of the Hadoop framework
is the ease of scaling in accordance with the rapid growth in data volume.
Because of these two reasons, one of the most common tasks of a Hadoop
administrator is to commission (add) and decommission (remove) “DataNodes” in
a Hadoop cluster.
Read
this blog to get a detailed understanding on commissioning
and decommissioning nodes in a Hadoop cluster.
10. What happens when two clients try to access the same file in the
HDFS?
HDFS supports
exclusive writes only.
When
the first client contacts the “NameNode” to open the file for writing, the
“NameNode” grants a lease to the client to create this file. When the second
client tries to open the same file for writing, the “NameNode” will notice that
the lease for the file is already granted to another client, and will reject
the open request for the second client.
11. How does NameNode tackle DataNode failures?
NameNode
periodically receives a Heartbeat (signal) from each of the DataNode in the
cluster, which implies DataNode is functioning properly.
A
block report contains a list of all the blocks on a DataNode. If a DataNode
fails to send a heartbeat message, after a specific period of time it is marked
dead.
The
NameNode replicates the blocks of dead node to another DataNode using the
replicas created earlier.
12. What will you do when NameNode is down?
The
NameNode recovery process involves the following steps to make the
Hadoop cluster up and running:
1.
Use the file system metadata replica (FsImage) to start a new NameNode.
2. Then, configure the DataNodes and clients so that they can acknowledge this new NameNode.
3. Now the new NameNode will start serving the clients after it has finished loading the last checkpoint FsImage (for metadata information) and received enough block reports from the DataNodes.
However, on large Hadoop clusters this NameNode recovery process may consume
a lot of time, and this becomes an even greater challenge in the case of
routine maintenance. Therefore, we have the HDFS High Availability
architecture, which is covered in the HA architecture blog.
13. What is a
checkpoint?
In
brief, “Checkpointing” is a process that takes an FsImage, edit log and
compacts them into a new FsImage. Thus, instead of replaying an edit log, the
NameNode can load the final in-memory state directly from the FsImage. This is
a far more efficient operation and reduces NameNode startup time. Checkpointing
is performed by Secondary NameNode.
14. How is
HDFS fault tolerant?
When data is stored in HDFS, the NameNode replicates the data to several
DataNodes. The default replication factor is 3. You can change the replication
factor as per your need. If a DataNode goes down, the NameNode will
automatically copy the data to another node from the replicas and make the
data available. This provides fault tolerance in HDFS.
15. Can
NameNode and DataNode be a commodity hardware?
The smart answer to this question would be: DataNodes are commodity hardware,
like personal computers and laptops, as they store data and are required in
large numbers. But from your experience you can tell that the NameNode is the
master node and it stores metadata about all the blocks stored in HDFS. It
requires high memory (RAM) space, so the NameNode needs to be a high-end
machine with good memory space.
16. Why do we
use HDFS for applications having large data sets and not when there are a lot
of small files?
HDFS is more suitable for a large amount of data in a single file than for a
small amount of data spread across multiple files. As you know, the NameNode
stores the metadata information regarding the file system in RAM. Therefore,
the amount of memory puts a limit on the number of files in my HDFS file
system. In other words, too many files will lead to the generation of too much
metadata, and storing this metadata in RAM will become a challenge. As a rule
of thumb, the metadata for a file, block or directory takes about 150 bytes.
As a rough estimate, 10 million small files, each occupying its own block,
means roughly 20 million objects (one per file plus one per block), i.e.
around 20,000,000 × 150 bytes ≈ 3 GB of NameNode memory for metadata alone.
17. How do you
define “block” in HDFS? What is the default block size in Hadoop 1 and in
Hadoop 2? Can it be changed?
Blocks are nothing but the smallest continuous location on your hard drive
where data is stored. HDFS stores each file as blocks and distributes them
across the Hadoop cluster. Files in HDFS are broken down into block-sized
chunks, which are stored as independent units.
Hadoop
1 default block size: 64 MB
Hadoop
2 default block size: 128 MB
Yes, blocks can be configured. The dfs.block.size parameter can be set in the
hdfs-site.xml file to change the size of a block in a Hadoop environment, as
the sketch below shows.
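A minimal hdfs-site.xml sketch for a 128 MB block size; the value is given in bytes, and in Hadoop 2 the same setting is also exposed under the newer name dfs.blocksize:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>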
18. What does
‘jps’ command do?
The
‘jps’ command helps us to check if the Hadoop daemons are running or not. It
shows all the Hadoop daemons i.e namenode, datanode, resourcemanager,
nodemanager etc. that are running on the machine.
19. How do you
define “Rack Awareness” in Hadoop?
Rack Awareness is the algorithm by which the “NameNode” decides how blocks
and their replicas are placed, based on rack definitions, so as to minimize
network traffic between “DataNodes” in different racks while retaining fault
tolerance. For example, with the default replication factor of 3, the policy
is that “for every block of data, two copies will exist in one rack and the
third copy in a different rack”. This rule is known as the “Replica Placement
Policy”.
To
know rack awareness in more detail, refer to the HDFS
architecture blog.
20. What is
“speculative execution” in Hadoop?
If
a node appears to be executing a task slower, the master node can redundantly
execute another instance of the same task on another node. Then, the task which
finishes first will be accepted and the other one is killed. This process is
called “speculative execution”.
21. How can I
restart “NameNode” or all the daemons in Hadoop?
This
question can have two answers, we will discuss both the answers. We can restart
NameNode by following methods:
- You can stop the NameNode individually using the ./sbin/hadoop-daemon.sh
stop namenode command and then start the NameNode using the
./sbin/hadoop-daemon.sh start namenode command.
- To stop and start all the daemons, use ./sbin/stop-all.sh and then
./sbin/start-all.sh, which will stop all the daemons first and then start all
of them.
These
script files reside in the sbin directory inside the Hadoop directory.
22. What is
the difference between an “HDFS Block” and an “Input Split”?
The
“HDFS Block” is the physical division of the data while “Input Split” is the
logical division of the data. HDFS divides data in blocks for storing the
blocks together, whereas for processing, MapReduce divides the data into the
input split and assign it to mapper function.
23. Name the
three modes in which Hadoop can run.
The
three modes in which Hadoop can run are as follows:
- Standalone (local) mode: This is the default mode if we don’t configure
anything. In this mode, all the components of Hadoop, such as NameNode,
DataNode, ResourceManager, and NodeManager, run as a single Java process. This
mode uses the local filesystem.
- Pseudo-distributed mode: A single-node Hadoop deployment is considered to be
running in pseudo-distributed mode. In this mode, all the Hadoop services,
including both the master and the slave services, are executed on a single
compute node.
- Fully distributed mode: A Hadoop deployment in which the Hadoop master and
slave services run on separate nodes is said to be in fully distributed mode.
Hadoop
MapReduce
24. What is
“MapReduce”? What is the syntax to run a “MapReduce” program?
It is a framework/programming model that is used for processing large data
sets over a cluster of computers using parallel programming. The syntax to run
a MapReduce program is: hadoop jar <jar_file_name> <class_name> /input_path
/output_path (for example, hadoop jar wordcount.jar WordCount /input /output,
where the jar and class names are hypothetical).
If
you have any doubt in MapReduce or want to revise your concepts you can refer
this MapReduce
tutorial.
25. What are
the main configuration parameters in a “MapReduce” program?
The
main configuration parameters which users need to specify in “MapReduce”
framework are:
- Job’s
input locations in the distributed file system
- Job’s
output location in the distributed file system
- Input
format of data
- Output
format of data
- Class
containing the map function
- Class
containing the reduce function
- JAR file
containing the mapper, reducer and driver classes
26. State the
reason why we can’t perform “aggregation” (addition) in mapper? Why do we need
the “reducer” for this?
This
answer includes many points, so we will go through them sequentially.
- We cannot
perform “aggregation” (addition) in mapper because sorting does not occur
in the “mapper” function. Sorting occurs only on the reducer side and
without sorting aggregation cannot be done.
- During “aggregation”, we need the output of all the mapper functions, which
may not be possible to collect in the map phase, as the mappers may be running
on different machines where the data blocks are stored.
- And
lastly, if we try to aggregate data at mapper, it requires communication
between all mapper functions which may be running on different machines.
So, it will consume high network bandwidth and can cause network
bottlenecking.
27. What is
the purpose of “RecordReader” in Hadoop?
The
“InputSplit” defines a slice of work, but does not describe how to access it.
The “RecordReader” class loads the data from its source and converts it into
(key, value) pairs suitable for reading by the “Mapper” task. The
“RecordReader” instance is defined by the “Input Format”.
28. Explain
“Distributed Cache” in a “MapReduce Framework”.
Distributed Cache can be explained as a facility provided by the MapReduce
framework to cache files needed by applications. Once you have cached a file
for your job, the Hadoop framework will make it available on each and every
DataNode where your map/reduce tasks are running. You can then access the
cached file as a local file in your Mapper or Reducer job.
29. How do
“reducers” communicate with each other?
This
is a tricky question. The “MapReduce” programming model does not allow
“reducers” to communicate with each other. “Reducers” run in isolation.
30.
What does a “MapReduce Partitioner” do?
A
“MapReduce Partitioner” makes sure that all the values of a single key go to
the same “reducer”, thus allowing even distribution of the map output over the
“reducers”. It redirects the “mapper” output to the “reducer” by determining
which “reducer” is responsible for the particular key.
31. How will
you write a custom partitioner?
A custom partitioner for a Hadoop job can be written easily by following the
steps below (a sketch follows the list):
- Create a new class that extends the Partitioner class.
- Override its getPartition method with the partitioning logic to be applied
in the MapReduce job.
- Add the custom partitioner to the job by using the Job.setPartitionerClass
method, or add the custom partitioner to the job through a config file.
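A minimal sketch of such a partitioner; the class name and the key/value types (Text keys, IntWritable values) are hypothetical choices for illustration:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0; // nothing to partition over
        }
        // Route keys to reducers by the first character of the key.
        String k = key.toString();
        char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
        return first % numReduceTasks;
    }
}
In the driver, it would then be registered with job.setPartitionerClass(FirstLetterPartitioner.class);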
32. What is a
“Combiner”?
A
“Combiner” is a mini “reducer” that performs the local “reduce” task. It
receives the input from the “mapper” on a particular “node” and sends the
output to the “reducer”. “Combiners” help in enhancing the efficiency of
“MapReduce” by reducing the quantum of data that is required to be sent to the
“reducers”.
33. What do
you know about “SequenceFileInputFormat”?
“SequenceFileInputFormat” is an input format for reading data from sequence
files. It is a specific compressed binary file format which is optimized for
passing data between the output of one “MapReduce” job and the input of some
other “MapReduce” job.
Sequence
files can be generated as the output of other MapReduce tasks and are an
efficient intermediate representation for data that is passing from one
MapReduce job to another.

Apache
Pig
34. What are
the benefits of Apache Pig over MapReduce?
Apache Pig is a platform, developed by Yahoo, used to analyze large data sets
by representing them as data flows. It is designed to provide an abstraction
over MapReduce, reducing the complexity of writing a MapReduce program.
- Pig
Latin is a high-level data flow language, whereas MapReduce is a low-level
data processing paradigm.
- Without
writing complex Java implementations in MapReduce, programmers can achieve
the same implementations very easily using Pig Latin.
- Apache
Pig reduces the length of the code by approx 20 times (according
to Yahoo). Hence, this reduces the development period by almost 16
times.
- Pig
provides many built-in operators to support data operations like joins,
filters, ordering, sorting etc. Whereas to perform the same function in
MapReduce is a humongous task.
- Performing
a Join operation in Apache Pig is simple. Whereas it is difficult in
MapReduce to perform a Join operation between the data sets, as it
requires multiple MapReduce tasks to be executed sequentially to fulfill
the job.
- In addition,
pig also provides nested data types like tuples, bags, and maps that are
missing from MapReduce.
35. What are
different data types in Pig Latin?
Pig
Latin can handle both atomic data types like int, float, long, double etc. and
complex data types like tuple, bag and map.
Atomic
data types: Atomic or scalar data types are the basic data types which are used
in all the languages like string, int, float, long, double, char[], byte[].
Complex
Data Types: Complex data types are Tuple, Map and Bag.
To
know more about these data types, you can go through our Pig
tutorial blog.
36. What are the different relational operations in “Pig Latin” you
worked with?
Different
relational operators are:
- for each
- order by
- filters
- group
- distinct
- join
- limit
37. What is a
UDF?
If
some functions are unavailable in built-in operators, we can programmatically
create User Defined Functions (UDF) to bring those functionalities using other
languages like Java, Python, Ruby, etc. and embed it in Script file.
Apache
Hive
38. What is
“SerDe” in “Hive”?
Apache Hive is a data warehouse system, developed by Facebook, built on top
of Hadoop and used for analyzing structured and semi-structured data. Hive
abstracts the complexity of Hadoop MapReduce.
The
“SerDe” interface allows you to instruct “Hive” about how a record should be
processed. A “SerDe” is a combination of a “Serializer” and a “Deserializer”.
“Hive” uses “SerDe” (and “FileFormat”) to read and write the table’s row.
To
know more about Apache Hive, you can go through this Hive
tutorial blog.
39. Can the
default “Hive Metastore” be used by multiple users (processes) at the same
time?
“Derby
database” is the default “Hive Metastore”. Multiple users (processes) cannot
access it at the same time. It is mainly used to perform unit tests.
40. What is
the default location where “Hive” stores table data?
The
default location where Hive stores table data is inside HDFS in
/user/hive/warehouse.
Apache
HBase
41. What is
Apache HBase?
HBase
is an open source, multidimensional, distributed, scalable and a NoSQL database
written in Java. HBase runs on top of HDFS (Hadoop Distributed File System) and
provides BigTable (Google) like capabilities to Hadoop. It is designed to
provide a fault tolerant way of storing large collection of sparse data sets.
HBase achieves high throughput and low latency by providing faster Read/Write
Access on huge data sets.
To
know more about HBase you can go through our HBase
tutorial blog.
42. What are
the components of Apache HBase?
HBase
has three major components, i.e. HMaster Server, HBase RegionServer and Zookeeper.
- Region Server: A table can
be divided into several regions. A group of regions is served to the
clients by a Region Server.
- HMaster: It coordinates and manages the Region Servers (similar to how the
NameNode manages DataNodes in HDFS).
- ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed
environment. It helps in maintaining server state inside the cluster by
communicating through sessions.
To
know more, you can go through this HBase
architecture blog.
43. What are
the components of Region Server?
The
components of a Region Server are:
- WAL: Write Ahead Log (WAL)
is a file attached to every Region Server inside the distributed
environment. The WAL stores the new data that hasn’t been persisted or
committed to the permanent storage.
- Block Cache: Block Cache
resides in the top of Region Server. It stores the frequently read data in
the memory.
- MemStore: It is the write
cache. It stores all the incoming data before committing it to the disk or
permanent memory. There is one MemStore for each column family in a
region.
- HFile: HFile is stored in
HDFS. It stores the actual cells on the disk.
44. Explain
“WAL” in HBase?
Write
Ahead Log (WAL) is a file attached to every Region Server inside the
distributed environment. The WAL stores the new data that hasn’t been persisted
or committed to the permanent storage. It is used in case of failure to recover
the data sets.
45. Mention
the differences between “HBase” and “Relational Databases”?
HBase
is an open source, multidimensional, distributed, scalable and a NoSQL database written in Java. HBase runs on top of
HDFS and provides BigTable like capabilities to Hadoop. Let us see the
differences between HBase and relational database.
- HBase is schema-less; a relational database is a schema-based database.
- HBase is a column-oriented data store; a relational database is a
row-oriented data store.
- HBase is used to store de-normalized data; a relational database is used to
store normalized data.
- HBase contains sparsely populated tables; a relational database contains
thin tables.
- Automated partitioning is done in HBase; in a relational database there is
no such provision or built-in support for partitioning.
Apache
Spark
46. What is
Apache Spark?
The
answer to this question is, Apache Spark is a framework for real time data
analytics in a distributed computing environment. It executes in-memory
computations to increase the speed of data processing.
It
is 100x faster than MapReduce for large scale data processing by
exploiting in-memory computations and other optimizations.
47. Can you
build “Spark” with any particular Hadoop version?
Yes,
one can build “Spark” for a specific Hadoop version. Check out this blog to
learn more about building
YARN and HIVE on Spark.
48. Define
RDD.
RDD is the acronym for Resilient Distributed Datasets – a fault-tolerant
collection of operational elements that run in parallel. The partitioned data
in an RDD is immutable and distributed, which is a key component of Apache
Spark.
Oozie
& ZooKeeper
49. What is
Apache ZooKeeper and Apache Oozie?
Apache
ZooKeeper coordinates with various services in a distributed environment. It
saves a lot of time by performing synchronization, configuration maintenance,
grouping and naming.
Apache
Oozie is a scheduler which schedules Hadoop jobs and binds them together as one
logical work. There are two kinds of Oozie jobs:
- Oozie Workflow: These are sequential sets of actions to be executed. You
can think of it as a relay race, where each athlete waits for the previous one
to complete their part.
- Oozie Coordinator: These
are the Oozie jobs which are triggered when the data is made available to
it. Think of this as the response-stimuli system in our body. In the same
manner as we respond to an external stimulus, an Oozie coordinator
responds to the availability of data and it rests otherwise.
50. How do you
configure an “Oozie” job in Hadoop?
“Oozie”
is integrated with the rest of the Hadoop stack supporting several types of
Hadoop jobs such as “Java MapReduce”, “Streaming MapReduce”, “Pig”, “Hive” and
“Sqoop”.
To
understand “Oozie” in detail and learn how to configure an “Oozie” job, do
check out this introduction to Apache
Oozie blog.
Feeling overwhelmed with all the questions the interviewer might ask in your
Hadoop interview? Now it is time to go through a series of Hadoop interview
questions that cover different aspects of the Hadoop framework. It’s never too
late to strengthen your basics. Learn Hadoop from industry experts while
working with real-life use cases.
