Thursday, 23 February 2017

Hadoop Administration Interview Questions

                             Hadoop Cluster Interview Questions

1. Which are the modes in which Hadoop can run?
We have three modes in which Hadoop can run and that are:
  • Standalone (local) mode: This is the default mode of Hadoop; it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required for the mapred-site.xml, core-site.xml, and hdfs-site.xml files.
  • Pseudo-distributed mode: In this case, you need configuration for all the three files mentioned above. In this case, all daemons are running on one node and thus, both Master and Slave nodes are on the same machine.
  • Fully distributed mode: This is the production phase of Hadoop where data is distributed across several nodes on a Hadoop cluster. Separate nodes are allotted as Master and Slaves.
2. What are the features of Standalone (local) mode?
In standalone mode, there are no daemons; everything runs in a single JVM. It has no DFS and it utilizes the local file system. Standalone mode is suitable only for running MapReduce programs during development and testing. It is one of the least used environments.
3. What are the features of Pseudo mode?
Pseudo-distributed mode is used both for development and for testing. In this mode, all the daemons run on the same machine.
4. What are the features of Fully Distributed mode?
This is an important question, as Fully Distributed mode is used in the production environment, where we have ‘n’ number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines. There is one host on which the NameNode runs and other hosts on which DataNodes run. A NodeManager is installed on every DataNode machine and is responsible for the execution of tasks on that DataNode. All these NodeManagers are managed by the ResourceManager, which receives the processing requests and then passes the parts of the requests to the corresponding NodeManagers.
5. What is configured in /etc/hosts and what is its role in setting Hadoop cluster?
This is a technical question which tests your basic concepts. The /etc/hosts file maps hostnames to their IP addresses. In a Hadoop cluster, we store the hostnames of all the nodes (master and slaves) with their IP addresses in /etc/hosts, so that we can use hostnames instead of IP addresses throughout the configuration; a sketch is given below.
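For illustration, the /etc/hosts file on every node of a small three-node cluster might look like this (the hostnames and private IP addresses below are purely hypothetical):
192.168.1.10    master
192.168.1.11    slave1
192.168.1.12    slave2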
6. What are the default port numbers of NameNode, ResourceManager & MapReduce JobHistory Server?
You are expected to remember basic server port numbers if you are working with Hadoop. The port number for corresponding daemons is as follows:
NameNode – 50070
ResourceManager – 8088
MapReduce JobHistory Server – 19888
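These defaults can be overridden in the configuration files. A hedged sketch using the Hadoop 2.x property names (the hostname master is hypothetical):
<!-- hdfs-site.xml -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>master:50070</value>
</property>
<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:8088</value>
</property>
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>master:19888</value>
</property>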
7. What are the main Hadoop configuration files?
Tip: Generally, approach this question by telling the 4 main configuration files in Hadoop and giving their brief descriptions to show your expertise.
  • core-site.xml: core-site.xml informs Hadoop daemon where NameNode runs on the cluster. It contains configuration settings of Hadoop core such as I/O settings that are common to HDFS & MapReduce.
  • hdfs-site.xml: hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.
  • mapred-site.xml: mapred-site.xml contains configuration settings of the MapReduce framework like number of JVM that can run in parallel, the size of the mapper and the reducer, CPU cores available for a process, etc.
  • yarn-site.xml: yarn-site.xml contains configuration settings of ResourceManager and NodeManager like application memory management size, the operation needed on program & algorithm, etc.
These files are located in the etc/hadoop/ directory inside the Hadoop installation directory (conf/ in older Hadoop 1.x releases).
8. How does the Hadoop CLASSPATH play a vital role in starting or stopping Hadoop daemons?
Tip: To check your knowledge on Hadoop the interviewer may ask you this question.
The CLASSPATH includes all the directories containing the jar files required to start or stop the Hadoop daemons. The CLASSPATH is set inside the /etc/hadoop/hadoop-env.sh file.
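As a minimal sketch, extra jars can be appended to the daemon classpath in hadoop-env.sh (the Java and jar paths below are hypothetical):
# hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/opt/extra-libs/my-extra.jar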
9. What is a spill factor with respect to the RAM?
Tip: This is a theoretical question, but if you add a practical taste to it, you might get a preference.
The map output is stored in an in-memory buffer; when this buffer is almost full, the spilling phase starts and moves the data to a temporary location on disk.
Map output is first written to this buffer, and the buffer size is decided by the mapreduce.task.io.sort.mb property. By default, it is 100 MB.
When the buffer reaches a certain threshold, it starts spilling the buffer data to disk. This threshold is specified by the mapreduce.map.sort.spill.percent property.
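A hedged example of tuning these two properties in mapred-site.xml (the values are illustrative, not recommendations):
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>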
10. What is the command to extract a compressed file in tar.gz format?
This is an easy question: the tar -xvf /file_location/filename.tar.gz command will extract the tar.gz compressed file.
11. How will you check whether Java and Hadoop are installed on your system?
By using the following commands we can check whether Java and Hadoop are installed and whether their paths are set inside the .bashrc file:
For checking Java – java -version 
For checking Hadoop – hadoop version
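For this to work, the paths are typically exported in .bashrc; a sketch with hypothetical installation paths:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin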
12. What is the default replication factor and how will you change it?
The default replication factor is 3.
Tip: Default Replication Factor could be changed in three ways. Answering all the three ways will show your expertise.
  1. By adding this property to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>5</value>
<description>Block Replication</description>
</property>
  2. Or you can change the replication factor on a per-file basis using the following command:
hadoop fs -setrep -w 3 /file_location
  3. Or you can change the replication factor for all the files in a directory using the following command:
hadoop fs -setrep -R -w 3 /directory_location
13. What is the full form of fsck?
The full form of fsck is File System Check. HDFS supports the fsck (filesystem check) command to check for various inconsistencies. It is designed for reporting the problems with the files in HDFS, for example, missing blocks of a file or under-replicated blocks.
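For example, a more detailed report can be requested with additional flags; a typical invocation would be:
hdfs fsck / -files -blocks -locations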
14. Which are the main hdfs-site.xml properties?
The three main hdfs-site.xml properties are:
  1. dfs.name.dir specifies the location where the NameNode stores its metadata (FsImage and edit logs), whether on a local disk or on a remote directory.
  2. dfs.data.dir specifies the location on the DataNodes where the actual data blocks are stored.
  3. fs.checkpoint.dir is the directory on the filesystem where the Secondary NameNode stores the temporary copies of the edit logs and the FsImage that it merges for checkpointing and backup.
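A minimal hdfs-site.xml sketch using the property names quoted above (the local paths are hypothetical):
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/datanode</value>
</property>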

15. What happens if you get a ‘connection refused java exception’ when you type hadoop fsck /?

If you get a ‘connection refused java exception’ when you type hadoop fsck /, it usually means that the NameNode is not running on your VM.
16. How can we view the compressed files via HDFS command?
We can view compressed files in HDFS using hadoop fs -text /filename command.
17. What is the command to move into safe mode and exit safe mode?
Tip: Approach this question by first explaining safe mode and then moving on to the commands.
Safe Mode in Hadoop is a maintenance state of the NameNode, during which the NameNode doesn’t allow any changes to the file system. During Safe Mode, the HDFS cluster is read-only and doesn’t replicate or delete blocks.
  • To know the status of safe mode, you can use the command: hdfs dfsadmin -safemode get
  • To enter safe mode: hdfs dfsadmin -safemode enter
  • To exit safe mode: hdfs dfsadmin -safemode leave
18. What does the ‘jps’ command do?
The jps command is used to check which Hadoop daemons, like the NameNode, DataNode, ResourceManager, NodeManager, etc., are running on the machine.
19. How can I restart Namenode?
This question has two answers, answering both will give you a plus point. We can restart NameNode by following methods:
  1. You can stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command and then start it again using ./sbin/hadoop-daemon.sh start namenode.
  2. Use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all the daemons first and then start them all again.
20. How can we check whether NameNode is working or not?
To check whether the NameNode is working or not, use the jps command; this will show all the running Hadoop daemons, and there you can check whether the NameNode daemon is running or not.
21. How can we look at the Namenode in the web browser UI?
If you want to look at the NameNode in the browser, the port number for the NameNode web UI is 50070. We can check it in a web browser using http://master:50070/dfshealth.jsp (where master is the NameNode hostname).
22. What are the different commands used to startup and shutdown Hadoop daemons?
Tip: Explain all the three ways of stopping and starting Hadoop daemons, this will show your expertise.
  1. ./sbin/start-all.sh to start all the Hadoop daemons and ./sbin/stop-all.sh to stop all the Hadoop daemons.
  2. Then you can start all the DFS daemons together using ./sbin/start-dfs.sh, the YARN daemons together using ./sbin/start-yarn.sh, and the MR Job History Server using ./sbin/mr-jobhistory-daemon.sh start historyserver. To stop them, similarly use ./sbin/stop-dfs.sh, ./sbin/stop-yarn.sh and ./sbin/mr-jobhistory-daemon.sh stop historyserver.
  3. The last way is to start all the daemons individually and stop them individually:
./sbin/hadoop-daemon.sh start namenode
./sbin/hadoop-daemon.sh start datanode
./sbin/yarn-daemon.sh start resourcemanager
./sbin/yarn-daemon.sh start nodemanager
./sbin/mr-jobhistory-daemon.sh start historyserver
and stop them similarly.
23. What does the slaves file consist of?
The slaves file consists of a list of hosts, one per line; these are the DataNode machines on which the NodeManager servers also run. An example is shown below.
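For example, the slaves file of a three-DataNode cluster (hostnames hypothetical) would simply be:
slave1
slave2
slave3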
24. What does the masters file consist of?
The masters file contains the hostname of the Secondary NameNode server.

25. What does hadoop-env.sh do?

hadoop-env.sh provides the environment for Hadoop to run. For example, JAVA_HOME, CLASSPATH, etc. are set here.

26. Where is hadoop-env.sh file present?

As we discussed earlier, all the configuration files reside in the same place, so the hadoop-env.sh file is present in the /etc/hadoop directory.

27. In HADOOP_PID_DIR, what does PID stand for? What does it do?

PID stands for ‘Process ID’. This directory stores the process IDs of the Hadoop daemons (servers) that are running.
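It can be redirected from the default temporary directory to a more durable location in hadoop-env.sh; a sketch with a hypothetical path:
export HADOOP_PID_DIR=/var/hadoop/pids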

28. What does hadoop-metrics.properties file do?

Tip: As this file is configured manually only in special cases, so answering this question will impress the interviewer indicating your expertise about configuration files.
hadoop-metrics.properties is used for ‘Performance Reporting’ purposes. It controls the reporting for Hadoop. The API is abstract so that it can be implemented on top of a variety of metrics client libraries. The choice of client library is a configuration option, and different modules within the same application can use different metrics implementation libraries. This file is stored inside /etc/hadoop.

29. What are the network requirements for Hadoop?

You should answer this question as follows: the Hadoop core uses Secure Shell (SSH) to communicate with the slaves and to launch the server processes on the slave nodes. It requires a password-less SSH connection between the master, all the slaves, and the secondary machines, so that it does not have to ask for authentication every time, because the master and slaves need to communicate constantly.

30. Why do we need a password-less SSH in Fully Distributed environment?

We need password-less SSH in a Fully Distributed environment because when the cluster is live and running, the communication is very frequent. The DataNode and the NodeManager should be able to send messages quickly to the master server. A key-pair setup such as the one sketched below is a common way to achieve this.
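A common way to set this up is to generate a key pair on the master and copy the public key to every slave (the username and hostnames below are hypothetical):
ssh-keygen -t rsa -P ""
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2
ssh slave1    # should now log in without prompting for a password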

31. Does this lead to security issues?

No, not at all. A Hadoop cluster is an isolated cluster and generally has nothing to do with the internet. It has a different kind of configuration, so we needn’t worry about that kind of security breach, for instance, someone hacking it through the internet. Hadoop has a fairly secure way to connect to other machines to fetch and process data.

32. On which port does SSH work?

SSH works on Port No. 22, though it can be configured. 22 is the default Port number.

33. Can you tell us more about SSH?

SSH (Secure Shell) is a protocol that works on port 22 by default; when you SSH into a machine, what you normally require is a password (or a key pair) to connect to it. SSH is used not only between masters and slaves, but can be used between any two hosts.

34. What happens to a NameNode, when ResourceManager is down?

When the ResourceManager is down, jobs cannot be submitted, but the NameNode remains up. So the cluster (HDFS) is still accessible as long as the NameNode is working, even if the ResourceManager is not.

35. How can we format HDFS?

Tip: Attempt this question by starting with the command to format the HDFS and then explain what this command does.
The Hadoop Distributed File System (HDFS) can be formatted using the bin/hadoop namenode -format command. This command formats the HDFS via the NameNode and is only executed the first time the cluster is set up. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. If you run this command on an existing filesystem, you will lose all the data stored on your NameNode. Formatting a NameNode does not format the DataNodes: it formats the FsImage and edit logs stored on the NameNode, which means losing the information about the location of the blocks stored in HDFS.
Never format an up-and-running Hadoop filesystem; you will lose all the data stored in HDFS.

36. Can we use Windows for Hadoop?

Red Hat Linux and Ubuntu are the best Operating Systems for Hadoop. Windows is not used frequently for installing Hadoop as there are many support problems attached with Windows. Thus, Windows is not a preferred environment for Hadoop.

                                                Hadoop MapReduce Interview Questions
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.
  • MapReduce 2.0 or YARN Architecture:
    • The MapReduce framework also follows a Master/Slave topology, where the master node (ResourceManager) manages and tracks the various MapReduce jobs being executed on the slave nodes (NodeManagers).
    • Resource Manager consists of two main components:
      • Application Manager: It accepts job-submissions, negotiates the container for ApplicationMaster and handles failures while executing MapReduce jobs.
      • Scheduler: Scheduler allocates resources that are required by various MapReduce applications running on the Hadoop cluster.
  • How MapReduce job works:
    • As the name MapReduce suggests, reducer phase takes place after mapper phase has been completed.
    • So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
    • The reducer receives the key-value pair from multiple map jobs.
    • Then, the reducer aggregates those intermediate data tuples (intermediate key-value pair) into a smaller set of tuples or key-value pairs which is the final output.
“A mind troubled by doubt cannot focus on the course to victory.”
                                                                 
                                                                  – Arthur Golden

1. What do you mean by data locality?
Data locality means moving the computation to the data rather than the data to the computation. The MapReduce framework achieves data locality by processing data locally, i.e. the processing happens on the very node, managed by its NodeManager, where the data blocks are present.
2. Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as ‘text’.
3. Can we rename the output file?
Yes, we can rename the output file by implementing multiple format output class.
4. What do you mean by shuffling and sorting in MapReduce?
Shuffling and sorting take place after the completion of the map tasks, so that the input to every reducer is sorted by key. Basically, the process by which the system sorts the key-value output of the map tasks and transfers it to the reducers is called shuffle and sort.
5. Explain the process of spilling in MapReduce?
The output of a map task is written into a circular memory buffer (RAM). The default size of the buffer is set to 100 MB, which can be tuned by using the mapreduce.task.io.sort.mb property. Now, spilling is the process of copying the data from the memory buffer to disk when the content of the buffer reaches a certain threshold size. By default, a background thread starts spilling the contents from memory to disk after 80% of the buffer size is filled. Therefore, for a 100 MB buffer the spilling will start after the content of the buffer reaches a size of 80 MB.
Note: One can change this spilling threshold using mapreduce.map.sort.spill.percent which is set to 0.8 or 80% by default.
6. What is a distributed cache in MapReduce Framework?
The Distributed Cache can be explained as a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework will make it available on every data node where your map/reduce tasks are running. Therefore, one can access the cached file as a local file in your Mapper or Reducer job.
7. What is a combiner and where you should use it?
Combiner is like a mini reducer function that allows us to perform a local aggregation of map output before it is transferred to reducer phase. Basically, it is used to optimize the network bandwidth usage during a MapReduce task by cutting down the amount of data that is transferred from a mapper to the reducer.
8. Why are the outputs of map tasks stored (spilled) to the local disk and not to HDFS?
The outputs of map tasks are the intermediate key-value pairs which are then processed by the reducers to produce the final aggregated result. Once a MapReduce job is completed, there is no need for the intermediate output produced by map tasks. Therefore, storing this intermediate output in HDFS and replicating it would create unnecessary overhead.
9. What happens when the node running the map task fails before the map output has been sent to the reducer?
In this case, the map task will be assigned to a new node and the whole task will be run again to re-create the map output.
10. Define Speculative Execution
If a node appears to be executing a task slower than expected, the master node can redundantly execute another instance of the same task on another node. Then, the task which finishes first will be accepted whereas other task will be killed. This process is called speculative execution.
11. What is the role of a MapReduce Partitioner?
A partitioner divides the intermediate key-value pairs produced by map tasks into partitions. The total number of partitions is equal to the number of reducers, where each partition is processed by the corresponding reducer. The partitioning is done using a hash function based on a single key or a group of keys. The default partitioner available in Hadoop is the HashPartitioner.
12. How can we assure that the values regarding a particular key goes to the same reducer?
By using a custom partitioner, we can ensure that all the values for a particular key go to the same reducer for processing.
13. What is the difference between Input Split and HDFS block?
HDFS block defines how the data is physically divided in HDFS whereas input split defines the logical boundary of the records required for processing it.
14. What do you mean by InputFormat?
InputFormat describes the input specification for a MapReduce job. The MapReduce framework relies on the InputFormat of the job to:
  • Validate the input-specification of the job.
  • Split-up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
  • Provide the RecordReader implementation used to read records from the logical InputSplit for processing by the Mapper.
15. What is the purpose of TextInputFormat?
TextInputFormat is the default input format present in the MapReduce framework. In TextInputFormat, an input file is produced as keys of type LongWritable (byte offset of the beginning of the line in the file) and values of type Text (content of the line).
16. What is the role of RecordReader in Hadoop MapReduce?
InputSplit defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “Input Format”.
17. What are the various configuration parameters required to run a MapReduce job?
The main configuration parameters which users need to specify in “MapReduce” framework are:
  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format of data
  • Output format of data
  • Class containing the map function
  • Class containing the reduce function
  • JAR file containing the mapper, reducer and driver classes
18. When should you use SequenceFileInputFormat?
SequenceFileInputFormat is an input format for reading sequence files. It is a specific compressed binary file format which is optimized for passing data between the output of one MapReduce job and the input of some other MapReduce job.
Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.
19. What is an identity Mapper and Identity Reducer?
Identity Mapper is the default mapper provided by the Hadoop framework. It runs when no mapper class has been defined in the MapReduce program, and it simply passes the input key-value pairs on to the reducer phase.
Like the Identity Mapper, the Identity Reducer is also the default reducer class provided by Hadoop, which is automatically executed if no reducer class has been defined. It performs no computation or processing; rather, it simply writes the input key-value pairs into the specified output directory.
20. What is a map side join?
Map side join is a process where two data sets are joined by the mapper.
21. What are the advantages of using map side join in MapReduce?
The advantages of using map side join in MapReduce are as follows:
  • Map-side join helps in minimizing the cost that is incurred for sorting and merging in the shuffle and reduce stages.
  • Map-side join also helps in improving the performance of the task by decreasing the time to finish the task.
22. What is reduce side join in MapReduce?
As the name suggests, in the reduce side join, the reducer is responsible for performing the join operation. It is comparatively simple and easier to implement than the map side join as the sorting and shuffling phase sends the values having identical keys to the same reducer and therefore, by default, the data is organized for us.
Tip: I would suggest you to go through a dedicated blog on reduce side join in MapReduce where the whole process of reduce side join is explained in detail with an example.
23. What do you know about NLineInputFormat?
NLineInputFormat splits ‘n’ lines of input as one split.
24. Is it legal to set the number of reducer task to zero? Where the output will be stored in this case?
Yes, it is legal to set the number of reduce tasks to zero if there is no need for a reducer. In this case, the outputs of the map tasks are directly stored in HDFS, in the output directory specified via setOutputPath(Path).
25. Is it necessary to write a MapReduce job in Java?
No, the MapReduce framework is not restricted to Java; using Hadoop Streaming, MapReduce jobs can also be written in other languages like Python, Ruby, etc.
26. How do you stop a running job gracefully?
One can gracefully stop a MapReduce job by using the command: hadoop job -kill JOBID
27. How will you submit extra files or data ( like jars, static files, etc. ) for a MapReduce job during runtime?
The distributed cache is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL on to the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application.
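If the driver uses Hadoop's ToolRunner/GenericOptionsParser, such files can also be shipped from the command line; a hedged sketch (the jar, class and file names are hypothetical):
hadoop jar analytics.jar com.example.MyDriver -files /local/lookup.txt -libjars /local/extra.jar /input /output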
28. How does an InputSplit in MapReduce determine the record boundaries correctly?
RecordReader is responsible for providing the information regarding record boundaries in an input split. 
29. How do reducers communicate with each other?
This is a tricky question. The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.

I hope you find this blog on Hadoop MapReduce Interview Questions to be informative and helpful. You are welcome to mention your doubts and feedback in the comment section given below. In this blog, I have covered the interview questions for MapReduce only. To save your time in visiting several sites for interview questions related to each Hadoop component, we have prepared a series of interview question blogs that covers all the components present in Hadoop framework. Kindly, refer to the links given below to explore all the Hadoop related interview question and strengthen your fundamentals:

Apache Pig Interview Questions
Looking out for Apache Pig Interview Questions that are frequently asked by employers? Here is the fifth blog of Hadoop Interview Questions series, which covers Apache PIG interview questions. I hope you must not have missed the earlier blogs of our Hadoop Interview Question series.
After going through the Pig interview questions, you will get an in-depth knowledge of questions that are frequently asked by employers in Hadoop interviews.  
In case you have attended Pig interviews previously, we encourage you to add your questions in the comments tab. We will be happy to answer them, and spread the word to the community of fellow job seekers.
Important points to remember about Apache Pig:
Apache Pig is a platform used to analyze large data sets by representing them as data flows. It is designed to provide an abstraction over MapReduce, reducing the complexity of writing a MapReduce task using Java programming. We can perform data manipulation operations very easily in Hadoop using Apache Pig.
Apache Pig has two main components – the Pig Latin language and the Pig Run-time Environment, in which Pig Latin programs are executed.
Apache Pig follows ETL (Extract Transform Load) process. It can handle inconsistent schema (in case of unstructured data).
Apache Pig automatically optimizes the tasks before execution, i.e. automatic optimization. Apache Pig handles all kinds of data.
Pig allows programmers to write custom functions for operations that are unavailable in Pig's built-in operators. User Defined Functions (UDFs) can be written in different languages like Java, Python, Ruby, etc. and embedded in a Pig script.
Pig Latin provides various built-in operators like join, sort, filter, etc. to read, write, and process large data sets.
Tip: Before going through this Apache Pig interview questions, I would suggest you to go through Apache Pig Tutorial to revise your Pig concepts.
Now moving on, let us look at the Apache Pig interview questions.
1. Highlight the key differences between MapReduce and Apache Pig.
Tip: In this question, you should explain what were the problems with MapReduce which led to the development of Apache Pig by Yahoo.
The following are the key differences between Apache Pig and MapReduce due to which Apache Pig came into picture:
  • Apache Pig is a high-level data flow platform, whereas MapReduce is a low-level data processing paradigm.
  • Without writing complex Java implementations in MapReduce, programmers can achieve the same implementations very easily using Pig Latin.
  • Apache Pig provides nested data types like tuples, bags, and maps that are missing from MapReduce.
  • Pig provides many built-in operators to support data operations like joins, filters, ordering, sorting etc. Whereas to perform the same function in MapReduce is a humongous task.
2. What are the use cases of Apache Pig?
Apache Pig is used for analyzing and performing tasks involving ad-hoc processing. Apache Pig is used for:
  • Research on large raw data sets like data processing for search platforms. For example, Yahoo uses Apache Pig to analyse data gathered from Yahoo search engines and Yahoo News Feeds. 
  • Processing huge data sets like Web logs, streaming online data, etc.
  • In customer behavior prediction models like e-commerce websites.
3. What is the difference between logical and physical plans?
Tip: Approach this question by explaining when the logical and physical plans are created.
Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs by the compiler. Logical and Physical plans are created during the execution of a pig script.
After performing the basic parsing and semantic checking, the parser produces a logical plan and no data processing takes place during the creation of a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. For each line in the Pig script, syntax check is performed for operators and a logical plan is created. If an error is encountered, an exception is thrown and the program execution ends.
A logical plan contains a collection of operators in the script, but does not contain the edges between the operators.
After the logical plan is generated, the script execution moves to the physical plan where there is a description about the physical operators, Apache Pig will use, to execute the Pig script. A physical plan is like a series of MapReduce jobs, but the physical plan does not have any reference on how it will be executed in MapReduce.
4. How Pig programming gets converted into MapReduce jobs?
Pig is a high-level platform that makes many Hadoop data analysis issues easier to express. Pig Latin is a data flow language, and a program written in it needs an execution engine to run the query. So, when a program is written in Pig Latin, the Pig compiler converts the program into MapReduce jobs.
5. What are the components of Pig Execution Environment?
The components of Apache Pig Execution Environment are:
  • Pig Scripts: Pig scripts are submitted to the Apache Pig execution environment which can be written in Pig Latin using built-in operators and UDFs can be embedded in it.
  • Parser: The Parser does the type checking and checks the syntax of the script. The parser outputs a DAG (directed acyclic graph). DAG represents the Pig Latin statements and logical operators.
  • Optimizer: The Optimizer performs the optimization activities like split, merge, transform, reorder operators, etc. The optimizer provides the automatic optimization feature to Apache Pig. The optimizer basically aims to reduce the amount of data in the pipeline.
  • Compiler: The Apache Pig compiler converts the optimized code into MapReduce jobs automatically.
  • Execution Engine: Finally, the MapReduce jobs are submitted to the execution engine. Then, the MapReduce jobs are executed and the required result is produced.
6. What are the different ways of executing Pig script?
There are three ways to execute the Pig script:
  • Grunt Shell: This is Pig’s interactive shell provided to execute all Pig Scripts.
  • Script File: Write all the Pig commands in a script file and execute the Pig script file. This is executed by the Pig Server.
  • Embedded Script: If some functions are unavailable in built-in operators, we can programmatically create User Defined Functions (UDF) to bring that functionality using other languages like Java, Python, Ruby, etc. and embed it in the Pig Latin Script file. Then, execute that script file.
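For illustration (the script name is hypothetical):
pig -x local                    # open the Grunt shell in local mode
pig -x mapreduce                # open the Grunt shell in MapReduce mode
pig -x mapreduce myscript.pig   # run a Pig script file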
7. What are the data types of Pig Latin?
Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.
Atomic or scalar data types are the basic data types which are used in all the languages like string, int, float, long, double, char[], byte[]. These are also called the primitive data types.
The complex data types supported by Pig Latin are:
  • Tuple: Tuple is an ordered set of fields which may contain different data types for each field.
  • Bag: A bag is a collection of a set of tuples and these tuples are a subset of rows or entire rows of a table.
  • Map: A map is a set of key-value pairs used to represent data elements. The key must be of type chararray and should be unique (like a column name), so that it can be indexed and the value associated with it can be accessed on the basis of the key. The value can be of any data type.
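For illustration, the three complex types can be written as follows (the values are hypothetical):
(John, 25)                          -- tuple
{(John, 25), (Jane, 30)}            -- bag: unordered collection of tuples
[name#John, city#Bangalore]         -- map: key#value pairs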
Tip: Complex Data Types of Pig Latin are very important to understand, so you can go through Apache Pig Tutorial blog and understand them in-depth.
8. What is a bag in Pig Latin?
A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags are used to store collections of tuples while grouping. The size of a bag is limited only by the size of the local disk: when a bag does not fit in memory, Pig spills it to the local disk and keeps only part of the bag in memory, so there is no necessity for the complete bag to fit into memory. We represent bags with “{}”.
Tip:You can also explain the two types of bag in Pig Latin i.e. outer bag and inner bag, which may impress your employers.
9. What do you understand by an inner bag and outer bag in Pig?
Outer bag or relation is nothing but a bag of tuples. Here relations are similar as relations in relational databases. For example:
{(Linkin Park, California), (Metallica, Los Angeles), (Mega Death, Los Angeles)}
An inner bag contains a bag inside a tuple. For Example:
(Los Angeles, {(Metallica, Los Angeles), (Mega Death, Los Angeles)})
(California, {(Linkin Park, California)})
10. How Apache Pig deals with the schema and schema-less data?
Tip: Apache Pig deals with both schema and schema-less data. Thus, this is an important question to focus on.
The Apache Pig handles both, schema as well as schema-less data.
  • If the schema only includes the field name, the data type of field is considered as a byte array.
  • If you assign a name to a field, you can access the field by both the field name and the positional notation, whereas if the field name is missing we can only access it by the positional notation, i.e. $ followed by the index number.
  • If you perform any operation which is a combination of relations (like JOIN, COGROUP, etc.) and if any of the relation is missing schema, the resulting relation will have null schema.
  • If the schema is null, Pig will treat the field as a byte array and the real data type of the field will be determined dynamically (see the sketch below).
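A small Pig Latin sketch of the two access styles (the file path and field names are hypothetical):
-- with a declared schema: access fields by name or by position
A = LOAD '/data/users.txt' USING PigStorage(',') AS (name:chararray, age:int);
B = FOREACH A GENERATE name, age;
-- without a schema: only positional notation is available
C = LOAD '/data/users.txt' USING PigStorage(',');
D = FOREACH C GENERATE $0, $1;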
11. How do users interact with the shell in Apache Pig?
Using Grunt i.e. Apache Pig’s interactive shell, users can interact with HDFS or the local file system.
To start Grunt, users can use the pig -x local command (for local mode) or simply pig (for MapReduce mode); this will open the Grunt shell prompt. To exit from the Grunt shell, press CTRL+D or just type quit.
12. What is UDF?
If some functions are unavailable in built-in operators, we can programmatically create User Defined Functions (UDF) to bring that functionality using other languages like Java, Python, Ruby, etc. and embed it in the Pig Latin Script file.
Tip: To understand how to create and work with UDF, go through this blog  creating UDF in Apache Pig.
Tip: Important points about UDF to focus on:
  • LoadFunc abstract class has three main methods for loading data and for most use cases it would suffice to extend it.
  • LoadPush has methods to push operations from Pig runtime into loader implementations.
  • setUdfContextSignature() method will be called by Pig both in the front end and back end to pass a unique signature to the Loader.
  • The load/store UDFs control how data goes into Pig and comes out of Pig.
  • getNext() is called by the Pig runtime to get the next tuple in the data.
  • The loader should use setLocation() method to communicate the load information to the underlying InputFormat.
  • Through the prepareToRead() method, the RecordReader associated with the InputFormat provided by the LoadFunc is passed to the LoadFunc. The RecordReader can then be used by the getNext() implementation to return a tuple representing a record of data back to Pig.
  • pushProjection() method tells LoadFunc which fields are required in the Pig script. Pig will use the column index requiredField.index to communicate with the LoadFunc about the fields required by the Pig script.
  • LoadCaster has methods to convert byte arrays to specific types.
  • A loader implementation should provide a LoadCaster if casts (implicit or explicit) from DataByteArray fields to other types need to be supported.
13. List the diagnostic operators in Pig.
Pig supports a number of diagnostic operators that you can use to debug Pig scripts.
  • DUMP: Displays the contents of a relation to the screen.
  • DESCRIBE: Return the schema of a relation.
  • EXPLAIN: Display the logical, physical, and MapReduce execution plans.
  • ILLUSTRATE: Gives the step-by-step execution of a sequence of statements.
Tip: Go through this blog on diagnostic operators, to understand them and see their implementations.
14. Does ‘ILLUSTRATE’ run a MapReduce job?
No, ILLUSTRATE does not run any MapReduce job; it works on an internal sample of the data. It just shows the output of each stage and not the final output.
ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE command is your best friend when it comes to debugging a script. This command alone might be a good reason for choosing Pig over something else.
Syntax: illustrate relation_name;
15. What does illustrate do in Apache Pig?
Executing Pig scripts on large data sets usually takes a long time. To tackle this, developers run Pig scripts on sample data, but there is a possibility that the sample data selected might not exercise your Pig script properly. For instance, if the script has a join operator, there should be at least a few records in the sample data that have the same key, otherwise the join operation will not return any results.
To tackle these kinds of issues, ILLUSTRATE is used. ILLUSTRATE takes a sample of the data and, whenever it comes across operators like JOIN or FILTER that remove data, it ensures that some records pass through and some do not, by modifying the records so that they meet (or fail) the condition. ILLUSTRATE just shows the output of each stage but does not run any MapReduce task.

16. List the relational operators in Pig.
All Pig Latin statements operate on relations (and operators are called relational operators). Different relational operators in Pig Latin are:
  • COGROUP: Joins two or more tables and then perform GROUP operation on the joined table result.
  • CROSS: CROSS operator is used to compute the cross product (Cartesian product) of two or more relations.
  • DISTINCT: Removes duplicate tuples in a relation.
  • FILTER: Select a set of tuples from a relation based on a condition.
  • FOREACH: Iterate the tuples of a relation, generating a data transformation.
  • GROUP: Group the data in one or more relations.
  • JOIN: Join two or more relations (inner or outer join).
  • LIMIT: Limit the number of output tuples.
  • LOAD: Load data from the file system.
  • ORDER: Sort a relation based on one or more fields.
  • SPLIT: Partition a relation into two or more relations.
  • STORE: Store data in the file system.
  • UNION: Merge the content of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.
Tip: Go through this blog on relational operators, to understand them and see their implementations.
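A minimal Pig Latin sketch combining a few of these operators (the paths and field names are hypothetical):
orders  = LOAD '/data/orders.csv' USING PigStorage(',') AS (id:int, country:chararray, amount:double);
big     = FILTER orders BY amount > 1000.0;
grouped = GROUP big BY country;
totals  = FOREACH grouped GENERATE group AS country, SUM(big.amount) AS total;
sorted  = ORDER totals BY total DESC;
STORE sorted INTO '/output/totals' USING PigStorage(',');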
17. Is the keyword ‘DEFINE’ like a function name?
Yes, the keyword ‘DEFINE’ is like a function name.
DEFINE statement is used to assign a name (alias) to a UDF function or to a streaming command.
  • The function has a long package name that you don’t want to include in a script, especially if you call the function several times in that script. The constructor for the function takes string parameters. If you need to use different constructor parameters for different calls to the function you will need to create multiple defines – one for each parameter set.
  • The streaming command specification is complex. The streaming command specification requires additional parameters (input, output, and so on). So, assigning an alias makes it easier to access.
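A hedged example of both uses of DEFINE (the package, script and relation names are hypothetical):
DEFINE toUpper com.example.pig.udf.ToUpper();
DEFINE cleanse `perl cleanse.pl` SHIP('cleanse.pl');
result = FOREACH data GENERATE toUpper(name);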
18. What is the function of co-group in Pig?
COGROUP takes members of different relations, binds them by similar fields, and creates a bag that contains a single instance of both relations where those relations have common fields. Co-group operation joins the data set by grouping one particular data set only.
It groups the elements by their common field and then returns a set of records containing two separate bags. The first bag consists of the first data set record with the common data set and the second bag consists of the second data set records with the common data set.
19. Can we say co-group is a group of more than 1 data set?
Co-group is a group of data sets. More than one data set, co-group will group all the data sets and join them based on the common field. Hence, we can say that co-group is a group of more than one data set and join of that data set as well.
20. The difference between GROUP and COGROUP operators in Pig?
GROUP and COGROUP operators are functionally identical. For readability, GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations. The GROUP operator collects all records with the same key. COGROUP is a combination of group and join; it is a generalization of GROUP: instead of collecting records of one input based on a key, it collects records of n inputs based on a key (see the sketch below). At a time, we can COGROUP up to 127 relations.
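A brief COGROUP sketch on two relations (the files and fields are hypothetical):
bands  = LOAD '/data/bands.txt'  USING PigStorage(',') AS (band:chararray, city:chararray);
venues = LOAD '/data/venues.txt' USING PigStorage(',') AS (venue:chararray, city:chararray);
both   = COGROUP bands BY city, venues BY city;
-- each output tuple: (city, {bag of matching bands}, {bag of matching venues})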
21. You have a file personal_data.txt in the HDFS directory with 100 records. You want to see only the first 5 records of this file. How will you do this?
For getting only 5 records from 100 records we use limit operator.
First load the data in Pig:
personal_data = LOAD '/personal_data.txt' USING PigStorage(',') AS (parameter1, parameter2, ...);
Then Limit the data to 5 records:
limit_data = LIMIT personal_data 5;
22. What is a MapFile?
MapFile is a class that provides a file-based map from keys to values.
A map is a directory containing two files, the data file, containing all keys and values in the map, and a smaller index file, containing a fraction of the keys. The fraction is determined by MapFile.Writer.getIndexInterval().
The index file is read entirely into memory. Thus, key implementations should try to keep themselves small. Map files are created by adding entries in-order.
23. What is BloomMapFile used for?
The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to provide a quick membership test for keys. It is used in the HBase table format.
24. What are the different execution modes available in Pig? 
The execution modes in Apache Pig are:
  • MapReduce Mode: This is the default mode, which requires access to a Hadoop cluster and HDFS installation. Since, this is a default mode, it is not necessary to specify -x flag (you can execute pig OR pig -x mapreduce). The input and output in this mode are present on HDFS.
  • Local Mode: With access to a single machine, all files are installed and run using a local host and file system. Here the local mode is specified using ‘-x flag’ (pig -x local). The input and output in this mode are present on local file system.
25. Is Pig script case sensitive?
Tip: Explain the both aspects of Apache Pig i.e. case-sensitive as well as case-insensitive aspect.
A Pig script is both case sensitive and case insensitive, depending on the element.
User defined functions, field names, and relation names are case sensitive, i.e. EMPLOYEE is not the same as employee, and M = LOAD 'data' is not the same as M = LOAD 'Data'.
Pig Latin keywords, on the other hand, are case insensitive, i.e. LOAD is the same as load.
26. What does Flatten do in Pig?
Sometimes there is nested data in a tuple or a bag, and if we want to remove that level of nesting, the FLATTEN operator in Pig can be used. FLATTEN un-nests bags and tuples. For tuples, FLATTEN substitutes the fields of the tuple in place of the tuple, whereas un-nesting bags is a little more complex because it requires creating new tuples.
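For example, flattening the bag produced by a GROUP statement (the relation and field names are hypothetical):
grouped   = GROUP songs BY artist;
flattened = FOREACH grouped GENERATE group AS artist, FLATTEN(songs.title);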
27. What is Pig Statistics? What are all stats classes in the Java API package available?
Pig Statistics is a framework for collecting and storing script-level statistics for Pig Latin. Characteristics of Pig Latin scripts and the resulting MapReduce jobs are collected while the script is executed. These statistics are then available for Pig users and tools using Pig (such as Oozie) to retrieve after the job is completed.
The stats classes are in the package org.apache.pig.tools.pigstats:
  • PigStats
  • JobStats
  • OutputStats
  • InputStats.
28. What are the limitations of the Pig?
Limitations of the Apache Pig are:
  1. As the Pig platform is designed for ETL-type use cases, it is not a good choice for real-time scenarios.
  2. Apache Pig is not a good choice for pinpointing a single record in huge data sets.
  3. Apache Pig is built on top of MapReduce, which is batch processing oriented.
Hadoop Interview Questions – Apache Hive

Apache Hive – A Brief Introduction:

Apache Hive is a data warehouse system built on top of Hadoop and is used for analyzing structured and semi-structured data. It provides a mechanism to project structure onto the data and perform queries written in HQL (Hive Query Language), which are similar to SQL statements. Internally, these HQL queries get converted into MapReduce jobs by the Hive compiler.

Apache Hive Job Trends:

Today, many companies consider Apache Hive as a de facto standard for performing analytics on large data sets. Also, since it supports SQL-like query statements, it is very popular among people who are from a non-programming background and wish to take advantage of the Hadoop MapReduce framework.
Now, let us have a look at the rising Apache Hive job trends over the past few years:

[Image: graph showing the rising trend of Apache Hive job postings]

The above image clearly shows the vast demand for Apache Hive professionals in the industry. Therefore, it is high time to prepare yourself and seize this very opportunity. 
I would suggest you to go through a dedicated blog on Apache Hive Tutorial to revise your concepts before proceeding in this Apache Hive Interview Questions blog.

 

Apache Hive Interview Questions

Here is the comprehensive list of the most frequently asked Apache Hive Interview Questions that have been framed after deep research and discussion with the industry experts.  

1. What kind of applications is supported by Apache Hive?

Hive supports all those client applications that are written in Java, PHP, Python, C++ or Ruby by exposing its Thrift server.

2. Define the difference between Hive and HBase?

The key differences between Apache Hive and HBase are as follows:
  • Hive is a data warehousing infrastructure, whereas HBase is a NoSQL database on top of Hadoop.
  • Apache Hive queries are executed as MapReduce jobs internally, whereas HBase operations run in real time on its database rather than as MapReduce jobs.
3. Where does the data of a Hive table get stored?
By default, the Hive table is stored in an HDFS directory – /user/hive/warehouse. One can change it by specifying the desired directory in hive.metastore.warehouse.dir configuration parameter present in the hive-site.xml. 
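A sketch of overriding it in hive-site.xml (the path is hypothetical):
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/my_warehouse</value>
</property>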

4. What is a metastore in Hive?

The metastore in Hive stores the metadata information using an RDBMS and an open source ORM (Object Relational Mapping) layer called DataNucleus, which converts the object representation into a relational schema and vice versa.

5. Why Hive does not store metadata information in HDFS?

Hive stores metadata information in the metastore using RDBMS instead of HDFS. The reason for choosing RDBMS is to achieve low latency as HDFS read/write operations are time consuming processes.

6. What is the difference between local and remote metastore?

Local Metastore:
In local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore:
In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server using Thrift Network APIs. You can have one or more metastore servers in this case to provide more availability.

7. What is the default database provided by Apache Hive for metastore?

By default, Hive provides an embedded Derby database instance backed by the local disk for the metastore. This is called the embedded metastore configuration.

8. Scenario: 

Suppose I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time? 
The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. a local or remote metastore configuration, in Apache Hive to allow multiple clients to access it concurrently.
Following are the steps to configure MySQL database as the local metastore in Apache Hive:
  • One should make the following changes in hive-site.xml:
    • javax.jdo.option.ConnectionURL property should be set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true.
    • javax.jdo.option.ConnectionDriverName property should be set to com.mysql.jdbc.Driver.
    • One should also set the username and password as:
      • javax.jdo.option.ConnectionUserName is set to desired username.
      • javax.jdo.option.ConnectionPassword is set to the desired password.
  • The JDBC driver JAR file for MySQL must be on the Hive’s classpath, i.e. The jar file should be copied into the Hive’s lib directory.
  • Now, after restarting the Hive shell, it will automatically connect to the MySQL database which is running as a standalone metastore.
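Putting the above properties together, the relevant part of hive-site.xml would look roughly like this (the host, database name and credentials are hypothetical):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>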

9. What is the difference between external table and managed table?

Here is the key difference between an external table and managed table:
  • In the case of a managed table, if one drops the table, the metadata information along with the table data is deleted from the Hive warehouse directory.
  • On the contrary, in case of an external table, Hive just deletes the metadata information regarding the table and leaves the table data present in HDFS untouched. 
Note: I would suggest you to go through the blog on Hive Tutorial to learn more about Managed Table and External Table in Hive.

10. Is it possible to change the default location of a managed table? 

Yes, it is possible to change the default location of a managed table. It can be achieved while creating the table by using the clause LOCATION ‘<hdfs_path>’, as in the sketch below.
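For example (the table name and path are hypothetical):
CREATE TABLE managed_sales (id INT, amount FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/custom_warehouse/managed_sales';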

11. When should we use SORT BY instead of ORDER BY ?

We should use SORT BY instead of ORDER BY when we have to sort huge datasets because SORT BY clause sorts the data using multiple reducers whereas ORDER BY sorts all of the data together using a single reducer. Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute. 
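For illustration (the table and column names are hypothetical):
SELECT * FROM transactions SORT BY amount DESC;    -- sorted within each reducer
SELECT * FROM transactions ORDER BY amount DESC;   -- total ordering, single reducer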

12. What is a partition in Hive?

Hive organizes tables into partitions for grouping similar type of data together based on a column or partition key. Each Table can have one or more partition keys to identify a particular partition. Physically, a partition is nothing but a sub-directory in the table directory.

13. Why do we perform partitioning in Hive?

Partitioning provides granularity in a Hive table and therefore, reduces the query latency by scanning only relevant partitioned data instead of the whole data set.
For example, we can partition a transaction log of an e-commerce website based on month, like Jan, Feb, etc. So, any analytics regarding a particular month, say Jan, will have to scan only the Jan partition (sub-directory) instead of the whole table data.

14. What is dynamic partitioning and when is it used?

In dynamic partitioning, the values for the partition columns are known only at runtime, i.e. they are determined while the data is being loaded into the Hive table.
One may use dynamic partition in following two cases:
  • Loading data from an existing non-partitioned table to improve the sampling and therefore, decrease the query latency. 
  • When one does not know all the values of the partitions beforehand, and therefore finding these partition values manually from a huge data set is a tedious task. 

15. Scenario:

Suppose, I create a table that contains details of all the transactions done by the customers of year 2016: CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ;
Now, after inserting 50,000 tuples in this table, I want to know the total revenue generated for each month. But, Hive is taking too much time to process this query. How will you solve this problem? List the steps that you would take in order to do so.
We can solve this problem of query latency by partitioning the table according to each month. So, for each month we will be scanning only the partitioned data instead of whole data sets.
As we know, we can’t partition an existing non-partitioned table directly. So, we will be taking following steps to solve the very problem: 
  1. Create a partitioned table, say partitioned_transaction:
CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ ; 
2. Enable dynamic partitioning in Hive:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
3. Transfer the data from the non – partitioned table into the newly created partitioned table:
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;
Now, we can perform the query using each partition and therefore, decrease the query time. 

16. How can you add a new partition for the month December in the above partitioned table?

For adding a new partition to the above table partitioned_transaction, we will issue the command given below:
ALTER TABLE partitioned_transaction ADD PARTITION (month=’Dec’) LOCATION  ‘/partitioned_transaction’;
Note: I suggest you to go through the dedicated blog on Hive Commands where all the commands present in Apache Hive have been explained with an example. 

17. What is the default maximum dynamic partition that can be created by a mapper/reducer? How can you change it?

By default, the maximum number of partitions that can be created by a mapper or reducer is set to 100. One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode = <value> 
Note: You can set the total number of dynamic partitions that can be created by one statement by using: SET hive.exec.max.dynamic.partitions = <value>

18. Scenario: 

I am inserting data into a table based on partitions dynamically. But, I received an error – FAILED ERROR IN SEMANTIC ANALYSIS: Dynamic partition strict mode requires at least one static partition column. How will you remove this error?
To remove this error one has to execute following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Things to Remember:
  • By default, the hive.exec.dynamic.partition configuration property is set to false in Hive versions prior to 0.9.0. 
  • hive.exec.dynamic.partition.mode is set to strict by default. Only in non – strict mode Hive allows all partitions to be dynamic.

19. Why do we need buckets?

There are two main reasons for bucketing a partition:
  • A map-side join requires the data belonging to a unique join key to be present in the same partition. But what about those cases where your partition key differs from the join key? In these cases, you can still perform a map-side join by bucketing the table using the join key (see the sketch after this list).
  • Bucketing makes the sampling process more efficient and therefore, allows us to decrease the query time.
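As an illustration, a bucketed (and partitioned) version of the earlier transaction table could be declared as follows; the table name and the choice of cust_id as the bucketing column are purely illustrative:
CREATE TABLE bucketed_transaction (cust_id INT, amount FLOAT, country STRING)
PARTITIONED BY (month STRING)
CLUSTERED BY (cust_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';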

20. How Hive distributes the rows into buckets?

Hive determines the bucket number for a row by using the formula: hash_function (bucketing_column) modulo (num_of_buckets). Here, hash_function depends on the column data type. For integer data type, the hash_function will be: 
hash_function (int_type_column) = value of int_type_column
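For example (a purely illustrative calculation): if a table is clustered into 4 buckets on an INT column, a row whose bucketing column holds the value 13 lands in bucket 13 mod 4 = 1, while a row holding 8 lands in bucket 8 mod 4 = 0.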

21. What will happen in case you have not issued the command:  ‘SET hive.enforce.bucketing=true;’ before bucketing a table in Hive in Apache Hive 0.x or 1.x?

The command ‘SET hive.enforce.bucketing=true;’ allows one to have the correct number of reducers while using the ‘CLUSTER BY’ clause for bucketing a column. If this is not done, one may find that the number of files generated in the table directory is not equal to the number of buckets. As an alternative, one may also set the number of reducers equal to the number of buckets by using set mapred.reduce.tasks = <number_of_buckets>.

22. What is indexing and why do we need it?

Hive indexing is one of the Hive query optimization techniques. A Hive index is used to speed up access to a column or set of columns in a Hive table, because with an index the database system does not need to read all the rows of the table to find the data that has been selected.
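A minimal sketch (the index name is illustrative) of creating and building a compact index on the month column of the earlier partitioned_transaction table:
CREATE INDEX idx_month ON TABLE partitioned_transaction (month)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX idx_month ON partitioned_transaction REBUILD;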

23. Scenario:

Suppose, I have a CSV file – ‘sample.csv’ present in ‘/temp’ directory with the following entries:
id first_name last_name email gender ip_address
1 Hugh Jackman hughjackman@cam.ac.uk Male 136.90.241.52
2 David Lawrence dlawrence1@gmail.com Male 101.177.15.130
3 Andy Hall andyhall2@yahoo.com Female 114.123.153.64
4 Samuel Jackson samjackson231@sun.com Male 89.60.227.31
5 Emily Rose rose.emily4@surveymonkey.com Female 119.92.21.19
How will you consume this CSV file into the Hive warehouse using a built-in SerDe?
SerDe stands for serializer/deserializer. A SerDe allows us to convert unstructured bytes into a record that we can process using Hive. SerDes are implemented using Java. Hive comes with several built-in SerDes, and many third-party SerDes are also available.
Hive provides a specific SerDe for working with CSV files. We can use this SerDe for sample.csv by issuing the following commands:
CREATE EXTERNAL TABLE sample
(id INT, first_name STRING, last_name STRING, email STRING, gender STRING, ip_address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE LOCATION '/temp';
Now, we can perform any query on the table ‘sample’:
SELECT first_name FROM sample WHERE gender = 'Male';

24. Scenario:

Suppose, I have a lot of small CSV files present in /input directory in HDFS and I want to create a single Hive table corresponding to these files. The data in these files are in the format: {id, name, e-mail, country}. Now, as we know, Hadoop performance degrades when we use lots of small files.
So, how will you solve this problem where we want to create a single Hive table for lots of small files without degrading the performance of the system?
One can use the SequenceFile format which will group these small files together to form a single sequence file. The steps that will be followed in doing so are as follows:
  • Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  • Load the data into temp_table:
LOAD DATA INPATH ‘/input’ INTO TABLE temp_table;
  • Create a table that will store data in SequenceFile format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
  • Transfer the data from the temporary table into the sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;
Hence, a single SequenceFile is generated which contains the data present in all of the input files and therefore, the problem of having lots of small files is finally eliminated. 

Apache HBase Interview Questions

Important points to remember about Apache HBase:

  • Apache HBase is a NoSQL column-oriented database which is used to store sparse data sets. It runs on top of the Hadoop Distributed File System (HDFS) and it can store any kind of data.
  • Clients can access HBase data through either a native Java API, or through a Thrift or REST gateway, making it accessible from any language.
Tip: Before going through these Apache HBase interview questions, I would suggest you go through Apache HBase Tutorial and HBase Architecture to revise your HBase concepts.
Now moving on, let us look at the Apache HBase interview questions.

1. What are the key components of HBase?

The key components of HBase are Zookeeper, RegionServer and HBase Master. 
  • Region Server: A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
  • HMaster: It coordinates and manages the Region Servers (similar as NameNode manages DataNodes in HDFS).
  • ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.

2. When would you use HBase?

HBase is used in cases where we need random read and write operations, and it can perform a large number of operations per second on large data sets. HBase gives strong data consistency. It can handle very large tables with billions of rows and millions of columns on top of a commodity hardware cluster.

3. What is the use of get() method?

get() method is used to read the data from the table.
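As a brief, hedged illustration in Java using the classic client API (Get, Result and Bytes come from the HBase client library; the ‘users’ table, ‘personal’ column family and ‘name’ qualifier are hypothetical):
Get get = new Get(Bytes.toBytes("row1"));        // row key to read
Result result = table.get(get);                  // 'table' is an open HTable, e.g. for 'users'
byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));   // read one cell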

4. Define the difference between Hive and HBase?

Apache Hive is a data warehousing infrastructure built on top of Hadoop. It helps in querying data stored in HDFS for analysis using Hive Query Language (HQL), which is a SQL-like language, that gets translated into MapReduce jobs. Hive performs batch processing on Hadoop.
Apache HBase is a NoSQL key/value store which runs on top of HDFS. Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs. HBase partitions the tables, and the tables are further split into column families.
Hive and HBase are two different Hadoop based technologies – Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database of Hadoop. We can use them together. Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from HBase to Hive and vice-versa.

5. Explain the data model of HBase.

HBase consists of:
  • Set of tables.
  • Each table contains column families and rows.
  • Row key acts as a Primary key in HBase.
  • Any access to HBase tables uses this Primary Key.
  • Each column qualifier present in HBase denotes attributes corresponding to the object which resides in the cell.

6. Define column families?

Column Family is a collection of columns, whereas row is a collection of column families.

7. Define standalone mode in HBase?

It is a default mode of HBase. In standalone mode, HBase does not use HDFS—it uses the local filesystem instead—and it runs all HBase daemons and a local ZooKeeper in the same JVM process.

8. What is decorating Filters?

It is useful to modify, or extend, the behavior of a filter to gain additional control over the returned data. These types of filters are known as decorating filters. Examples include SkipFilter and WhileMatchFilter.

9. What is RegionServer?

A table can be divided into several regions. A group of regions is served to the clients by a Region Server.

10. What are the data manipulation commands of HBase?

Data Manipulation commands of HBase are:
  • put – Puts a cell value at a specified column in a specified row in a particular table.
  • get – Fetches the contents of a row or a cell.
  • delete – Deletes a cell value in a table.
  • deleteall – Deletes all the cells in a given row.
  • scan – Scans and returns the table data.
  • count – Counts and returns the number of rows in a table.
  • truncate – Disables, drops, and recreates a specified table.
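For example, in the HBase shell these commands could be used as follows against a hypothetical ‘users’ table with a ‘personal’ column family:
put 'users', 'row1', 'personal:name', 'John'
get 'users', 'row1'
scan 'users'
count 'users'
delete 'users', 'row1', 'personal:name'
truncate 'users'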

11. Which code is used to open a connection in HBase?

The following code is used to open an HBase connection; here users is my HBase table:
Configuration myConf = HBaseConfiguration.create();   // load the HBase configuration (hbase-site.xml)
HTable table = new HTable(myConf, "users");            // open a connection to the 'users' table
12. What is the use of truncate command?
It is used to disable, drop and recreate the specified tables.
Tip: To delete table first disable it, then delete it.

13. Explain deletion in HBase? Mention what are the three types of tombstone markers in HBase?

When you delete a cell in HBase, the data is not actually deleted; instead, a tombstone marker is set, making the deleted cells invisible. The deleted data is actually removed during a major compaction.
There are three types of tombstone markers:
  • Version delete marker: For deletion, it marks a single version of a column
  • Column delete marker: For deletion, it marks all the versions of a column
  • Family delete marker: For deletion, it marks all the columns of a column family

14. Explain how does HBase actually delete a row?

In HBase, whatever you write is first held in RAM and then flushed to disk, and these disk files are immutable barring compaction. During the deletion process in HBase, the major compaction process removes the tombstone markers, whereas minor compactions don't. A normal delete just writes a tombstone marker; the deleted data is removed later during compaction.
Also, if you delete data and then add more data, but with an earlier timestamp than the tombstone timestamp, subsequent get() calls may be masked by the delete/tombstone marker and hence you will not receive the inserted value until after the major compaction.

15. Explain what happens if you alter the block size of a column family on an already occupied database?

When you alter the block size of a column family, the new data occupies the new block size while the old data remains within the old block size. New files, as they are written, use the new block size, whereas existing data continues to be read correctly. After the next major compaction, all data is rewritten with the new block size.

16. HBase blocksize is configured on which level?

The blocksize is configured per column family and the default value is 64 KB. This value can be changed as per requirements.
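For example, a hedged sketch of setting a non-default block size on a hypothetical ‘data’ column family with the classic Java API (HColumnDescriptor):
HColumnDescriptor cf = new HColumnDescriptor("data");
cf.setBlocksize(131072);   // 128 KB instead of the 64 KB default
Alternatively, the same can be done from the HBase shell at table creation time, e.g. create 'mytable', {NAME => 'data', BLOCKSIZE => '131072'}.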

17. Which command is used to run HBase Shell?

./bin/hbase shell command is used to run the HBase shell. Execute this command in HBase directory.

18. Which command is used to show the current HBase user?

whoami command is used to show HBase user.

19. What is the full form of MSLAB?

MSLAB stands for MemStore-Local Allocation Buffer. Whenever a request thread needs to insert data into a MemStore, it doesn't allocate the space for that data from the heap at large, but rather from a memory arena dedicated to the target region.

20. Define LZO?

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that focuses on decompression speed.

21. What is HBase Fsck?

HBase comes with a tool called hbck which is implemented by the HBaseFsck class. HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems and repairing a corrupted HBase. It works in two basic modes – a read-only inconsistency identifying mode and a multi-phase read-write repair mode.

22. What is REST?

REST stands for Representational State Transfer, which defines the semantics so that the protocol can be used in a generic way to address remote resources. It also provides support for different message formats, offering many choices for a client application to communicate with the server.

23. What is Thrift?

Apache Thrift is a framework for cross-language service development. It is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more.

24. What is Nagios?

Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status. It polls current metrics on a regular basis and compares them with given thresholds.

25. What is the use of ZooKeeper?

ZooKeeper is used to maintain configuration information and to coordinate communication between region servers and clients. It also provides distributed synchronization and helps in maintaining server state inside the cluster by communicating through sessions.
Every Region Server, along with the HMaster server, sends a continuous heartbeat at regular intervals to ZooKeeper, which checks which servers are alive and available. ZooKeeper also provides server failure notifications so that recovery measures can be executed.

26. Define catalog tables in HBase?

Catalog tables are used to maintain the metadata information.

27. Define compaction in HBase?

HBase combines HFiles to reduce storage and to reduce the number of disk seeks needed for a read. This process is called compaction. Compaction chooses some HFiles from a region and combines them. There are two types of compaction:
  • Minor Compaction: HBase automatically picks some of the smaller HFiles and rewrites them into fewer, bigger HFiles.
  • Major Compaction: In a major compaction, HBase merges and rewrites all the HFiles of a region into a single new HFile.

28. What is the use of HColumnDescriptor class?

HColumnDescriptor stores the information about a column family like compression settings, number of versions etc. It is used as input when creating a table or adding a column.

29. Which filter accepts the pagesize as the parameter in hBase?

PageFilter accepts the page size as a parameter. It is an implementation of the Filter interface that limits results to a specific page size. It terminates scanning once the number of filter-passed rows exceeds the given page size.
Syntax: PageFilter (<page_size>)
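A minimal Java sketch (Scan, PageFilter and ResultScanner come from the HBase client API; the page size and the table being scanned are illustrative):
Scan scan = new Scan();
scan.setFilter(new PageFilter(10));              // limit to roughly 10 rows per region server
ResultScanner scanner = table.getScanner(scan);  // iterate over the filtered rows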

30. How will you design or modify schema in HBase programmatically?

HBase schemas can be created or updated using the Apache HBase Shell or by using Admin in the Java API.
Creating table schema:
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config); // execute commands through admin

// Instantiating table descriptor class
HTableDescriptor t1 = new HTableDescriptor(TableName.valueOf("employee"));

// Adding column families to t1
t1.addFamily(new HColumnDescriptor("professional"));
t1.addFamily(new HColumnDescriptor("personal"));

// Create the table through admin
admin.createTable(t1);
Tip: Tables must be disabled when making ColumnFamily modifications.
For modification:
String table = "myTable";
admin.disableTable(table);
admin.modifyColumn(table, cf2); // cf2 is an HColumnDescriptor describing the updated column family
admin.enableTable(table);

31. What filters are available in Apache HBase?

The filters that are supported by HBase are:
  • ColumnPrefixFilter: takes a single argument, a column prefix. It returns only those key-values present in a column that starts with the specified column prefix.
  • TimestampsFilter: takes a list of timestamps. It returns those key-values whose timestamps match any of the specified timestamps.
  • PageFilter: takes one argument, a page size. It returns page-size number of rows from the table.
  • MultipleColumnPrefixFilter: takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes.
  • ColumnPaginationFilter: takes two arguments, a limit and an offset. It returns limit number of columns after offset number of columns. It does this for all the rows.
  • SingleColumnValueFilter: takes a column family, a qualifier, a comparison operator and a comparator. If the specified column is not found, all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted.
  • RowFilter: takes a comparison operator and a comparator. It compares each row key with the comparator using the comparison operator and if the comparison returns true, it returns all the key-values in that row.
  • QualifierFilter: takes a comparison operator and a comparator. It compares each qualifier name with the comparator using the comparison operator and if the comparison returns true, it returns all the key-values in that column.
  • ColumnRangeFilter: takes either minColumn, maxColumn, or both. Returns only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not. If you don’t want to set the minColumn or the maxColumn, you can pass in an empty argument.
  • ValueFilter: takes a comparison operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value.
  • PrefixFilter: takes a single argument, a prefix of a row key. It returns only those key-values present in a row that start with the specified row prefix.
  • SingleColumnValueExcludeFilter: takes the same arguments and behaves the same as SingleColumnValueFilter. However, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.
  • ColumnCountGetFilter: takes one argument, a limit. It returns the first limit number of columns in the table.
  • InclusiveStopFilter: takes one argument, a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row.
  • DependentColumnFilter: takes two required arguments, a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp.
  • FirstKeyOnlyFilter: takes no arguments. Returns the key portion of the first key-value pair.
  • KeyOnlyFilter: takes no arguments. Returns the key portion of each key-value pair.
  • FamilyFilter: takes a comparison operator and comparator. It compares each family name with the comparator using the comparison operator and if the comparison returns true, it returns all the key-values in that family.
  • CustomFilter: You can create a custom filter by implementing the Filter class.

32. How do we back up a HBase cluster?

There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has benefits and limitations.
Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example, if it is being used as a back-end process and not serving front-end webpages.
  • Stop HBase: Stop the HBase services first.
  • Distcp: Distcp can be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster, or to a different cluster (see the example command at the end of this answer).
  • Restore: The backup of the HBase directory from HDFS is copied onto the ‘real’ HBase directory via distcp. The act of copying these files, creates new HDFS metadata, which is why a restore of the NameNode edits from the time of the HBase backup isn’t required for this kind of restore, because it’s a restore (via distcp) of a specific HDFS directory (i.e., the HBase part) not the entire HDFS file-system.
Live Cluster Backup
Environments that cannot handle downtime use a live cluster backup.
  • CopyTable: Copy table utility could either be used to copy data from one table to another on the same cluster, or to copy data to another table on another cluster.
  • Export: Export approach dumps the content of a table to HDFS on the same cluster.
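A hedged example of the distcp step in a full-shutdown backup (the cluster hostnames and target path are hypothetical):
hadoop distcp hdfs://prod-namenode:8020/hbase hdfs://backup-namenode:8020/hbase-backup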

33. How HBase Handles the write failure?

Failures are common in large distributed systems, and HBase is no exception.
If the server hosting a MemStore that has not yet been flushed crashes, the data that was in memory but not yet persisted is lost. HBase safeguards against that by writing to the WAL before the write completes. Every server that is part of the HBase cluster keeps a WAL to record changes as they happen. The WAL is a file on the underlying file system. A write isn't considered successful until the new WAL entry is successfully written. This guarantee makes HBase as durable as the file system backing it. Most of the time, HBase is backed by the Hadoop Distributed File System (HDFS). If HBase goes down, the data that was not yet flushed from the MemStore to an HFile can be recovered by replaying the WAL.

34. While reading data from HBase, from which three places data will be reconciled before returning the value?

The read process will go through the following process sequentially:
  • For reading the data, the scanner first looks for the row cell in the block cache, where all the recently read key-value pairs are stored.
  • If the scanner fails to find the required result, it moves to the MemStore, which, as we know, is the write cache. There, it searches for the most recently written data that has not yet been flushed to an HFile.
  • At last, it will use Bloom filters and the block cache to load the data from the HFile.

35. Can you explain data versioning?

In addition to being a schema-less database, HBase is also versioned.
Every time you perform an operation on a cell, HBase implicitly stores a new version. Creating, modifying and deleting a cell are all treated identically; they all create new versions. When a cell exceeds the maximum number of versions, the extra records are dropped during the major compaction.
Instead of deleting an entire cell, you can operate on a specific version within that cell. Values within a cell are versioned, and versions are identified by their timestamp. If a version is not mentioned, then the current timestamp is used to retrieve the version. The default number of cell versions is three.

36. What is a Bloom filter and how does it help in searching rows?

HBase supports Bloom filters to improve the overall throughput of the cluster. An HBase Bloom filter is a space-efficient mechanism to test whether an HFile contains a specific row or row-column cell.
Without a Bloom filter, the only way to decide if a row key is present in an HFile is to check the HFile's block index, which stores the start row key of each block in the HFile. Since many rows can fall between two start keys, HBase has to load the block and scan the block's keys to figure out if that row key actually exists.

Top 50 Hadoop Interview Questions for 2017

1. Explain “Big Data” and what are five V’s of Big Data?

“Big data” is the term for a collection of large and complex data sets that are difficult to process using relational database management tools or traditional data processing applications. It is difficult to capture, curate, store, search, share, transfer, analyze, and visualize Big Data. Big Data has emerged as an opportunity for companies: they can now successfully derive value from their data and will have a distinct advantage over their competitors with enhanced business decision-making capabilities.
Tip: It will be a good idea to talk about the 5Vs in such questions, whether it is asked specifically or not!
Volume: The volume represents the amount of data which is growing at an exponential rate i.e. in Petabytes and Exabytes. 
Velocity: Velocity refers to the rate at which data is growing, which is very fast. Today, yesterday’s data are considered as old data. Nowadays, social media is a major contributor in the velocity of growing data.
Variety: Variety refers to the heterogeneity of data types. In other words, the data which is gathered comes in a variety of formats like videos, audios, CSV, etc. So, these various formats represent the variety of data.
Veracity: Veracity refers to the data in doubt or the uncertainty of available data due to data inconsistency and incompleteness. Available data can sometimes get messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control. The volume is often the reason behind the lack of quality and accuracy in the data.
Value: It is all well and good to have access to big data but unless we can turn it into a value it is useless. By turning it into value I mean, Is it adding to the benefits of the organizations? Is the organization working on Big Data achieving high ROI (Return On Investment)? Unless, it adds to their profits by working on Big Data, it is useless.
As we know Big Data is growing at an accelerating rate, so the factors associated with it are also evolving. To go through them and understand it in detail, I recommend you to go through Big Data Tutorial blog.

2. What is Hadoop and what are its components?

When “Big Data” emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us various services or tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which can’t be done efficiently and effectively using traditional systems.
Tip: Now, while explaining Hadoop, you should also explain the main components of Hadoop, i.e.:
Storage unit– HDFS (NameNode, DataNode)
Processing framework– YARN (ResourceManager, NodeManager)

3. What are HDFS and YARN?

HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows master and slave topology.
Tip: It is recommended to explain the HDFS components too i.e.
NameNode: NameNode is the master node in the distributed environment and it maintains the metadata information for the blocks of data stored in HDFS like block location, replication factors etc.
DataNode: DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.
YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.
Tip: Similarly, as we did in HDFS, we should also explain the two components of YARN:   
ResourceManager: It receives the processing requests, and then passes the parts of requests to corresponding NodeManagers accordingly, where the actual processing takes place. It allocates resources to applications based on the needs.
NodeManager: NodeManager is installed on every DataNode and it is responsible for execution of the task on every single DataNode.
If you want to learn in detail about HDFS & YARN go through Hadoop Tutorial blog.

4. Tell me about the various Hadoop daemons and their roles in a Hadoop cluster.

Generally, approach this question by first explaining the HDFS daemons i.e. NameNode, DataNode and Secondary NameNode, then moving on to the YARN daemons i.e. ResourceManager and NodeManager, and lastly explaining the JobHistoryServer.
NameNode: It is the master node which is responsible for storing the metadata of all the files and directories. It has information about blocks, that make a file, and where those blocks are located in the cluster.
Datanode: It is the slave node that contains the actual data.
Secondary NameNode: It periodically merges the changes (edit log) with the FsImage (Filesystem Image), present in the NameNode. It stores the modified FsImage into persistent storage, which can be used in case of failure of NameNode.
ResourceManager: It is the central authority that manages resources and schedules applications running on top of YARN.
NodeManager: It runs on slave machines, and is responsible for launching the application’s containers (where applications execute their part), monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.
JobHistoryServer: It maintains information about MapReduce jobs after the Application Master terminates.

Hadoop HDFS

5. Compare HDFS with Network Attached Storage (NAS).

In this question, first explain NAS and HDFS, and then compare their features as follows:
  • Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network providing data access to a heterogeneous group of clients. NAS can either be a hardware or software which provides services for storing and accessing files. Whereas Hadoop Distributed File System (HDFS) is a distributed filesystem to store data using commodity hardware.
  • In HDFS Data Blocks are distributed across all the machines in a cluster. Whereas in NAS data is stored on a dedicated hardware.
  • HDFS is designed to work with MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computations.
  • HDFS uses commodity hardware, which is cost-effective, whereas NAS uses high-end storage devices which come at a high cost.

6. What are the basic differences between relational database and HDFS?

Here are the key differences between a relational database (RDBMS) and Hadoop:
  • Data Types: RDBMS relies on structured data and the schema of the data is always known, whereas Hadoop can store any kind of data, be it structured, unstructured or semi-structured.
  • Processing: RDBMS provides limited or no processing capabilities, whereas Hadoop allows us to process the data distributed across the cluster in a parallel fashion.
  • Schema on Read vs. Write: RDBMS is based on ‘schema on write’, where schema validation is done before loading the data; on the contrary, Hadoop follows a ‘schema on read’ policy.
  • Read/Write Speed: In RDBMS, reads are fast because the schema of the data is already known, whereas writes are fast in HDFS because no schema validation happens during an HDFS write.
  • Cost: RDBMS is licensed software, so you have to pay for it, whereas Hadoop is an open source framework, so there is no licence fee.
  • Best Fit Use Case: RDBMS is used for OLTP (Online Transactional Processing) systems, whereas Hadoop is used for data discovery, data analytics or OLAP systems.

7. List the difference between Hadoop 1 and Hadoop 2.

This is an important question and while answering this question, we have to mainly focus on two points i.e. Passive NameNode and YARN architecture.
In Hadoop 1.x, “NameNode” is the single point of failure. In Hadoop 2.x, we have Active and Passive “NameNodes”. If the active “NameNode” fails, the passive “NameNode” takes charge. Because of this, high availability can be achieved in Hadoop 2.x.
Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource. MRV2 is a particular type of distributed application that runs the MapReduce framework on top of YARN. Other tools can also perform data processing via YARN, which was a problem in Hadoop 1.x.

8. What are active and passive “NameNodes”? 

In HA (High Availability) architecture, we have two NameNodes – Active “NameNode” and Passive “NameNode”.
  • Active “NameNode” is the “NameNode” which works and runs in the cluster.
  • Passive “NameNode” is a standby “NameNode”, which has similar data as active “NameNode”.
When the active “NameNode” fails, the passive “NameNode” replaces the active “NameNode” in the cluster. Hence, the cluster is never without a “NameNode” and so it never fails.

9. Why does one remove or add nodes in a Hadoop cluster frequently?

One of the most attractive features of the Hadoop framework is its utilization of commodity hardware. However, this leads to frequent “DataNode” crashes in a Hadoop cluster. Another striking feature of the Hadoop framework is its ease of scaling in accordance with the rapid growth in data volume. Because of these two reasons, one of the most common tasks of a Hadoop administrator is to commission (add) and decommission (remove) “Data Nodes” in a Hadoop cluster.
Read this blog to get a detailed understanding on commissioning and decommissioning nodes in a Hadoop cluster.

10. What happens when two clients try to access the same file in the HDFS?

HDFS supports exclusive writes only.
When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “NameNode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.

11. How does NameNode tackle DataNode failures?

NameNode periodically receives a Heartbeat (signal) from each of the DataNodes in the cluster, which implies that the DataNode is functioning properly.
A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send a heartbeat message, after a specific period of time it is marked dead.
The NameNode replicates the blocks of dead node to another DataNode using the replicas created earlier.

12. What will you do when NameNode is down?

The NameNode recovery process involves the following steps to get the Hadoop cluster up and running:
1. Use the file system metadata replica (FsImage) to start a new NameNode.
2. Then, configure the DataNodes and clients so that they can acknowledge the newly started NameNode.
3. Now the new NameNode will start serving the clients after it has completed loading the last checkpoint FsImage (for metadata information) and received enough block reports from the DataNodes.
However, on large Hadoop clusters this NameNode recovery process may consume a lot of time, and it becomes an even greater challenge in the case of routine maintenance. Therefore, we have the HDFS High Availability Architecture, which is covered in the HA architecture blog.

13. What is a checkpoint?

In brief, “Checkpointing” is a process that takes an FsImage, edit log and compacts them into a new FsImage. Thus, instead of replaying an edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by Secondary NameNode. 

14. How is HDFS fault tolerant? 

When data is stored over HDFS, the NameNode replicates the data to several DataNodes. The default replication factor is 3. You can change the replication factor as per your need. If a DataNode goes down, the NameNode will automatically copy the data to another node from the replicas and make the data available. This provides fault tolerance in HDFS.

15. Can NameNode and DataNode be a commodity hardware? 

The smart answer to this question would be that DataNodes are commodity hardware like personal computers and laptops, as they store data and are required in large numbers. But from your experience, you can tell that NameNode is the master node and it stores metadata about all the blocks stored in HDFS. It requires high memory (RAM) space, so NameNode needs to be a high-end machine with good memory space.

16. Why do we use HDFS for applications having large data sets and not when there are a lot of small files? 

HDFS is more suitable for large amounts of data in a single file than for small amounts of data spread across multiple files. As you know, the NameNode stores the metadata information about the file system in RAM. Therefore, the amount of memory places a limit on the number of files in my HDFS file system. In other words, too many files will lead to the generation of too much metadata, and storing this metadata in RAM will become a challenge. As a rule of thumb, the metadata for a file, block or directory takes about 150 bytes.

17. How do you define “block” in HDFS? What is the default block size in Hadoop 1 and in Hadoop 2? Can it be changed? 

Blocks are nothing but the smallest continuous location on your hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.
Hadoop 1 default block size: 64 MB
Hadoop 2 default block size:  128 MB
Yes, blocks can be configured. The dfs.block.size parameter can be used in the hdfs-site.xml file to set the size of a block in a Hadoop environment.
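For instance, a hedged hdfs-site.xml snippet that sets a 128 MB block size (dfs.blocksize is the newer name for the same setting; the value is in bytes):
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>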

18. What does ‘jps’ command do? 

The ‘jps’ command helps us to check if the Hadoop daemons are running or not. It shows all the Hadoop daemons i.e namenode, datanode, resourcemanager, nodemanager etc. that are running on the machine.

19. How do you define “Rack Awareness” in Hadoop?

Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, so as to minimize network traffic between “DataNodes” in different racks while still providing fault tolerance. Let's say we consider a replication factor of 3 (the default); the policy is that “for every block of data, two copies will exist in one rack, and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
To know rack awareness in more detail, refer to the HDFS architecture blog. 

20. What is “speculative execution” in Hadoop? 

If a node appears to be executing a task slower, the master node can redundantly execute another instance of the same task on another node. Then, the task which finishes first will be accepted and the other one is killed. This process is called “speculative execution”.
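Speculative execution can be turned on or off via configuration; a hedged mapred-site.xml sketch using the Hadoop 2 property names:
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>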

21. How can I restart “NameNode” or all the daemons in Hadoop? 

This question can have two answers, we will discuss both the answers. We can restart NameNode by following methods:
  1. You can stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command and then start the NameNode using the ./sbin/hadoop-daemon.sh start namenode command.
  2. To stop and start all the daemons, use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all the daemons first and then start all of them.
These script files reside in the sbin directory inside the Hadoop directory.

22. What is the difference between an “HDFS Block” and an “Input Split”?

The “HDFS Block” is the physical division of the data while “Input Split” is the logical division of the data. HDFS divides data into blocks for storage, whereas for processing, MapReduce divides the data into input splits and assigns them to mapper functions.

23. Name the three modes in which Hadoop can run.

The three modes in which Hadoop can run are as follows:
  1. Standalone (local) mode: This is the default mode if we don't configure anything. In this mode, all the components of Hadoop, such as NameNode, DataNode, ResourceManager, and NodeManager, run as a single Java process. This uses the local filesystem.
  2. Pseudo-distributed mode: A single-node Hadoop deployment is considered to be running Hadoop in pseudo-distributed mode. In this mode, all the Hadoop services, including both the master and the slave services, are executed on a single compute node.
  3. Fully distributed mode: A Hadoop deployment in which the Hadoop master and slave services run on separate nodes is said to be in fully distributed mode.

Hadoop MapReduce

24. What is “MapReduce”? What is the syntax to run a “MapReduce” program? 

It is a framework/programming model that is used for processing large data sets over a cluster of computers using parallel programming. The syntax to run a MapReduce program is hadoop jar hadoop_jar_file.jar /input_path /output_path.
If you have any doubt in MapReduce or want to revise your concepts you can refer this MapReduce tutorial.

25. What are the main configuration parameters in a “MapReduce” program?

The main configuration parameters which users need to specify in “MapReduce” framework are:
  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format of data
  • Output format of data
  • Class containing the map function
  • Class containing the reduce function
  • JAR file containing the mapper, reducer and driver classes

26. State the reason why we can’t perform “aggregation” (addition) in mapper? Why do we need the “reducer” for this?

This answer includes many points, so we will go through them sequentially.
  • We cannot perform “aggregation” (addition) in mapper because sorting does not occur in the “mapper” function. Sorting occurs only on the reducer side and without sorting aggregation cannot be done.
  • During “aggregation”, we need the output of all the mapper functions, which may not be possible to collect in the map phase as mappers may be running on different machines where the data blocks are stored.
  • And lastly, if we try to aggregate data at mapper, it requires communication between all mapper functions which may be running on different machines. So, it will consume high network bandwidth and can cause network bottlenecking.

27. What is the purpose of “RecordReader” in Hadoop?

The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “Input Format”.

28. Explain “Distributed Cache” in a “MapReduce Framework”.

Distributed Cache can be explained as a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework will make it available on each and every data node where your map/reduce tasks are running. Then you can access the cached file as a local file in your Mapper or Reducer job.

29. How do “reducers” communicate with each other? 

This is a tricky question. The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.

30. What does a “MapReduce Partitioner” do?

A “MapReduce Partitioner” makes sure that all the values of a single key go to the same “reducer”, thus allowing even distribution of the map output over the “reducers”. It redirects the “mapper” output to the “reducer” by determining which “reducer” is responsible for the particular key.

31. How will you write a custom partitioner?

A custom partitioner for a Hadoop job can be written easily by following the steps below:
  • Create a new class that extends the Partitioner class.
  • Override the getPartition method in it.
  • Add the custom partitioner to the job using the setPartitionerClass method, or add the custom partitioner to the job as a config file (see the sketch after this list).
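A minimal, hedged sketch of such a partitioner under the newer MapReduce API; the class name and the Text/IntWritable key and value types are hypothetical:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // route each key to a reducer based on its hash, bounded by the number of reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
It would then be registered in the driver with job.setPartitionerClass(CustomPartitioner.class);.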

32. What is a “Combiner”? 

A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.

33. What do you know about “SequenceFileInputFormat”?

“SequenceFileInputFormat” is an input format for reading sequence files. Sequence files are a specific compressed binary file format which is optimized for passing data between the output of one “MapReduce” job and the input of some other “MapReduce” job.
Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.

Apache Pig

34. What are the benefits of Apache Pig over MapReduce? 

Apache Pig is a platform, developed by Yahoo, used to analyze large data sets by representing them as data flows. It is designed to provide an abstraction over MapReduce, reducing the complexities of writing a MapReduce program.
  • Pig Latin is a high-level data flow language, whereas MapReduce is a low-level data processing paradigm.
  • Without writing complex Java implementations in MapReduce, programmers can achieve the same implementations very easily using Pig Latin.
  • Apache Pig reduces the length of the code by approx 20 times (according to Yahoo). Hence, this reduces the development period by almost 16 times.
  • Pig provides many built-in operators to support data operations like joins, filters, ordering, sorting etc. Whereas to perform the same function in MapReduce is a humongous task.
  • Performing a Join operation in Apache Pig is simple. Whereas it is difficult in MapReduce to perform a Join operation between the data sets, as it requires multiple MapReduce tasks to be executed sequentially to fulfill the job.
  • In addition, pig also provides nested data types like tuples, bags, and maps that are missing from MapReduce.

35. What are different data types in Pig Latin?

Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.
Atomic data types: Atomic or scalar data types are the basic data types which are used in all the languages like string, int, float, long, double, char[], byte[].
Complex Data Types: Complex data types are Tuple, Map and Bag.
To know more about these data types, you can go through our Pig tutorial blog.

36. What are the different relational operations in “Pig Latin” you worked with?

Different relational operators are:
  1. foreach
  2. order by
  3. filter
  4. group
  5. distinct
  6. join
  7. limit

37. What is a UDF?

If some functions are unavailable in the built-in operators, we can programmatically create User Defined Functions (UDFs) to bring those functionalities, using other languages like Java, Python, Ruby, etc., and embed them in a Pig script file.

Apache Hive

38. What is “SerDe” in “Hive”?

Apache Hive, developed by Facebook, is a data warehouse system built on top of Hadoop and is used for analyzing structured and semi-structured data. Hive abstracts the complexity of Hadoop MapReduce.
The “SerDe” interface allows you to instruct “Hive” about how a record should be processed. A “SerDe” is a combination of a “Serializer” and a “Deserializer”. “Hive” uses “SerDe” (and “FileFormat”) to read and write the table’s row.
To know more about Apache Hive, you can go through this Hive tutorial blog.

39. Can the default “Hive Metastore” be used by multiple users (processes) at the same time?

“Derby database” is the default “Hive Metastore”. Multiple users (processes) cannot access it at the same time. It is mainly used to perform unit tests.

40. What is the default location where “Hive” stores table data?

The default location where Hive stores table data is inside HDFS in /user/hive/warehouse.

Apache HBase

41. What is Apache HBase?

HBase is an open source, multidimensional, distributed, scalable and a NoSQL database written in Java. HBase runs on top of HDFS (Hadoop Distributed File System) and provides BigTable (Google) like capabilities to Hadoop. It is designed to provide a fault tolerant way of storing large collection of sparse data sets. HBase achieves high throughput and low latency by providing faster Read/Write Access on huge data sets.
To know more about HBase you can go through our HBase tutorial blog.

42. What are the components of Apache HBase?

HBase has three major components, i.e. HMaster Server, HBase RegionServer and Zookeeper.
  • Region Server: A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
  • HMaster: It coordinates and manages the Region Server (similar as NameNode manages DataNode in HDFS).
  • ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.
To know more, you can go through this HBase architecture blog.

43. What are the components of Region Server?

The components of a Region Server are:
  • WAL: Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage.
  • Block Cache: Block Cache resides in the Region Server. It stores the frequently read data in memory.
  • MemStore: It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region.
  • HFile: HFile is stored in HDFS. It stores the actual cells on the disk.

44. Explain “WAL” in HBase?

Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage. It is used in case of failure to recover the data sets.

45. Mention the differences between “HBase” and “Relational Databases”?

HBase is an open source, multidimensional, distributed, scalable and a NoSQL database written in Java. HBase runs on top of HDFS and provides BigTable like capabilities to Hadoop. Let us see the differences between HBase and relational database.
  • HBase is schema-less, whereas a relational database is a schema-based database.
  • HBase is a column-oriented data store, whereas a relational database is a row-oriented data store.
  • HBase is used to store de-normalized data, whereas a relational database is used to store normalized data.
  • HBase contains sparsely populated tables, whereas a relational database contains thin tables.
  • Automated partitioning is done in HBase, whereas a relational database has no such provision or built-in support for partitioning.

Apache Spark

46. What is Apache Spark?

The answer to this question is, Apache Spark is a framework for real time data analytics in a distributed computing environment. It executes in-memory computations to increase the speed of data processing.
It can be up to 100x faster than MapReduce for large-scale data processing by exploiting in-memory computations and other optimizations.

47. Can you build “Spark” with any particular Hadoop version?

Yes, one can build “Spark” for a specific Hadoop version. Check out this blog to learn more about building YARN and HIVE on Spark. 

48. Define RDD.

RDD is the acronym for Resilient Distributed Datasets – a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed; RDDs are a key component of Apache Spark.

Oozie & ZooKeeper

49. What is Apache ZooKeeper and Apache Oozie?

Apache ZooKeeper coordinates with various services in a distributed environment. It saves a lot of time by performing synchronization, configuration maintenance, grouping and naming.
Apache Oozie is a scheduler which schedules Hadoop jobs and binds them together as one logical work. There are two kinds of Oozie jobs:
  • Oozie Workflow: These are sequential sets of actions to be executed. You can think of it as a relay race, where each athlete waits for the previous one to complete his part.
  • Oozie Coordinator: These are the Oozie jobs which are triggered when the data is made available to them. Think of this as the response-stimulus system in our body. In the same manner as we respond to an external stimulus, an Oozie coordinator responds to the availability of data, and it rests otherwise.

50. How do you configure an “Oozie” job in Hadoop?

“Oozie” is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs such as “Java MapReduce”, “Streaming MapReduce”, “Pig”, “Hive” and “Sqoop”.
To understand “Oozie” in detail and learn how to configure an “Oozie” job, do check out this introduction to Apache Oozie blog.
Feeling overwhelmed with all the questions the interviewer might ask in your Hadoop interview? Now it is time to go through a series of Hadoop interview questions which covers different aspects of Hadoop framework. It’s never too late to strengthen your basics. Learn Hadoop from industry experts while working with real-life use cases.


