- What is Hadoop?
Ans. Hadoop is an open-source software framework equipped with relevant tools and services to process and store Big Data. It is used to store a massive amount of data, handle virtually limitless concurrent tasks, and run applications on clusters of commodity hardware.
- What are the primary components of Hadoop?
Ans. The primary components of Hadoop are:
- Core Components – HDFS, Hadoop MapReduce, Hadoop Common, and YARN
- Data Storage Component – HBase
- Management and Monitoring Components – Ambari, Oozie, and ZooKeeper
- Data Serialization components – Thrift and Avro
- Integration Components – Apache Flume, Sqoop, and Chukwa
- Data Intelligence Components – Apache Mahout and Drill
- Name the different Hadoop configuration files.
Ans. The different Hadoop configuration files are:
- hadoop-env.sh
- core-site.xml
- mapred-site.xml
- hdfs-site.xml
- yarn-site.xml
- masters
- slaves
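As a minimal sketch of how these files are consumed (assuming a standard Hadoop client classpath; the class name ShowConfig is hypothetical), Hadoop code loads them through the Configuration class, which layers the *-site.xml values over the built-in defaults:

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // new Configuration() reads core-default.xml and core-site.xml from the classpath;
        // hdfs-site.xml and yarn-site.xml are layered in by the HDFS and YARN clients.
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}
```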
- How are Hadoop and Big Data co-related?
Ans. Big Data is an asset, while Hadoop is an open-source software framework that provides a set of tools and services to deal with that asset. Hadoop is used to process, store, and analyze complex, largely unstructured data sets with specific algorithms and methods to derive actionable insights. So yes, they are related, but they are not alike.
- Why is Hadoop used in Big Data analytics?
Ans. Hadoop is an open-source framework written in Java that processes large volumes of data on clusters of commodity hardware. It also allows running many exploratory data analysis tasks on full datasets, without sampling.
Features that make Hadoop an essential requirement for Big Data are –
- Massive data collection and storage
- Data processing
- Runs independently on clusters of commodity hardware
- What is the command for starting all the Hadoop daemons together?
Ans. The command for starting all the Hadoop daemons together is –
./sbin/start-all.sh
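Note that in recent Hadoop releases start-all.sh is deprecated; the equivalent is to start HDFS and YARN separately –
./sbin/start-dfs.sh
./sbin/start-yarn.sh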
- What are the most common input formats in Hadoop?
Ans. The most common input formats in Hadoop are –
- Key-value input format
- Sequence file input format
- Text input format
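As a hedged illustration, the input format is selected on the Job object in the driver (the class names below come from the org.apache.hadoop.mapreduce.lib.input package; the job name is arbitrary):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");

        // Text input format (the default): key = byte offset, value = the line itself.
        job.setInputFormatClass(TextInputFormat.class);

        // Key-value input format: each line is split into key and value at a separator.
        // job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Sequence file input format: reads Hadoop's binary key-value SequenceFile container.
        // job.setInputFormatClass(SequenceFileInputFormat.class);
    }
}
```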
- What are the different file formats that can be used in Hadoop?
Ans. File formats commonly used with Hadoop include –
- CSV
- JSON
- Columnar
- Sequence files
- Avro
- Parquet
- Name the most popular data management tools used with Edge Nodes in Hadoop.
Ans. The most commonly used data management tools that work with Edge Nodes in Hadoop are –
- Oozie
- Ambari
- Pig
- Flume
- Name the modes in which Hadoop can run.
Ans. Hadoop can run in three modes, which are –
- Standalone mode
- Pseudo Distributed mode (Single node cluster)
- Fully distributed mode (Multiple node cluster)
- What is the functionality of the ‘jps’ command?
Ans. The ‘jps’ command enables us to check whether Hadoop daemons such as the NameNode, DataNode, ResourceManager, and NodeManager are running on the machine.
- What is a Mapper?
Ans. The Mapper is the first phase of a MapReduce job. It reads the data stored in HDFS blocks and transforms it into key-value pairs. By default, one map task runs for every input split, which usually corresponds to one HDFS block.
- Mention the basic parameters of a Mapper.
Ans. The basic parameters of a Mapper are its input and output key-value types. In the classic word-count case these are –
- LongWritable and Text (input key and value)
- Text and IntWritable (output key and value)
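For illustration, here is a minimal word-count style Mapper using those types (the class name TokenCountMapper is hypothetical):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value: LongWritable (byte offset) and Text (line).
// Output key/value: Text (word) and IntWritable (count).
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // emit (word, 1) for every token in the line
        }
    }
}
```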
- What is Hadoop streaming?
Ans. Hadoop Streaming is a generic utility/API that enables a user to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer, written in languages such as Python, Perl, or Ruby.
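A typical invocation, with hypothetical mapper.py and reducer.py scripts (the streaming jar location varies by Hadoop version and distribution), looks like –
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /user/hadoop/input -output /user/hadoop/output -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py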
- What is NAS?
Ans. NAS is the abbreviation for Network-Attached Storage (NAS). It is a file-level computer data storage server connected to a computer network, and it offers data access to a heterogeneous group of clients.
- What is Avro Serialization in Hadoop?
Ans. Avro Serialization in Hadoop is the process through which the states of objects or data structures are translated into a binary or textual form, so that the data can be transported over the network or stored on persistent storage. In Avro, serialization is also known as marshalling, and deserialization is called unmarshalling.
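A brief sketch of Avro serialization using the generic API (the record schema, field values, and class name are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializeDemo {
    public static void main(String[] args) throws Exception {
        // A hypothetical record schema with a single string field.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
            + "[{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "example");

        // Serialize (marshal) the record into Avro's compact binary form.
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();

        System.out.println("Serialized record into " + out.size() + " bytes");
    }
}
```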
- What is HDFS and what are its components?
Ans. HDFS, or the Hadoop Distributed File System, runs on commodity hardware and is highly fault-tolerant. HDFS provides file permissions and authentication and is suitable for distributed storage and processing. It is composed of three elements: NameNode, DataNode, and Secondary NameNode.
- What is FSCK?
Ans. FSCK (File System Check) is a command used by HDFS. It checks whether any file is corrupt, whether blocks are under-replicated or missing replicas, and whether any blocks are missing entirely. FSCK generates a summary report that lists the overall health of the file system.
- What are the differences between NAS and HDFS?
Ans. The differences between NAS and HDFS are:
| NAS | HDFS |
| --- | --- |
| Runs on a single machine | Runs on a cluster of different machines |
| No data redundancy | Data redundancy due to the replication protocol |
| Stores data on dedicated hardware | Data blocks are distributed across the local drives of the cluster machines |
| Does not work with Hadoop MapReduce | Works with Hadoop MapReduce |
- What happens when multiple clients try to write on the same HDFS file?
Ans. Multiple users cannot write to the same HDFS file at the same time. When the first client opens the file for writing, the NameNode grants it a lease on that file; write requests from a second client are rejected because HDFS supports exclusive writes only.
- Explain active and passive “NameNodes”?
Ans. A NameNode maintains all the metadata information of the data nodes. There are two NameNodes in a HA (High Availability) architecture, namely Active NameNode and Passive or Standby NameNode.
The Active NameNode works and runs in the cluster, while the Passive (Standby) NameNode holds the same metadata as the Active NameNode. If the Active NameNode fails, the Passive NameNode takes over its role in the cluster. Thus, the cluster is never left without a NameNode, and there is no single point of failure.
- How does the NameNode handle DataNode failures in Hadoop?
Ans. HDFS has a master-slave architecture in which the NameNode is the master and the DataNodes are the slaves. The NameNode periodically receives a Heartbeat signal from each DataNode in the cluster, indicating that the DataNode is functioning properly.
A block report contains the list of all the blocks on a DataNode. If a DataNode stops sending heartbeats, it is marked dead or non-functional after a configurable timeout (10 minutes and 30 seconds with the default settings). Once a DataNode is declared dead, the NameNode re-replicates the blocks it held to other DataNodes, using the replicas created earlier.
- What is the use of dfsadmin -refreshNodes and rmadmin -refreshNodes commands?
Ans. The uses of the dfsadmin -refreshNodes and rmadmin -refreshNodes commands are:
- The dfsadmin -refreshNodes command is run against the NameNode; it makes the NameNode re-read its include/exclude host files so that newly commissioned or decommissioned DataNodes are recognized without a restart.
- The rmadmin -refreshNodes command carries out the same administrative task for the ResourceManager, refreshing its list of NodeManager hosts.
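For reference, these commands are issued through the hdfs and yarn command-line tools –
hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes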
- Which command will you use to copy data from the local system onto HDFS?
Ans. The following command is used to copy data from the local system onto HDFS:
- The hadoop fs -copyFromLocal command copies a file from the local file system to HDFS.
- Format: hadoop fs -copyFromLocal [source] [destination]
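- Example (with hypothetical paths): hadoop fs -copyFromLocal /home/user/sales.csv /user/hadoop/sales.csv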
- Which commands will you use to find the status of blocks and FileSystem health?
Ans. The following command is used to check the status of the blocks:
- hdfs fsck <path> -files -blocks
The following command is used to check the health status of the FileSystem:
- hdfs fsck / -files -blocks -locations > dfs-fsck.log
- What is Hadoop MapReduce?
Ans. Hadoop MapReduce is a framework used to process large data sets in parallel across a Hadoop cluster.
- How does the Hadoop MapReduce function?
Ans. When a MapReduce job is in progress, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, including issuing tasks, verifying task completion, and copying data between the nodes, and then aggregates the results.
- Name Hadoop-specific data types that are used in a MapReduce program.
Ans. Some Hadoop-specific data types that are used in your MapReduce program are:
- IntWritable
- FloatWritable
- ArrayWritable
- DoubleWritable
- MapWritable
- ObjectWritable
- BooleanWritable
- LongWritable
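A small sketch of how these Writable wrappers behave in plain Java, outside any job (the class name and values used here are arbitrary):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Writables wrap Java values with Hadoop's own serialization mechanism.
        IntWritable count = new IntWritable(42);
        Text name = new Text("hadoop");

        // MapWritable holds Writable keys and values, e.g. per-key counters.
        MapWritable counters = new MapWritable();
        counters.put(name, count);

        System.out.println(name + " -> " + ((IntWritable) counters.get(name)).get());
    }
}
```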
- Name the major configuration parameters required in a MapReduce program.
Ans. The following are the major configuration parameters in a MapReduce program:
- Input location of the jobs in HDFS
- Output location of the jobs in HDFS
- The input format of data
- The output format of data
- Classes containing a map function
- Classes containing a reduce function
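These parameters come together in the job driver. Below is a minimal sketch that wires them up, assuming the word-count Mapper shown earlier and a hypothetical TokenCountReducer class:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Classes containing the map and reduce functions.
        job.setMapperClass(TokenCountMapper.class);
        job.setReducerClass(TokenCountReducer.class);   // hypothetical reducer class

        // Input and output formats of the data.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations of the job in HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```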
- What is Apache YARN?
Ans. YARN is an integral part of Hadoop 2.0 and is an abbreviation for Yet Another Resource Negotiator. It is a resource management layer of Hadoop and allows different data processing engines like graph processing, interactive processing, stream processing, and batch processing to run and process data stored in HDFS.
- Name the main components of Apache YARN.
Ans. ResourceManager and NodeManager are the two main components of YARN.
- Name various Hadoop and YARN daemons.
Ans. The Hadoop (HDFS) daemons are –
- NameNode
- DataNode
- Secondary NameNode
The YARN daemons are –
- ResourceManager
- NodeManager
- JobHistoryServer
- What is the standard path for Hadoop Sqoop scripts?
Ans. The standard path for Hadoop Sqoop scripts is –
/usr/bin/sqoop
- What is the main difference between Sqoop and distCP?
Ans. distCp is used for transferring data between Hadoop clusters, while Sqoop is used only for transferring data between Hadoop and relational databases (RDBMS).
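For illustration, hedged examples of each (the connection string, table name, and paths are placeholders) –
sqoop import --connect jdbc:mysql://dbhost/sales --table orders --username dbuser --target-dir /user/hadoop/orders
hadoop distcp hdfs://cluster1:8020/data hdfs://cluster2:8020/data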
- Explain the different components of a Hive architecture?
Ans. The different components of Hive architecture are:
- User Interface: It offers an interface between the user and the hive. It enables users to submit queries to the system. The user interface creates a session handle to the query and sends it to the compiler to generate an execution plan for it.
- Compiler: It generates the execution plan.
- Execution Engine: It acts as a bridge between Hive and Hadoop and processes the query by executing the plan produced by the compiler.
- Metastore: It stores the metadata information about tables and partitions and sends it to the compiler when the compiler issues a metadata request during query compilation.
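As a hedged sketch, a client typically drives this pipeline through HiveServer2's JDBC interface (requires the hive-jdbc driver on the classpath; the host, port, credentials, and table name below are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // The submitted query passes through Hive's compiler, metastore lookups,
        // and execution engine before results are returned to the client.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver-host:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}
```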
- Name the components used in Hive query processors?
Ans. The components used in Hive query processors are:
- Parser
- Optimizer
- Operators
- Execution Engine
- Semantic Analyzer
- User-Defined Functions
- Logical Plan Generation
- Physical Plan Generation
- What are the major components of Hive?
Ans. Hive consists of three major components:
- Clients
- Services
- Storage and Computing
- Explain the key components of HBase?
Ans. The main/key components of HBase are:
- Region Server
It hosts the regions of HBase tables, which are split horizontally into regions based on their row key ranges. Each region server is a worker node and handles the read, write, update, and delete requests from clients.
- HMaster
It assigns regions to RegionServers for load balancing and monitors all the RegionServer instances in the cluster. It also handles schema changes and other metadata operations requested by clients.
- ZooKeeper
It offers a distributed coordination service to maintain the server state in the cluster. It identifies the servers that are alive and available and provides server failure notifications.
- Name the different operational commands in HBase at the record level and table level?
Ans. The operational commands in HBase are:
Record Level Operational Commands:
- Get
- Put
- Scan
- Increment
- Delete
Table Level Operational Commands:
- List
- Drop
- Describe
- Disable
- Enable
- Name some data manipulation commands of HBase.
Ans. The data manipulation commands of HBase are:
- Count
- Get
- Put
- Delete
- Deleteall
- Scan
- Truncate
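These shell commands correspond to operations in the HBase Java client API. A minimal sketch (the table name, column family, and ZooKeeper quorum are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");  // placeholder quorum

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table

            // put: write a cell into column family "cf", qualifier "name"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // get: read the row back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));

            // delete: remove the row (internally written as a tombstone marker)
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}
```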
- What are the different types of tombstone markers in HBase for deletion?
Ans. The three types of tombstone markers in HBase for deletion are:
- Family Delete Marker: Marks all the columns of a column family for deletion
- Version Delete Marker: Marks a single version of a column for deletion
- Column Delete Marker: Marks all the versions of a column for deletion