- What is Hadoop?
Ans. Hadoop is an open-source software framework equipped with relevant tools and services to process and store Big Data. It is used to store a massive amount of data, handle virtually limitless concurrent tasks, and run applications on clusters of commodity hardware.
- What are the primary components of Hadoop?
Ans. The primary components of Hadoop are:
- Core Components – HDFS, Hadoop MapReduce, Hadoop Common, and YARN
- Data Storage Component – HBase
- Management and Monitoring Components – Ambari, Oozie, and ZooKeeper
- Data Serialization components – Thrift and Avro
- Integration Components – Apache Flume, Sqoop, and Chukwa
- Data Intelligence Components – Apache Mahout and Drill
- Name the different Hadoop configuration files.
Ans. The different Hadoop configuration files are:
- hadoop-env.sh
- core-site.xml
- mapred-site.xml
- hdfs-site.xml
- yarn-site.xml
- masters
- slaves
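As a minimal sketch of how these files are consumed (assuming a standard Hadoop client classpath; the class name ShowConfig is hypothetical), Hadoop code loads them through the Configuration class, which layers the *-site.xml values over the built-in defaults:

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // new Configuration() reads core-default.xml and core-site.xml from the classpath;
        // hdfs-site.xml and yarn-site.xml are layered in by the HDFS and YARN clients.
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}
```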
- How are Hadoop and Big Data co-related?
Ans. Big Data is an asset, while Hadoop is an open-source software framework that provides a set of tools and services to deal with that asset. Hadoop is used to process, store, and analyze complex, largely unstructured data sets with specific algorithms and methods to derive actionable insights. So yes, they are related, but they are not alike.
- Why is Hadoop used in Big Data analytics?
Ans. Hadoop is an open-source framework written in Java that processes large volumes of data on clusters of commodity hardware. It also allows running many exploratory data analysis tasks on full datasets, without sampling.
Features that make Hadoop an essential requirement for Big Data are –
- Massive data collection and storage
- Data processing
- Runs independently on clusters of commodity hardware
- What is the command for starting all the Hadoop daemons together?
Ans. The command for starting all the Hadoop daemons together is –
./sbin/start-all.sh
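Note that in recent Hadoop releases start-all.sh is deprecated; the equivalent is to start HDFS and YARN separately –
./sbin/start-dfs.sh
./sbin/start-yarn.sh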
- What are the most common input formats in Hadoop?
Ans. The most common input formats in Hadoop are –
- Key-value input format
- Sequence file input format
- Text input format
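As a hedged illustration, the input format is selected on the Job object in the driver (the class names below come from the org.apache.hadoop.mapreduce.lib.input package; the job name is arbitrary):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");

        // Text input format (the default): key = byte offset, value = the line itself.
        job.setInputFormatClass(TextInputFormat.class);

        // Key-value input format: each line is split into key and value at a separator.
        // job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Sequence file input format: reads Hadoop's binary key-value SequenceFile container.
        // job.setInputFormatClass(SequenceFileInputFormat.class);
    }
}
```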
- What are the different file formats that can be used in Hadoop?
Ans. File formats commonly used with Hadoop include –
- CSV
- JSON
- Columnar
- Sequence files
- Avro
- Parquet
- Name the most popular data management tools used with Edge Nodes in Hadoop.
Ans. The most commonly used data management tools that work with Edge Nodes in Hadoop are –
- Oozie
- Ambari
- Pig
- Flume
- Name the modes in which Hadoop can run.
Ans. Hadoop can run in three modes, which are –
- Standalone mode
- Pseudo Distributed mode (Single node cluster)
- Fully distributed mode (Multiple node cluster)
- What is the functionality of the ‘jps’ command?
Ans. The ‘jps’ command enables us to check whether Hadoop daemons such as the NameNode, DataNode, ResourceManager, and NodeManager are running on the machine.
- What is a Mapper?
Ans. The Mapper is the first phase of a MapReduce job. It reads the data stored in HDFS blocks and transforms it into key-value pairs. By default, one map task runs for every input split, which usually corresponds to one HDFS block.
- Mention the basic parameters of a Mapper.
Ans. The basic parameters of a Mapper are its input and output key-value types. In the classic word-count case these are –
- LongWritable and Text (input key and value)
- Text and IntWritable (output key and value)
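For illustration, here is a minimal word-count style Mapper using those types (the class name TokenCountMapper is hypothetical):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key/value: LongWritable (byte offset) and Text (line).
// Output key/value: Text (word) and IntWritable (count).
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // emit (word, 1) for every token in the line
        }
    }
}
```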
- What is Hadoop streaming?
Ans. Hadoop Streaming is a generic utility/API that enables a user to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer, written in languages such as Python, Perl, or Ruby.
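A typical invocation, with hypothetical mapper.py and reducer.py scripts (the streaming jar location varies by Hadoop version and distribution), looks like –
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /user/hadoop/input -output /user/hadoop/output -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py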
- What is NAS?
Ans. NAS is the abbreviation for Network-Attached Storage (NAS). It is a file-level computer data storage server connected to a computer network, and it offers data access to a heterogeneous group of clients.
- What is Avro Serialization in Hadoop?
Ans. Avro Serialization in Hadoop is the process through which the states of objects or data structures are translated into a binary or textual form, so that the data can be transported over the network or stored on persistent storage. In Avro, serialization is also known as marshalling, and deserialization is called unmarshalling.
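A brief sketch of Avro serialization using the generic API (the record schema, field values, and class name are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializeDemo {
    public static void main(String[] args) throws Exception {
        // A hypothetical record schema with a single string field.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
            + "[{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "example");

        // Serialize (marshal) the record into Avro's compact binary form.
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();

        System.out.println("Serialized record into " + out.size() + " bytes");
    }
}
```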
- What is HDFS and what are its components?
Ans. HDFS, or the Hadoop Distributed File System, runs on commodity hardware and is highly fault-tolerant. HDFS provides file permissions and authentication and is suitable for distributed storage and processing. It is composed of three elements: NameNode, DataNode, and Secondary NameNode.
- What is FSCK?
Ans. FSCK (File System Check) is a command used by HDFS. It checks whether any file is corrupt, whether blocks are under-replicated or missing replicas, and whether any blocks are missing entirely. FSCK generates a summary report that lists the overall health of the file system.
- What are the differences between NAS and HDFS?
Ans. The differences between NAS and HDFS are:
| NAS | HDFS |
| --- | --- |
| Runs on a single machine | Runs on a cluster of different machines |
| No data redundancy | Data redundancy due to the replication protocol |
| Stores data on dedicated hardware | Data blocks are distributed across the local drives of the cluster machines |
| Does not work with Hadoop MapReduce | Works with Hadoop MapReduce |
- What happens when multiple clients try to write on the same HDFS file?
Ans. Multiple users cannot write to the same HDFS file at the same time. When the first client opens the file for writing, the NameNode grants it a lease on that file; write requests from a second client are rejected because HDFS supports exclusive writes only.
- Explain active and passive “NameNodes”?
Ans. A NameNode maintains all the metadata information of the data nodes. There are two NameNodes in a HA (High Availability) architecture, namely Active NameNode and Passive or Standby NameNode.
The Active NameNode works and runs in the cluster, while the Passive (Standby) NameNode holds the same metadata as the Active NameNode. If the Active NameNode fails, the Passive NameNode takes over its role in the cluster. Thus, the cluster is never left without a NameNode, and there is no single point of failure.
- How does the NameNode handle DataNode failures in Hadoop?
Ans. HDFS has a master-slave architecture in which the NameNode is the master and the DataNodes are the slaves. The NameNode periodically receives a Heartbeat signal from each DataNode in the cluster, indicating that the DataNode is functioning properly.
A block report contains the list of all the blocks on a DataNode. If a DataNode stops sending heartbeats, it is marked dead or non-functional after a configurable timeout (10 minutes and 30 seconds with the default settings). Once a DataNode is declared dead, the NameNode re-replicates the blocks it held to other DataNodes, using the replicas created earlier.
- What is the use of dfsadmin -refreshNodes and rmadmin -refreshNodes commands?
Ans. The uses of the dfsadmin -refreshNodes and rmadmin -refreshNodes commands are:
- The dfsadmin -refreshNodes command is run against the NameNode; it makes the NameNode re-read its include/exclude host files so that newly commissioned or decommissioned DataNodes are recognized without a restart.
- The rmadmin -refreshNodes command carries out the same administrative task for the ResourceManager, refreshing its list of NodeManager hosts.
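For reference, these commands are issued through the hdfs and yarn command-line tools –
hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes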
- Which command will you use to copy data from the local system onto HDFS?
Ans. The following command is used to copy data from the local system onto HDFS:
- The hadoop fs -copyFromLocal command copies a file from the local file system to HDFS.
- Format: hadoop fs -copyFromLocal [source] [destination]
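- Example (with hypothetical paths): hadoop fs -copyFromLocal /home/user/sales.csv /user/hadoop/sales.csv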
- Which commands will you use to find the status of blocks and FileSystem health?
Ans. The following command is used to check the status of the blocks:
- hdfs fsck <path> -files -blocks
The following command is used to check the health status of the FileSystem:
- hdfs fsck / -files -blocks -locations > dfs-fsck.log
- What is Hadoop MapReduce?
Ans. Hadoop MapReduce is a framework used to process large data sets in parallel across a Hadoop cluster.
- How does the Hadoop MapReduce function?
Ans. When a MapReduce job is in progress, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, including issuing tasks, verifying task completion, and copying data between the nodes, and then aggregates the results.
- Name Hadoop-specific data types that are used in a MapReduce program.
Ans. Some Hadoop-specific data types that are used in your MapReduce program are:
- IntWritable
- FloatWritable
- ArrayWritable
- DoubleWritable
- MapWritable
- ObjectWritable
- BooleanWritable
- LongWritable
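A small sketch of how these Writable wrappers behave in plain Java, outside any job (the class name and values used here are arbitrary):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Writables wrap Java values with Hadoop's own serialization mechanism.
        IntWritable count = new IntWritable(42);
        Text name = new Text("hadoop");

        // MapWritable holds Writable keys and values, e.g. per-key counters.
        MapWritable counters = new MapWritable();
        counters.put(name, count);

        System.out.println(name + " -> " + ((IntWritable) counters.get(name)).get());
    }
}
```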
- Name the major configuration parameters required in a MapReduce program.
Ans. The following are the major configuration parameters in a MapReduce program:
- Input location of the jobs in HDFS
- Output location of the jobs in HDFS
- The input format of data
- The output format of data
- Classes containing a map function
- Classes containing a reduce function
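These parameters come together in the job driver. Below is a minimal sketch that wires them up, assuming the word-count Mapper shown earlier and a hypothetical TokenCountReducer class:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Classes containing the map and reduce functions.
        job.setMapperClass(TokenCountMapper.class);
        job.setReducerClass(TokenCountReducer.class);   // hypothetical reducer class

        // Input and output formats of the data.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations of the job in HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```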
- What is Apache YARN?
Ans. YARN is an integral part of Hadoop 2.0 and is an abbreviation for Yet Another Resource Negotiator. It is a resource management layer of Hadoop and allows different data processing engines like graph processing, interactive processing, stream processing, and batch processing to run and process data stored in HDFS.
- Name the main components of Apache YARN.
Ans. ResourceManager and NodeManager are the two main components of YARN.
- Name various Hadoop and YARN daemons.
Ans. The Hadoop (HDFS) daemons are –
- NameNode
- DataNode
- Secondary NameNode
The YARN daemons are –
- ResourceManager
- NodeManager
- JobHistoryServer
- What is the standard path for Hadoop Sqoop scripts?
Ans. The standard path for Hadoop Sqoop scripts is –
/usr/bin/sqoop
- What is the main difference between Sqoop and distCP?
Ans. distCp is used for transferring data between Hadoop clusters, while Sqoop is used only for transferring data between Hadoop and relational databases (RDBMS).
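For illustration, hedged examples of each (the connection string, table name, and paths are placeholders) –
sqoop import --connect jdbc:mysql://dbhost/sales --table orders --username dbuser --target-dir /user/hadoop/orders
hadoop distcp hdfs://cluster1:8020/data hdfs://cluster2:8020/data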
- Explain the different components of a Hive architecture?
Ans. The different components of Hive architecture are:
- User Interface: It offers an interface between the user and the hive. It enables users to submit queries to the system. The user interface creates a session handle to the query and sends it to the compiler to generate an execution plan for it.
- Compiler: It generates the execution plan.
- Execution Engine: It acts as a bridge between Hive and Hadoop and processes the query by executing the plan produced by the compiler.
- Metastore: It stores the metadata information about tables and partitions and sends it to the compiler when the compiler issues a metadata request during query compilation.
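As a hedged sketch, a client typically drives this pipeline through HiveServer2's JDBC interface (requires the hive-jdbc driver on the classpath; the host, port, credentials, and table name below are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // The submitted query passes through Hive's compiler, metastore lookups,
        // and execution engine before results are returned to the client.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver-host:10000/default", "hiveuser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}
```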
- Name the components used in Hive query processors?
Ans. The components used in Hive query processors are:
- Parser
- Optimizer
- Operators
- Execution Engine
- Semantic Analyzer
- User-Defined Functions
- Logical Plan Generation
- Physical Plan Generation
- What are the major components of Hive?
Ans. Hive consists of three major components:
- Clients
- Services
- Storage and Computing
- Explain the key components of HBase?
Ans. The main/key components of HBase are:
- Region Server
It hosts the regions of HBase tables, which are split horizontally into regions based on their row key ranges. Each region server is a worker node and handles the read, write, update, and delete requests from clients.
- HMaster
It assigns regions to RegionServers for load balancing and monitors all the RegionServer instances in the cluster. It also handles schema changes and other metadata operations requested by clients.
- ZooKeeper
It offers a distributed coordination service to maintain the server state in the cluster. It identifies the servers that are alive and available and provides server failure notifications.
- Name the different operational commands in HBase at the record level and table level?
Ans. The operational commands in HBase are:
Record Level Operational Commands:
- Get
- Put
- Scan
- Increment
- Delete
Table Level Operational Commands:
- List
- Drop
- Describe
- Disable
- Enable
- Name some data manipulation commands of HBase.
Ans. The data manipulation commands of HBase are:
- Count
- Get
- Put
- Delete
- Deleteall
- Scan
- Truncate
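These shell commands correspond to operations in the HBase Java client API. A minimal sketch (the table name, column family, and ZooKeeper quorum are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");  // placeholder quorum

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table

            // put: write a cell into column family "cf", qualifier "name"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // get: read the row back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));

            // delete: remove the row (internally written as a tombstone marker)
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}
```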
- What are the different types of tombstone markers in HBase for deletion?
Ans. The three types of tombstone markers in HBase for deletion are:
- Family Delete Marker: Marks all the columns of a column family for deletion
- Version Delete Marker: Marks a single version of a column for deletion
- Column Delete Marker: Marks all the versions of a column for deletion