Hadoop Admin Interview Q & A

Part 1:
1) What daemons are needed to run a Hadoop cluster?
DataNode, NameNode, TaskTracker, and JobTracker are required to run a Hadoop cluster.
2) Which OS are supported by Hadoop deployment?
The main OS used for Hadoop is Linux. However, with some additional software, it can also be deployed on the Windows platform.
3) What are the common Input Formats in Hadoop?
Three widely used input formats are:
- Text Input Format: It is the default input format in Hadoop.
- Key Value Input Format: It is used for plain text files.
- Sequence File Input Format: It is used for reading files in sequence.
4) What modes can Hadoop code be run in?
Hadoop can be deployed in
- Standalone mode
- Pseudo-distributed mode
- Fully distributed mode.
5) What is the main difference between RDBMS and Hadoop?
RDBMS is used for transactional systems to store and process data, whereas Hadoop can be used to store and process huge amounts of data.
6) What are the important hardware requirements for a Hadoop cluster?
There are no specific hardware requirements for the data nodes.
However, the namenode needs a specific amount of RAM to store the filesystem image in memory. How much depends on the particular design of the primary and secondary namenode.
7) How would you deploy different components of Hadoop in production?
You need to deploy the JobTracker and NameNode on the master node and then deploy DataNodes on multiple slave nodes.
8) What do you need to do as Hadoop admin after adding new datanodes?
The Hadoop cluster will detect new datanodes automatically. However, to optimize cluster performance, you should start the balancer so that data is redistributed evenly across all datanodes.
9) Which Hadoop shell commands can be used for copy operations?
The copy operation commands are (examples follow the list):
fs -copyToLocal
fs -put
fs -copyFromLocal
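For illustration, here are example invocations (the local and HDFS paths are placeholders):
hadoop fs -put /home/user/data.txt /user/hadoop/data.txt (copies a local file into HDFS)
hadoop fs -copyFromLocal /home/user/data.txt /user/hadoop/ (same effect as -put, but restricted to local sources)
hadoop fs -copyToLocal /user/hadoop/data.txt /home/user/ (copies a file from HDFS back to the local filesystem)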
10) What is the importance of the namenode?
The role of the namenode is crucial in Hadoop; it is the brain of the cluster. It is largely responsible for managing the distribution of blocks across the system, and it supplies the addresses of the relevant data blocks when a client makes a request.
11) Explain how you will restart a NameNode?
The easiest way of doing this is to run the scripts that stop and start all the daemons: run stop-all.sh, then restart the NameNode by running start-all.sh.
12) What happens when the NameNode is down?
If the NameNode is down, the file system goes offline.
13) Is it possible to copy files between different clusters? If yes, How can you achieve this?
Yes, we can copy files between multiple Hadoop clusters. This can be done using distributed copy.
14) Is there any standard method to deploy Hadoop?
No, there is no standard procedure to deploy Hadoop. There are a few general requirements for all Hadoop distributions, but the specific methods will always differ for each Hadoop admin.
15) What is distcp?
Distcp is a Hadoop copy utility. It runs a MapReduce job to copy data. One of the key challenges in a Hadoop environment is copying data across clusters, and distcp uses multiple datanodes to copy the data in parallel.
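For example, a typical invocation (the namenode hostnames and paths are placeholders):
hadoop distcp hdfs://namenode1:8020/source/path hdfs://namenode2:8020/destination/path
This launches a MapReduce job whose map tasks copy the files in parallel between the two clusters.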
16) What is a checkpoint?
Checkpointing is the process of taking an FsImage and edit log and compacting them into a new FsImage. Therefore, instead of replaying the edit log, the NameNode can load its final in-memory state directly from the FsImage. This is a far more efficient operation that reduces NameNode startup time.
17) What is rack awareness?
It is a method that decides where to place blocks based on the rack definitions. Hadoop tries to limit network traffic between datanodes within the same rack and only contacts remote racks when it has to.
18) What is the use of ‘jps’ command?
The ‘jps’ command helps us check whether the Hadoop daemons are running or not. It displays all the Hadoop daemons running on the machine, such as the namenode, datanode, node manager, resource manager, etc.
19) Name some of the essential Hadoop tools for effective working with Big Data?
Hive, HBase, HDFS, ZooKeeper, NoSQL, Lucene/Solr, Avro, Oozie, Flume, Clouds, and SQL are some of the Hadoop tools that enhance the performance of Big Data.
20) How many times do you need to reformat the namenode?
The namenode only needs to be formatted once, at the beginning. After that it is never formatted again; in fact, reformatting the namenode can lead to loss of the data on the entire namenode.
21) What is speculative execution?
If a node is executing a task slower than expected, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted and the other is killed. This process is known as “speculative execution.”
22) What is Big Data?
Big data is a term that describes large volumes of data. Big data can be used to make better decisions and strategic business moves.
23) What is Hadoop and its components?
When “Big Data” emerged as a problem, Hadoop evolved as a solution for it. It is a framework which provides various services or tools to store and process Big Data. It also helps to analyze Big Data and to make business decisions which are difficult using the traditional method.
24) What are the essential features of Hadoop?
The Hadoop framework has the competence to solve many questions for Big Data analysis. It is designed on Google MapReduce, which is based on Google’s Big Data file systems.
25) What is the main difference between an “Input Split” and “HDFS Block”?
The “Input Split” is the logical division of the data, while the “HDFS Block” is the physical division of the data.
Part 2:
1) How will you decide whether you need to use the Capacity Scheduler or the Fair Scheduler?
Fair Scheduling is the process in which resources are assigned to jobs such that all jobs get an equal share of resources over time. The Fair Scheduler can be used under the following circumstances –
- i) If you want the jobs to make equal progress instead of following the FIFO order, then you must use Fair Scheduling.
- ii) If you have slow connectivity and data locality plays a vital role and makes a significant difference to the job runtime, then you must use Fair Scheduling.
- iii) Use Fair Scheduling if there is a lot of variability in the utilization between pools.
The Capacity Scheduler allows running the Hadoop MapReduce cluster as a shared, multi-tenant cluster to maximize the utilization and throughput of the Hadoop cluster. The Capacity Scheduler can be used under the following circumstances –
- i) If the jobs require scheduler determinism, then the Capacity Scheduler can be useful.
- ii) The Capacity Scheduler’s memory-based scheduling method is useful if the jobs have varying memory requirements.
- iii) If you want to enforce resource allocation because you know the cluster utilization and workload very well, then use the Capacity Scheduler.
2) What are the daemons required to run a Hadoop cluster?
NameNode, DataNode, TaskTracker and JobTracker
3) How will you restart a NameNode?
The easiest way of doing this is to run the command to stop the running shell script, i.e. run stop-all.sh. Once this is done, restart the NameNode by running start-all.sh.
4) Explain about the different schedulers available in Hadoop.
- FIFO Scheduler – This scheduler does not consider the heterogeneity in the system but orders the jobs based on their arrival times in a queue.
- COSHH – This scheduler considers the workload, cluster, and user heterogeneity for scheduling decisions.
- Fair Sharing – This Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource. Each user can use their own pool to execute jobs.
5) List few Hadoop shell commands that are used to perform a copy operation.
- fs -put
- fs -copyToLocal
- fs -copyFromLocal
6) What is jps command used for?
jps command is used to verify whether the daemons that run the Hadoop cluster are working or not. The output of jps command shows the status of the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker.
7) What are the important hardware considerations when deploying Hadoop in production environment?
- Memory-System’s memory requirements will vary between the worker services and management services based on the application.
- Operating System – a 64-bit operating system avoids any restrictions to be imposed on the amount of memory that can be used on worker nodes.
- Storage- It is preferable to design a Hadoop platform by moving the compute activity to data to achieve scalability and high performance.
- Capacity – Large Form Factor (3.5”) disks cost less and allow storing more data than Small Form Factor disks.
- Network – Two TOR switches per rack provide better redundancy.
- Computational Capacity- This can be determined by the total number of MapReduce slots available across all the nodes within a Hadoop cluster.
8) How many NameNodes can you run on a single Hadoop cluster?
Only one.
9) What happens when the NameNode on the Hadoop cluster goes down?
The file system goes offline whenever the NameNode is down.
10) What is the conf/hadoop-env.sh file and which variable in the file should be set for Hadoop to work?
This file provides an environment for Hadoop to run and consists of the following variables-HADOOP_CLASSPATH, JAVA_HOME and HADOOP_LOG_DIR. JAVA_HOME variable should be set for Hadoop to run.
11) Apart from using the jps command is there any other way that you can check whether the NameNode is working or not.
Use the command: /etc/init.d/hadoop-0.20-namenode status.
12) In a MapReduce system, if the HDFS block size is 64 MB and there are 3 files of size 127 MB, 64 KB and 65 MB with FileInputFormat, how many input splits are likely to be made by the Hadoop framework?
2 splits each for 127 MB and 65 MB files and 1 split for the 64KB file.
13) Which command is used to verify if the HDFS is corrupt or not?
The Hadoop fsck (File System Check) command is used to check the HDFS file system for missing or corrupt blocks.
14) List some use cases of the Hadoop Ecosystem
Text Mining, Graph Analysis, Semantic Analysis, Sentiment Analysis, Recommendation Systems.
15) How can you kill a Hadoop job?
hadoop job -kill jobID
16) I want to see all the jobs running in a Hadoop cluster. How can you do this?
Use the command hadoop job -list, which gives the list of jobs running in a Hadoop cluster.
17) Is it possible to copy files across multiple clusters? If yes, how can you accomplish this?
Yes, it is possible to copy files across multiple Hadoop clusters, and this can be achieved using distributed copy. The DistCp command is used for intra- or inter-cluster copying.
18) Which is the best operating system to run Hadoop?
Ubuntu or another Linux distribution is the most preferred operating system to run Hadoop. Windows can also be used to run Hadoop, but it leads to several problems and is not recommended.
19) What are the network requirements to run Hadoop?
- SSH is required – it is used to launch server processes on the slave nodes.
- A passwordless SSH connection is required between the master, the secondary machine, and all the slaves; a minimal setup sketch follows.
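A minimal sketch, where the user and hostnames are placeholders:
ssh-keygen -t rsa (generate a key pair on the master, accepting the default file location and an empty passphrase)
ssh-copy-id hadoop@slave1 (append the public key to the slave's authorized_keys; repeat for every slave and the secondary machine)
ssh hadoop@slave1 (verify that login now works without a password prompt)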
20) The mapred.output.compress property is set to true to make sure that all output files are compressed for efficient space usage on the Hadoop cluster. Suppose that, under a particular condition, a cluster user does not require compressed data for a job. What would you suggest they do?
If the user does not want to compress the data for a particular job, then they should create their own configuration file and set the mapred.output.compress property to false. This configuration file should then be loaded as a resource into the job.
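One hedged way to do this without touching cluster-wide files is to pass the property as a generic option when submitting that particular job (the jar name, driver class, and paths are placeholders, and this assumes the driver uses ToolRunner/GenericOptionsParser so that -D properties are picked up):
hadoop jar my-job.jar MyDriver -D mapred.output.compress=false /input_path /output_path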
21) What is the best practice to deploy a secondary NameNode?
It is always better to deploy a secondary NameNode on a separate standalone machine. When the secondary NameNode is deployed on a separate machine it does not interfere with the operations of the primary node.
22) How often should the NameNode be reformatted?
The NameNode should never be reformatted. Doing so will result in complete data loss. NameNode is formatted only once at the beginning after which it creates the directory structure for file system metadata and namespace ID for the entire file system.
23) If Hadoop spawns 100 tasks for a job and one of the job fails. What does Hadoop do?
The task will be started again on a new TaskTracker, and if it fails more than four times, which is the default setting (the default value can be changed), the job will be killed.
24) How can you add and remove nodes from the Hadoop cluster?
- To add new nodes to the HDFS cluster, the hostnames should be added to the slaves file and then the DataNode and TaskTracker should be started on the new node.
- To remove or decommission nodes from the HDFS cluster, the hostnames should be removed from the slaves file and -refreshNodes should be executed.
25) You increase the replication level but notice that the data is under replicated. What could have gone wrong?
Nothing has necessarily gone wrong. If there is a huge volume of data, replication takes time proportional to the data size, because the cluster has to copy the data, and it might take a few hours.
26) Explain about the different configuration files and where are they located.
The configuration files are located in “conf” sub directory. Hadoop has 3 different Configuration files- hdfs-site.xml, core-site.xml and mapred-site.xml
Part 3:
- Name the daemons required to run a Hadoop cluster?
Daemons required to run a Hadoop cluster:
Daemon | Description |
DataNode | It stores the data in the Hadoop File System, which contains more than one DataNode, with data replicated across them |
NameNode | It is the core of HDFS; it keeps the directory tree of all files present in the file system and tracks where the file data is kept across the cluster |
SecondaryNameNode | It is a specially dedicated node in the HDFS cluster that keeps checkpoints of the file system metadata present on the NameNode |
NodeManager | It is responsible for launching and managing containers on a node, which execute tasks as specified by the AppMaster |
ResourceManager | It is the master that helps manage the distributed applications running on the YARN system by arbitrating all the available cluster resources |
- How do you read a file from HDFS?
The following are the steps for doing this:
- The client uses a Hadoop client program to make the request.
- The client program reads the cluster config file on the local machine, which tells it where the namenode is located. This has to be configured ahead of time.
- The client contacts the NameNode and requests the file it would like to read.
- Client validation is checked by username or by strong authentication mechanism like Kerberos.
- The client’s validated request is checked against the owner and permissions of the file.
- If the file exists and the user has access to it, then the NameNode responds with the first block ID and provides a list of datanodes where a copy of the block can be found, sorted by their distance to the client (reader).
- The client now contacts the most appropriate datanode directly and reads the block data. This process repeats until all blocks in the file have been read or the client closes the file stream.
If the datanode dies while the file is being read, the library will automatically attempt to read another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. If the block location information returned by the NameNode is outdated by the time the client attempts to contact a datanode, a retry will occur if there are other replicas, or else the read will fail.
- Explain checkpointing in Hadoop and why is it important?
Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient Namenode recovery and restart and is an important indicator of overall cluster health.
The namenode persists filesystem metadata. At a high level, the namenode’s primary responsibility is to store the HDFS namespace, meaning things like the directory tree, file permissions, and the mapping of files to block IDs. It is essential that this metadata is safely persisted to stable storage for fault tolerance.
This filesystem metadata is stored in two different parts: the fsimage and the edit log. The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments that together represent all the namesystem modifications made since the creation of the fsimage.
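As a side note, an administrator can also force a checkpoint manually with the standard dfsadmin commands; a hedged example:
hdfs dfsadmin -safemode enter (the namespace must be read-only before saving)
hdfs dfsadmin -saveNamespace (writes the current in-memory namespace to a new fsimage and resets the edit log)
hdfs dfsadmin -safemode leave (returns the cluster to normal operation)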
- What is default block size in HDFS and what are the benefits of having smaller block sizes?
Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64MB – and larger. This allows HDFS to decrease the amount of metadata storage required per file. Furthermore, it allows fast streaming reads of data, by keeping large amounts of data sequentially organized on the disk. As a result, HDFS is expected to have very large files that are read sequentially. Unlike a file system such as NTFS or EXT which has numerous small files, HDFS stores a modest number of very large files: hundreds of megabytes, or gigabytes each.
- What are two main modules which help you interact with HDFS and what are they used for?
user@machine:hadoop$ bin/hadoop moduleName -cmd args...
The moduleName tells the program which subset of Hadoop functionality to use, -cmd is the name of a specific command within this module to execute, and its arguments follow the command name.
The two modules relevant to HDFS are: dfs and dfsadmin.
The dfs module, also known as ‘FsShell’, provides basic file manipulation operations and works with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole.
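For example (the HDFS path is a placeholder):
bin/hadoop dfs -ls /user/hadoop (dfs module: lists the files under an HDFS directory)
bin/hadoop dfsadmin -report (dfsadmin module: reports capacity, usage, and datanode status for the file system as a whole)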
- How can I setup Hadoop nodes (data nodes/namenodes) to use multiple volumes/disks?
Datanodes can store blocks in multiple directories, typically located on different local disk drives. In order to set up multiple directories, one needs to specify a comma-separated list of pathnames as the value of the config parameter dfs.data.dir/dfs.datanode.data.dir. Datanodes will attempt to place an equal amount of data in each of the directories.
The namenode also supports multiple directories, which store the namespace image and edit logs. In order to set up multiple directories, one needs to specify a comma-separated list of pathnames as the value of the config parameter dfs.name.dir/dfs.namenode.name.dir. The namenode directories are used for namespace data replication, so that the image and log can be restored from the remaining disks/volumes if one of the disks fails. A hedged check of the current values is shown below.
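For example, using the getconf utility:
hdfs getconf -confKey dfs.datanode.data.dir (prints the comma-separated list of datanode data directories)
hdfs getconf -confKey dfs.namenode.name.dir (prints the namenode metadata directories)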
- What are schedulers and what are the three types of schedulers that can be used in Hadoop cluster?
Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the jobtracker. The three types of schedulers are:
- FIFO (First in First Out) Scheduler
- Fair Scheduler
- Capacity Scheduler
- How do you decide which scheduler to use?
The CS scheduler can be used under the following situations:
- When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
- When you have very little fluctuation within queue utilization. The CS’s more rigid resource allocation makes sense when all queues are at capacity almost all the time.
- When you have high variance in the memory requirements of jobs and you need the CS’s memory-based scheduling support.
- When you demand scheduler determinism.
The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
- When you have a slow network and data locality makes a significant difference to a job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
- When you have a lot of variability in the utilization between pools, the Fair Scheduler’s pre-emption model affects much greater overall cluster utilization by giving away otherwise reserved resources when they’re not used.
- When you require jobs within a pool to make equal progress rather than running in FIFO order.
- Why are ‘dfs.name.dir’ and ‘dfs.data.dir’ parameters used ? Where are they specified and what happens if you don’t specify these parameters?
dfs.name.dir specifies the path of the directory in the Namenode’s local file system where HDFS metadata is stored, and dfs.data.dir specifies the path of the directory in the Datanode’s local file system where HDFS file blocks are stored. These parameters are specified in the hdfs-site.xml config file of all nodes in the cluster, including master and slave nodes.
If these parameters are not specified, the namenode’s metadata and the datanodes’ file blocks are stored in /tmp under a HADOOP-USERNAME directory. This is not a safe place: when nodes are restarted, data will be lost, and this is critical if the Namenode is restarted, as its formatting information will be lost.
- What is file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?
The FileSystem checking utility fsck is used to check and display the health of the file system and the files and blocks in it. When used with a path (bin/hadoop fsck /path -files -blocks -locations -racks) it recursively shows the health of all files under that path, and when used with ‘/’, it checks the entire file system. By default, fsck ignores files still open for writing by a client. To list such files, run fsck with the -openforwrite option.
fsck checks the file system, prints a dot for each healthy file found, and prints a message for the ones that are less than healthy, including the ones which have over-replicated blocks, under-replicated blocks, mis-replicated blocks, corrupt blocks, and missing replicas.
- What are the important configuration files that need to be updated/edited to setup a fully distributed mode of Hadoop cluster 1.x ( Apache distribution)?
The Configuration files that need to be updated to setup a fully distributed mode of Hadoop are:
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- masters
- slaves
These files can be found in your Hadoop conf directory. If Hadoop daemons are started individually using ‘bin/hadoop-daemon.sh start xxxxxx’, where xxxxxx is the name of the daemon, then the masters and slaves files need not be updated and can be empty. This way of starting daemons requires the command to be issued on the appropriate nodes to start the appropriate daemons. If Hadoop daemons are started using ‘bin/start-dfs.sh’ and ‘bin/start-mapred.sh’, then the masters and slaves configuration files on the namenode machine need to be updated.
Masters – IP address/hostname of the node where the secondarynamenode will run.
Slaves – IP address/hostname of the nodes where the datanodes (and eventually the tasktrackers) will run.
Part 4:
- RDBMS vs Hadoop?
Name | RDBMS | Hadoop |
Data volume | RDBMS cannot store and process a large amount of data | Hadoop works better for large amounts of data. It can easily store and process a large amount of data compared to RDBMS. |
Throughput | RDBMS fails to achieve a high Throughput | Hadoop achieves high Throughput |
Data variety | The schema of the data is known in RDBMS, and it always depends on structured data. | Hadoop stores any kind of data, whether structured, unstructured, or semi-structured. |
Data processing | RDBMS supports OLTP(Online Transactional Processing) | Hadoop supports OLAP(Online Analytical Processing) |
Read/Write Speed | Reads are fast in RDBMS because the schema of the data is already known. | Writes are fast in Hadoop because no schema validation happens during HDFS write. |
Schema on reading Vs Write | RDBMS follows schema on write policy | Hadoop follows the schema on reading policy |
Cost | RDBMS is a licensed software | Hadoop is a free and open-source framework |
- Explain big data and its characteristics?
Big Data refers to a large amount of data that exceeds the processing capacity of conventional database systems and requires a special parallel processing mechanism. This data can be either structured or unstructured data.
Characteristics of Big Data:
- Volume – It represents the amount of data, which is increasing at an exponential rate, i.e. into gigabytes, petabytes, exabytes, etc.
- Velocity – Velocity refers to the rate at which data is generated, modified, and processed. At present, social media is a major contributor to the velocity of growing data.
- Variety – It refers to data coming from a variety of sources like audio, video, CSV, etc. It can be structured, unstructured, or semi-structured.
- Veracity – Veracity refers to imprecise or uncertain data.
- Value – This is the most important element of big data. It is about accessing the data and delivering quality analytics to the organization, providing fair market value for the technology used.
- What is Hadoop and list its components?
Hadoop is an open-source framework used for storing large data sets and running applications across clusters of commodity hardware.
It offers extensive storage for any type of data and can handle endless parallel tasks.
Core components of Hadoop:
- Storage unit– HDFS (DataNode, NameNode)
- Processing framework– YARN (NodeManager, ResourceManager)
- What is YARN and explain its components?
Yet Another Resource Negotiator (YARN) is one of the core components of Hadoop and is responsible for managing resources for the various applications operating in a Hadoop cluster, and also schedules tasks on different cluster nodes.
YARN components:
- Resource Manager – It runs on a master daemon and controls the resource allocation in the cluster.
- Node Manager – It runs on a slave daemon and is responsible for the execution of tasks for each single Data Node.
- Application Master – It maintains the user job lifecycle and resource requirements of individual applications. It operates along with the Node Manager and controls the execution of tasks.
- Container – It is a combination of resources such as Network, HDD, RAM, CPU, etc., on a single node.
- What is the difference between a regular file system and HDFS?
Regular File Systems | HDFS |
A small block size of data (like 512 bytes) | Large block size (on the order of 64 MB) |
Multiple disk seeks for large files | Reads data sequentially after a single seek |
- What are the Hadoop daemons and explain their roles in a Hadoop cluster?
Generally, the daemon is nothing but a process that runs in the background. Hadoop has five such daemons. They are:
- NameNode – It is the master node, responsible for storing the metadata of all the directories and files.
- DataNode – It is the slave node, responsible for storing the actual data.
- Secondary NameNode – It is responsible for checkpointing the NameNode metadata and stores information about the data nodes, like data node properties, addresses, and block reports of each data node.
- JobTracker – It is used for creating and running jobs. It runs on the master node and allocates jobs to TaskTrackers.
- TaskTracker – It operates on the data nodes. It runs the tasks and reports their status to the JobTracker.
- What is Avro Serialization in Hadoop?
- The process of translating the state of objects or data structures into binary or textual form is called Avro Serialization. It is defined by a language-independent schema (written in JSON).
- It provides AvroMapper and AvroReducer for running MapReduce programs.
- How can you skip the bad records in Hadoop?
Hadoop provides the SkipBadRecords class for skipping bad records while processing map inputs.
- Explain HDFS and its components?
- HDFS (Hadoop Distributed File System) is the primary data storage unit of Hadoop.
- It stores various types of data as blocks in a distributed environment and follows master and slave topology.
HDFS components:
- NameNode – It is the master node and is responsible for maintaining the metadata information for the blocks of data stored in HDFS. It manages all the DataNodes.
Ex: replication factors, block location, etc.
- DataNode – It is the slave node and responsible for storing data in the HDFS.
- What are the features of HDFS?
- Supports storage of very large datasets
- Write once read many access model
- Streaming data access
- Replication using commodity hardware
- HDFS is highly Fault Tolerant
- Distributed Storage
- What is the HDFS block size?
By default, the HDFS block size is 128MB for Hadoop 2.x.
- What is the default replication factor?
- The replication factor is the number of times each block of a file is replicated (copied) across the cluster.
- The default replication factor is 3 (a hedged example of changing replication follows).
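The default comes from the dfs.replication property in hdfs-site.xml. The replication of an existing file can also be changed from the shell; for example (the path is a placeholder):
hadoop fs -setrep -w 3 /user/hadoop/data.txt (sets the replication factor of the file to 3 and waits until re-replication completes)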
- List the various HDFS Commands?
The various HDFS commands are listed below; example invocations follow the list.
- version
- mkdir
- ls
- put
- copyFromLocal
- get
- copyToLocal
- cat
- mv
- cp
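A few example invocations of the commands above (all paths are placeholders):
hadoop version (prints the Hadoop version)
hadoop fs -mkdir /user/hadoop/dir1 (creates a directory in HDFS)
hadoop fs -ls /user/hadoop (lists the contents of an HDFS directory)
hadoop fs -cat /user/hadoop/dir1/file.txt (prints the contents of an HDFS file)
hadoop fs -mv /user/hadoop/dir1/file.txt /user/hadoop/dir2/ (moves a file within HDFS)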
- Compare HDFS (Hadoop Distributed File System) and NAS (Network Attached Storage)?
HDFS | NAS |
It is a distributed file system used for storing data by commodity hardware. | It is a file-level computer data storage server connected to a computer network, provides network access to a heterogeneous group of clients. |
It uses commodity hardware, which is cost-effective | NAS is a high-end storage device that comes at a high cost. |
It is designed to work for the MapReduce paradigm. | It is not suitable for MapReduce. |
- What are the limitations of Hadoop 1.0?
- NameNode: No Horizontal Scalability and No High Availability
- Job Tracker: Overburdened.
- MRv1: It can only understand Map and Reduce tasks
- How to commission (adding) the nodes in the Hadoop cluster?
- Update the network addresses in the dfs.include and mapred.include files
- Update the NameNode: hadoop dfsadmin -refreshNodes
- Update the JobTracker: hadoop mradmin -refreshNodes
- Update the slaves file.
- Start the DataNode and NodeManager on the added node.
- How to decommission (removing) the nodes in the Hadoop cluster?
- Update the network addresses in the dfs.exclude and mapred.exclude files
- Update the NameNode: hadoop dfsadmin -refreshNodes
- Update the JobTracker: hadoop mradmin -refreshNodes
- Cross-check the Web UI; it will show “Decommissioning in Progress”
- Remove the nodes from the include files and then run: hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes.
- Remove the nodes from the slaves file.
- Compare Hadoop 1.x and Hadoop 2.x
Name | Hadoop 1.x | Hadoop 2.x |
1. NameNode | In Hadoop 1.x, NameNode is the single point of failure | In Hadoop 2.x, we have both Active and passive NameNodes. |
2. Processing | MRV1 (Job Tracker & Task Tracker) | MRV2/YARN (ResourceManager & NodeManager) |
- What is the difference between active and passive NameNodes?
- Active NameNode works and runs in the cluster.
- Passive NameNode has similar data as active NameNode and replaces it when it fails.
- How will you resolve the NameNode failure issue?
The following steps need to be executed to resolve the NameNode issue and make the Hadoop cluster up and running:
- Use the FsImage (file system metadata replica) to start a new NameNode.
- Now, configure DataNodes and clients, so that they can acknowledge the new NameNode, that is started.
- The new NameNode will start serving the client once it has completed loading the last checkpoint FsImage and enough block reports from the DataNodes.
- What is a Checkpoint Node in Hadoop?
Checkpoint Node is the new implementation of secondary NameNode in Hadoop. It periodically creates the checkpoints of filesystem metadata by merging the edits log file with FsImage file.
- List the different types of Hadoop schedulers.
- Hadoop FIFO scheduler
- Hadoop Fair Scheduler
- Hadoop Capacity Scheduler
- How to keep an HDFS cluster balanced?
It is not possible to completely prevent a cluster from becoming unbalanced. In order to balance the data nodes to within a certain threshold, use the Balancer tool. This tool tries to even out the block data distribution across the cluster.
- What is DistCp?
- DistCp is the tool used to copy large amounts of data to and from Hadoop file systems in parallel.
- It uses MapReduce to effect its distribution, reporting, recovery, and error handling.
- What is HDFS Federation?
- HDFS Federation enhances the present HDFS architecture through a clear separation of namespace and storage by enabling a generic block storage layer.
- It provides multiple namespaces in the cluster to improve scalability and isolation.
- What is HDFS High Availability?
HDFS High availability is introduced in Hadoop 2.0. It means providing support for multiple NameNodes to the Hadoop architecture.
- What is a rack-aware replica placement policy?
- Rack Awareness is the algorithm used by the NameNode to improve network traffic while reading/writing HDFS files in the Hadoop cluster.
- The NameNode chooses a DataNode in the same rack or a nearby rack to serve a read/write request. This concept of choosing closer data nodes based on rack information is called Rack Awareness.
- Consider a replication factor of 3 for data blocks on HDFS: for every block of data, two copies are stored on one rack, while the third copy is stored on a different rack. This rule is called the Replica Placement Policy.
- What is the main purpose of the Hadoop fsck command?
Hadoop fsck command is used for checking the HDFS file system.
There are different arguments that can be passed with this command to emit different results.
- hadoop fsck / -files: displays all the files in HDFS while checking.
- hadoop fsck / -files -blocks: displays all the blocks of the files while checking.
- hadoop fsck / -files -blocks -locations: displays the locations of all file blocks while checking.
- hadoop fsck / -files -blocks -locations -racks: displays the network topology for data-node locations.
- hadoop fsck / -delete: deletes the corrupted files in HDFS.
- hadoop fsck / -move: moves the corrupted files to the /lost+found directory.
- What is the purpose of a DataNode block scanner?
- The purpose of the DataNode block scanner is to periodically verify all the blocks stored on the DataNode.
- If bad blocks are detected, they are fixed before any client reads them.
- What is the purpose of the dfsadmin tool?
- The dfsadmin tool is used for examining the HDFS cluster status.
- The dfsadmin -report command produces useful information about basic statistics of the cluster, such as DataNode and NameNode status, disk capacity configuration, etc.
- It performs all the administrative tasks on the HDFS.
- What is the command used for printing the topology?
hdfs dfsadmin -printTopology is used for printing the topology. It displays the tree of racks and the DataNodes attached to those racks.
- What is RAID?
RAID (redundant array of independent disks) is a data storage virtualization technology used for improving performance and data redundancy by combining multiple disk drives into a single entity.
- Does Hadoop require RAID?
- On DataNodes, RAID is not necessary, as storage redundancy is achieved by replication between the nodes.
- RAID is recommended for the NameNode’s disks.
- List the various site-specific configuration files available in Hadoop?
- conf/hadoop-env.sh
- conf/yarn-site.xml
- conf/yarn-env.sh
- conf/mapred-site.xml
- conf/hdfs-site.xml
- conf/core-site.xml
- What is the main functionality of NameNode?
It is mainly responsible for:
- Namespace – Manages metadata of HDFS.
- Block Management – Processes and manages the block reports and their location.
- Which command is used to format the NameNode?
$ hdfs namenode -format
- How a client application interacts with the NameNode?
- Client applications use the Hadoop HDFS API to interact with the NameNode whenever they have to copy/move/add/locate/delete a file.
- The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data resides.
- The client can talk directly to a DataNode after the NameNode has given the location of the data.
- What is MapReduce and list its features?
MapReduce is a programming model used for processing and generating large datasets on the clusters with parallel and distributed algorithms.
The general syntax for running a MapReduce program is:
hadoop jar <jar_file.jar> <MainClass> /input_path /output_path
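For instance, the word count example that ships with Hadoop can be run as follows (the examples jar name varies by version, and the paths are placeholders):
hadoop jar hadoop-mapreduce-examples.jar wordcount /input_path /output_path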
- What are the features of MapReduce?
- Automatic parallelization and distribution.
- Built-in fault tolerance and redundancy are available.
- MapReduce Programming model is language independent
- Distributed programming complexity is hidden
- Enable data local processing
- Manages all the inter-process communication
- What does the MapReduce framework consist of?
The MapReduce framework is used to write applications for processing large data in parallel on large clusters of commodity hardware.
It consists of:
ResourceManager (RM)
- Global resource scheduler
- One master RM
NodeManager (NM)
- One slave NM per cluster node.
Container
- RM creates Containers upon request by AM
- The application runs in one or more containers
ApplicationMaster (AM)
- One AM per application
- Runs in Container
- What are the two main components of ResourceManager?
- Scheduler
It allocates the resources (containers) to various running applications based on resource availability and configured shared policy.
- ApplicationManager
It is mainly responsible for managing a collection of submitted applications
- What is a Hadoop counter?
Hadoop Counters measure the progress or track the number of operations that occur within a MapReduce job. Counters are useful for collecting statistics about MapReduce jobs for application-level or quality-control purposes.
- What are the main configuration parameters for a MapReduce application?
The job configuration requires the following:
- Job’s input and output locations in the distributed file system
- The input format of data
- The output format of data
- The class containing the map function and the reduce function
- JAR file containing the reducer, driver, and mapper classes
- What are the steps involved to submit a Hadoop job?
Steps involved in Hadoop job submission:
- Hadoop job client submits the job jar/executable and configuration to the ResourceManager.
- ResourceManager then distributes the software/configuration to the slaves.
- The ResourceManager then schedules the tasks and monitors them.
- Finally, job status and diagnostic information are provided to the client.
- How does the MapReduce framework view its input internally?
It views the input data set as a set of <key, value> pairs and processes the map tasks in a completely parallel manner.
- What are the basic parameters of Mapper?
The basic parameters of Mapper are listed below:
- LongWritable and Text
- Text and IntWritable
- What are Writables and explain their importance in Hadoop?
- Writables are interfaces in Hadoop. They act as a wrapper class to almost all the primitive data types of Java.
- A serializable object which executes a simple and efficient serialization protocol, based on DataInput and DataOutput.
- Writables are used for creating serialized data types in Hadoop.
- Why comparison of types is important for MapReduce?
- It is important for MapReduce as in the sorting phase the keys are compared with one another.
- For a Comparison of types, the WritableComparable interface is implemented.
- What is “speculative execution” in Hadoop?
In Apache Hadoop, if nodes do not fix or diagnose the slow-running tasks, the master node can redundantly perform another instance of the same task on another node as a backup (the backup task is called a Speculative task). This process is called Speculative Execution in Hadoop.
- What are the methods used for restarting the NameNode in Hadoop?
The methods used for restarting the NameNodes are the following:
- You can use the /sbin/hadoop-daemon.sh stop namenode command to stop the NameNode individually and then start it again using /sbin/hadoop-daemon.sh start namenode.
- Use /sbin/stop-all.sh followed by /sbin/start-all.sh to stop all the daemons first and then start them all again.
These script files are stored in the sbin directory inside the Hadoop directory store.
- What is the difference between an “HDFS Block” and “MapReduce Input Split”?
- HDFS Block is the physical division of the disk which has the minimum amount of data that can be read/write, while MapReduce InputSplit is the logical division of data created by the InputFormat specified in the MapReduce job configuration.
- HDFS divides data into blocks, whereas MapReduce divides data into input split and empowers them to mapper function.
- What are the different modes in which Hadoop can run?
- Standalone Mode (local mode) – This is the default mode in which Hadoop is configured to run. In this mode, all the components of Hadoop, such as DataNode, NameNode, etc., run as a single Java process, and it is useful for debugging.
- Pseudo Distributed Mode (Single-Node Cluster) – Hadoop runs on a single node in a pseudo-distributed mode. Each Hadoop daemon works in a separate Java process in Pseudo-Distributed Mode, while in Local mode, each Hadoop daemon operates as a single Java process.
- Fully distributed mode (or multiple node cluster) – All the daemons are executed in separate nodes building into a multi-node cluster in the fully-distributed mode.
- Why can’t aggregation be performed on the mapper side?
- We cannot perform aggregation in the mapper because it requires sorting of the data, which occurs only on the reducer side.
- For aggregation, we need the output from all the mapper functions, which is not possible during the map phase because map tasks run on different nodes, where the data blocks are present.
- What is the importance of “RecordReader” in Hadoop?
- RecordReader in Hadoop uses the data from the InputSplit as input and converts it into Key-value pairs for Mapper.
- The MapReduce framework represents the RecordReader instance through InputFormat.
- What is the purpose of Distributed Cache in a MapReduce Framework?
- The Purpose of Distributed Cache in the MapReduce framework is to cache files when needed by the applications. It caches read-only text files, jar files, archives, etc.
- When you have cached a file for a job, the Hadoop framework will make it available to each and every data node where map/reduces tasks are operating.
- How do reducers communicate with each other in Hadoop?
Reducers always run in isolation; the Hadoop MapReduce programming paradigm never allows them to communicate with each other.
- What is Identity Mapper?
- Identity Mapper is a default Mapper class that automatically works when no Mapper is specified in the MapReduce driver class.
- It implements mapping inputs directly into the output.
- IdentityMapper.class is used as a default value when JobConf.setMapperClass is not set.
- What are the phases of MapReduce Reducer?
The MapReduce reducer has three phases:
- Shuffle phase – In this phase, the sorted output from a mapper is an input to the Reducer. This framework will fetch the relevant partition of the output of all the mappers by using HTTP.
- Sort phase – In this phase, the input from various mappers is sorted based on related keys. This framework groups reducer inputs by keys. Shuffle and sort phases occur concurrently.
- Reduce phase – In this phase, reduce task aggregates the key-value pairs after shuffling and sorting phases. The OutputCollector.collect() method, writes the output of the reduce task to the Filesystem.
- What is the purpose of MapReduce Partitioner in Hadoop?
The MapReduce Partitioner manages the partitioning of the keys of the intermediate mapper output. It makes sure that all the values of a single key go to the same reducer, while allowing an even distribution of keys over the reducers.
- How will you write a custom partitioner for a Hadoop MapReduce job?
- Build a new class that extends the Partitioner class.
- Override the getPartition method.
- Add the custom partitioner to the job in the job configuration file or by using the setPartitionerClass method.
- What is a Combiner?
A Combiner is a semi-reducer that executes the local reduce task. It receives inputs from the Map class and passes the output key-value pairs to the reducer class.
- What is the use of SequenceFileInputFormat in Hadoop?
SequenceFileInputFormat is the input format used for reading sequence files. It is a compressed binary file format optimized for passing data between the output of one MapReduce job and the input of another MapReduce job.
- What is Apache Pig?
- Apache Pig is a high-level scripting language used for creating programs to run on Apache Hadoop.
- The language used in this platform is called Pig Latin.
- It executes Hadoop jobs in Apache Spark, MapReduce, etc.
- What are the benefits of Apache Pig over MapReduce?
- Pig Latin is a high-level scripting language while MapReduce is a low-level data processing paradigm.
- Without writing complex Java implementations in MapReduce, programmers can achieve the same results very easily using Pig Latin.
- Apache Pig decreases the length of the code by approx 20 times (according to Yahoo). Hence, this reduces development time by almost 16 times.
- Pig offers various built-in operators for data operations like filters, joins, sorting, ordering, etc., while performing these same functions in MapReduce is an enormous task.
- What are the Hadoop Pig data types?
Hadoop Pig runs both atomic data types and complex data types.
- Atomic data types: These are the basic data types that are used in all the languages like int, string, float, long, etc.
- Complex Data Types: These are Bag, Map, and Tuple.
- List the various relational operators used in “Pig Latin”?
- SPLIT
- LIMIT
- CROSS
- COGROUP
- GROUP
- STORE
- DISTINCT
- ORDER BY
- JOIN
- FILTER
- FOREACH
- LOAD
- What is Apache Hive?
Apache Hive offers a database query interface to Apache Hadoop. It reads, writes, and manages large datasets that are residing in distributed storage and queries through SQL syntax.
- Where does Hive store table data in HDFS?
/user/hive/warehouse is the default location where Hive stores table data in HDFS.
- Can the default “Hive Metastore” be used by multiple users (processes) at the same time?
By default, Hive Metastore uses the Derby database. So, it is not possible for multiple users or processes to access it at the same time.
- What is a SerDe?
SerDe is a combination of Serializer and Deserializer. It tells Hive how a record should be processed, allowing Hive to read from and write to a table.
- What are the differences between Hive and RDBMS?
Hive | RDBMS |
Schema on Reading | Schema on write |
Batch processing jobs | Real-time jobs |
Data stored on HDFS | Data stored on the internal structure |
Processed using MapReduce | Processed using database |
- What is an Apache HBase?
Apache HBase is a multidimensional, column-oriented key-value datastore that runs on top of HDFS (Hadoop Distributed File System). It is designed to provide high table-update rates and a fault-tolerant way to store a large collection of sparse data sets.
- What are the various components of Apache HBase?
- Region Server: These are the worker nodes that handle read, write, update, and delete requests from clients. The Region Server process runs on every node of the Hadoop cluster.
- HMaster: It monitors and manages the Region Servers in the Hadoop cluster for load balancing.
- ZooKeeper: HBase employs ZooKeeper as a coordination service for the distributed environment. It keeps track of every region server present in the HBase cluster.
- What is WAL in HBase?
- Write-Ahead Log (WAL) is a file that records all changes to data in HBase. It is used for recovering data sets.
- The WAL ensures all the changes to the data can be replayed when a RegionServer crashes or becomes unavailable.
- What are the differences between the Relational database and HBase?
Relational Database | HBase |
It is a row-oriented datastore | It is a column-oriented datastore |
It’s a schema-based database | Its schema is more flexible and less restrictive |
Suitable for structured data | Suitable for both structured and unstructured data |
Supports referential integrity | Doesn’t support referential integrity |
It includes thin tables | It includes sparsely populated tables |
Accesses records from tables using SQL queries. | Accesses data from HBase tables using APIs and MapReduce. |
- What is Apache Spark?
Apache Spark is an open-source framework used for real-time data analytics in a distributed computing environment. It is a data processing engine that provides faster analytics than Hadoop MapReduce.
- Can we build “Spark” with any particular Hadoop version?
Yes, we can build “Spark” for any specific Hadoop version.
- What is RDD?
RDD(Resilient Distributed Datasets) is a fundamental data structure of Spark. It is a distributed collection of objects, and each dataset in RDD is further distributed into logical partitions and computed on several nodes of the cluster
- What is Apache ZooKeeper?
Apache ZooKeeper is a centralized service used for managing various operations in a distributed environment. It maintains configuration data, performs synchronization, naming, and grouping.
- What is Apache Oozie?
Apache Oozie is a scheduler that controls the workflow of Hadoop jobs.
There are two kinds of Oozie jobs:
- Oozie Workflow – It is a collection of actions sequenced in a control dependency DAG (Directed Acyclic Graph) for execution.
- Oozie Coordinator – If you want to trigger workflows based on the availability of data or time then you can use Oozie Coordinator Engine.
- How can you configure the “Oozie” job in Hadoop?
Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs, such as Streaming MapReduce, Java MapReduce, Sqoop, Hive, and Pig.
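As a hedged illustration, once a workflow application (workflow.xml plus libraries) has been deployed to HDFS, the job is typically described in a job.properties file and driven from the Oozie command line; the Oozie server URL below is a placeholder:
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run (submits and starts the workflow, printing a job ID)
oozie job -oozie http://oozie-host:11000/oozie -info <job_id> (shows the status of the submitted workflow)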
- What is an Apache Flume?
- Apache Flume is a service/tool/data ingestion mechanism used to collect, aggregate, and transfer massive amounts of streaming data such as events, log files, etc., from various web sources to a centralized data store where they can be processed together.
- It is a highly reliable, distributed, and configurable tool that is specially designed to transfer streaming data to HDFS.
- List the Apache Flume features.
- It is fault-tolerant and robust
- Scales horizontally
- Selects high volume data streams in real-time
- Streaming data is gathered from multiple sources into Hadoop for analysis.
- Ensures guaranteed data delivery
- What is the use of Apache Sqoop in Hadoop?
Apache Sqoop is a tool used in particular for transferring massive amounts of data between Apache Hadoop and external datastores such as relational database management systems, enterprise data warehouses, etc.
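A hedged example of each direction, where the JDBC connection string, table names, and HDFS paths are placeholders:
sqoop import --connect jdbc:mysql://db-host/sales --table customers --target-dir /user/hadoop/customers (copies a relational table into HDFS)
sqoop export --connect jdbc:mysql://db-host/sales --table customers_out --export-dir /user/hadoop/customers (pushes HDFS data back into a relational table)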
- Where are Hadoop Sqoop scripts stored?
/usr/bin/Hadoop Sqoop
Part 5:
- What is Rack awareness? And why is it necessary?
Answer:
Rack awareness is about distributing data nodes across multiple racks. HDFS follows the rack awareness algorithm to place the data blocks. A rack holds multiple servers, and a cluster can have multiple racks. Let’s say there is a Hadoop cluster set up with 12 nodes: there could be 3 racks with 4 servers each, all connected so that the 12 nodes form a cluster. While deciding on the rack count, the important point to consider is the replication factor. Suppose 100 GB of data flows in every day with a replication factor of 3; then 300 GB of data will have to reside on the cluster. It is a better option to have the data replicated across the racks: even if a node goes down, a replica will be in another rack.
- What is the default block size, and how is it defined?
Answer:
128 MB, and it is defined in hdfs-site.xml; it is also customizable depending on the volume of the data and the level of access. Say 100 GB of data flows in per day; the data gets segregated and stored across the cluster. What will be the number of blocks? 800 blocks (1024*100/128, where 1024 converts GB to MB). There are two ways to customize the data block size:
- hadoop fs -D dfs.blocksize=134217728 (value in bytes, passed per command)
- In hdfs-site.xml, add the dfs.blocksize property (dfs.block.size in older releases) with the size in bytes.
If you change the default size to 512 MB because the data size is huge, then the number of blocks generated will be 200 (1024*100/512). A hedged way to verify the block size in effect is shown below.
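To confirm the block size actually applied (the file path is a placeholder):
hadoop fs -stat %o /user/hadoop/data.txt (prints the block size, in bytes, of a file already stored in HDFS)
hdfs getconf -confKey dfs.blocksize (prints the configured default block size for the cluster)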
- How do you get the report of the hdfs file system? About disk availability and no.of active nodes?
Answer:
Command: sudo -u hdfs hdfs dfsadmin -report
These are the list of information it displays,
- Configured Capacity – Total capacity available in hdfs
- Present Capacity – The total amount of space actually available to HDFS for data, after accounting for space used by non-DFS storage such as metadata and the fsimage.
- DFS Remaining – It is the amount of storage space still available to the HDFS to store more files
- DFS Used – It is the storage space that HDFS has used up.
- DFS Used% – In percentage
- Under replicated blocks – No. of blocks
- Blocks with corrupt replicas – If any corrupted blocks
- Missing blocks
- Missing blocks (with replication factor 1)
- What is Hadoop balancer, and why is it necessary?
Answer:
The data spread across the nodes is not always distributed in the right proportion, meaning each node’s utilisation might not be balanced: one node might be over-utilized while another is under-utilized. This has a high cost when running processes, which end up running heavily on the over-utilized nodes. To solve this, the Hadoop balancer is used to balance the utilization of data across the nodes. Whenever the balancer is executed, data gets moved so that under-utilized nodes get filled up and over-utilized nodes are freed up.
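A hedged example of running it, with the threshold given as a percentage of disk-usage deviation:
hdfs balancer -threshold 10 (moves blocks until each datanode’s utilization is within 10% of the cluster average; on older releases the equivalent command is hadoop balancer)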
- Difference between Cloudera and Ambari?
Answer:
Cloudera Manager | Ambari |
Administration tool for Cloudera | Administration tool for Hortonworks |
Monitors and manages the entire cluster and reports the usage and any issues | Monitors and manages the entire cluster and reports the usage and any issues |
Comes with Cloudera paid service | Open-source |
- What are the main actions performed by the Hadoop admin?
Answer:
Monitor the health of the cluster – many application pages have to be monitored if any processes run (Job History Server, YARN ResourceManager, Cloudera Manager/Ambari depending on the distribution).
Turn on security – SSL or Kerberos.
Tune performance – Hadoop balancer.
Add new data nodes as needed – infrastructure changes and configurations.
Optionally turn on the MapReduce Job History Tracking Server – sometimes restarting the services helps release cache memory; this is done when the cluster has no processes running.
- What is Kerberos?
Answer:
It is an authentication protocol that each service must use to sync up in order to run its processes; it is recommended to enable Kerberos. Since we are dealing with distributed computing, it is always good practice to have encryption while accessing and processing the data, as the nodes are connected and all information passes over a network. Because Hadoop uses Kerberos, passwords are not sent across the network; instead, passwords are used to compute encryption keys, and messages are exchanged between the client and the server. In simple terms, Kerberos lets the nodes prove their identity to each other in a secure, encrypted manner.
Configuration in core-site.xml:
hadoop.security.authentication: kerberos
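Once Kerberos is enabled, a user must obtain a ticket before running Hadoop commands. A hedged example, where the principal and realm are placeholders:
kinit hdfs-user@EXAMPLE.COM (obtains a Kerberos ticket for the principal; prompts for the password)
klist (lists the cached tickets to confirm that authentication succeeded)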
- What is the important list of hdfs commands?
Answer:
Commands | Purpose |
hdfs dfs -ls <hdfs path> | To list the files from the hdfs filesystem. |
hdfs dfs -put <local file> <hdfs folder> | Copy a file from the local system to the hdfs filesystem |
hdfs dfs -chmod 777 <hdfs file> | Give read, write, execute permission to the file. |
hdfs dfs -get <hdfs folder/file> <local filesystem> | Copy the file from the hdfs filesystem to the local filesystem |
hdfs dfs -cat <hdfs file> | View the file content from the hdfs filesystem |
hdfs dfs -rm <hdfs file> | Remove the file from the hdfs filesystem. But it will be moved to the trash path (it’s like a recycle bin in Windows) |
hdfs dfs -rm -skipTrash <hdfs file> | Removes the file permanently from the cluster. |
hdfs dfs -touchz <hdfs file> | Create a file in the hdfs filesystem |
- How do you check the logs of a Hadoop job submitted in the cluster, and how do you terminate an already running application?
Answer:
yarn logs -applicationId <application_id> – The ApplicationMaster generates logs on its container, appended with the ID it generates. This is helpful for monitoring the running status of the process and its log output.
yarn application -kill <application_id> – If an application already running in the cluster needs to be terminated, the kill command is used, where the application ID identifies the job to terminate.
Part 6:
- What are the different vendor-specific distributions of Hadoop?
The different vendor-specific distributions of Hadoop are Cloudera CDH, MapR, Amazon EMR, Microsoft Azure HDInsight, IBM InfoSphere BigInsights, and Hortonworks HDP (now part of Cloudera).
- What are the different Hadoop configuration files?
The different Hadoop configuration files include:
- hadoop-env.sh
- mapred-site.xml
- core-site.xml
- yarn-site.xml
- hdfs-site.xml
- The masters and slaves files
- What are the three modes in which Hadoop can run?
The three modes in which Hadoop can run are:
- Standalone mode: This is the default mode. It uses the local Filesystem and a single Java process to run the Hadoop services.
- Pseudo-distributed mode: This uses a single-node Hadoop deployment to execute all Hadoop services.
- Fully-distributed mode: This uses separate nodes to run Hadoop master and slave services.
- What are the differences between regular FileSystem and HDFS?
- Regular FileSystem: In a regular FileSystem, data is maintained on a single system. If the machine crashes, data recovery is difficult due to low fault tolerance, and seek time is higher, so processing the data takes longer.
- HDFS: Data is distributed and maintained across multiple systems. If a DataNode crashes, the data can still be recovered from other nodes in the cluster, because each block is replicated. Reading a single record can take slightly longer, since data is read from disk and coordinated across multiple systems, but throughput for large data sets is much higher.
- Why is HDFS fault-tolerant?
HDFS is fault-tolerant because it replicates data on different DataNodes. By default, a block of data is replicated on three DataNodes. The data blocks are stored in different DataNodes. If one node crashes, the data can still be retrieved from other DataNodes.
- Explain the architecture of HDFS.
The architecture of HDFS is as shown:
For an HDFS service, we have a NameNode that has the master process running on one of the machines and DataNodes, which are the slave nodes.
NameNode
NameNode is the master service that hosts metadata in disk and RAM. It holds information about the various DataNodes, their location, the size of each block, etc.
DataNode
DataNodes hold the actual data blocks and periodically send block reports to the NameNode. The DataNode stores and retrieves blocks when asked by the NameNode or a client. It serves the client’s read and write requests and performs block creation, deletion, and replication based on instructions from the NameNode.
- Data that is written to HDFS is split into blocks, depending on its size. The blocks are randomly distributed across the nodes. With the auto-replication feature, these blocks are auto-replicated across multiple machines with the condition that no two identical blocks can sit on the same machine.
- As soon as the cluster comes up, the DataNodes start sending their heartbeats to the NameNodes every three seconds. The NameNode stores this information; in other words, it starts building metadata in RAM, which contains information about the DataNodes available in the beginning. This metadata is maintained in RAM, as well as in the disk.
- What are the two types of metadata that a NameNode server holds?
The two types of metadata that a NameNode server holds are:
- Metadata in Disk – This contains the edit log and the FSImage
- Metadata in RAM – This contains the information about DataNodes
- What is the difference between a federation and high availability?
HDFS Federation | HDFS High Availability |
· There is no limitation to the number of NameNodes, and the NameNodes are not related to each other | · There are two NameNodes that are related to each other; both the active and standby NameNodes work all the time |
· All the NameNodes share a pool of metadata, in which each NameNode will have its dedicated pool | · Only one NameNode is active at a time, while the standby NameNode stays idle and updates its metadata once in a while |
· Provides fault tolerance, i.e., if one NameNode goes down, it will not affect the data of the other NameNodes | · It requires two separate machines: the active NameNode is configured on one, while the standby NameNode is configured on the other |
- If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split?
By default, each block in HDFS is 128 MB. All the blocks, except the last one, will be 128 MB. For an input file of 350 MB, there are three input splits in total. The sizes of the splits are 128 MB, 128 MB, and 94 MB.
- How does rack awareness work in HDFS?
HDFS Rack Awareness refers to the NameNode’s knowledge of which rack each DataNode belongs to and how the DataNodes are distributed across the racks of a Hadoop cluster.
By default, each block of data is replicated three times on various DataNodes present on different racks. Two identical blocks cannot be placed on the same DataNode. When a cluster is “rack-aware,” all the replicas of a block cannot be placed on the same rack. If a DataNode crashes, you can retrieve the data block from different DataNodes.
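To see how the NameNode has mapped DataNodes to racks, the topology can be printed with the following admin command (assuming admin privileges):
hdfs dfsadmin -printTopology    # lists each rack and the DataNodes assigned to it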
- How can you restart NameNode and all the daemons in Hadoop?
The following commands will help you restart NameNode and all the daemons:
You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again using the ./sbin/hadoop-daemon.sh start namenode command.
You can stop all the daemons with the ./sbin/stop-all.sh command and then start them using the ./sbin/start-all.sh command.
- Which command will help you find the status of blocks and FileSystem health?
To check the status of the blocks, use the command:
hdfs fsck <path> -files -blocks
To check the health status of FileSystem, use the command:
hdfs fsck / -files -blocks -locations > dfs-fsck.log
- What would happen if you store too many small files in a cluster on HDFS?
Storing many small files on HDFS generates a lot of metadata. Keeping this metadata in RAM becomes a challenge, as each file, block, or directory takes roughly 150 bytes of NameNode memory. The cumulative size of all the metadata therefore becomes too large for the NameNode to hold comfortably.
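A rough back-of-the-envelope illustration (the file count here is hypothetical): with 10 million small files, each occupying its own block, the NameNode tracks roughly 20 million objects (files plus blocks). At about 150 bytes per object, that is approximately 20,000,000 x 150 bytes ≈ 3 GB of NameNode heap consumed by metadata alone.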
- How do you copy data from the local system onto HDFS?
The following command will copy data from the local file system onto HDFS:
hadoop fs -copyFromLocal [source] [destination]
Example: hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv
In the above syntax, the source is the local path and the destination is the HDFS path. The -f option (force flag) allows you to overwrite a file that already exists at the destination in HDFS.
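For example, to overwrite a file that already exists at the destination (reusing the paths from the example above):
hadoop fs -copyFromLocal -f /tmp/data.csv /user/test/data.csv    # -f overwrites /user/test/data.csv if it exists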
- When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?
The commands below are used to refresh the node information while commissioning, or when the decommissioning of nodes is completed.
dfsadmin -refreshNodes
This is run with the HDFS client; it makes the NameNode re-read its hosts include/exclude files and refresh the set of DataNodes that are allowed to connect.
rmadmin -refreshNodes
This is the administrative command for the ResourceManager; it refreshes the hosts information at the ResourceManager so that node commissioning or decommissioning takes effect on the YARN side.
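In practice, the two commands are issued against HDFS and YARN respectively, typically after updating the hosts include/exclude files:
hdfs dfsadmin -refreshNodes    # NameNode re-reads the include/exclude files
yarn rmadmin -refreshNodes     # ResourceManager refreshes its list of allowed/excluded nodes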
- Is there any way to change the replication of files on HDFS after they are already written to HDFS?
Yes, the following are ways to change the replication of files on HDFS:
We can change the dfs.replication value to a particular number in the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file (conf/hadoop-site.xml in very old versions), which will apply that replication factor to any new content that comes in.
If you want to change the replication factor for a particular file or directory, use:
$HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /path/of/the/file
Example: $HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /user/temp/test.csv
- Who takes care of replication consistency in a Hadoop cluster and what do under/over replicated blocks mean?
In a cluster, it is always the NameNode that takes care of replication consistency. The fsck command provides information regarding over- and under-replicated blocks.
Under-replicated blocks:
These are the blocks that do not meet their target replication for the files they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.
Consider a cluster with three nodes and replication set to three. At any point, if one of the DataNodes crashes, the blocks on it become under-replicated: a replication factor was set, but there are no longer enough replicas to meet it. If the NameNode does not receive heartbeats from the failed node, it waits for a limited amount of time and then starts re-replicating the missing blocks from the available nodes.
Over-replicated blocks:
These are the blocks that exceed their target replication for the files they belong to. Usually, over-replication is not a problem, and HDFS will automatically delete excess replicas.
Consider a case of three nodes running with the replication of three, and one of the nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the data, and then the failed node is back with its set of blocks. This is an over-replication situation, and the NameNode will delete a set of blocks from one of the nodes.
- What is the distributed cache in MapReduce?
A distributed cache is a mechanism wherein the data coming from the disk can be cached and made available for all worker nodes. When a MapReduce program is running, instead of reading the data from the disk every time, it would pick up the data from the distributed cache to benefit the MapReduce processing.
To copy the file to HDFS, you can use the command:
hdfs dfs -put <local path of jar_file.jar> /user/Simplilearn/lib/jar_file.jar
To set up the application’s JobConf, use the command:
DistributedCache.addFileToClassPath(new Path("/user/Simplilearn/lib/jar_file.jar"), conf);
Then, add it to the driver class.
- What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?
RecordReader
This communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read.
Combiner
This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works on it, and then passes its output to the reducer phase.
Partitioner
The partitioner controls the partitioning of the keys of the intermediate map outputs: it decides which reducer each key-value pair (coming from the mappers or combiners) is sent to. The number of partitions is equal to the number of reduce tasks.
- Why is MapReduce slower in processing data in comparison to other processing frameworks?
This is quite a common question in Hadoop interviews; let us understand why MapReduce is slower in comparison to the other processing frameworks:
MapReduce is slower because:
- It is batch-oriented when it comes to processing data. Here, no matter what, you would have to provide the mapper and reducer functions to work on data.
- During processing, whenever a map task produces output, it is written to the local disk; the data is then shuffled and sorted before being picked up by the reduce phase, whose output is in turn written to HDFS. All of this disk I/O between stages makes MapReduce a lengthier process.
- In addition to the above reasons, MapReduce jobs are written in Java, which tends to require many lines of boilerplate code compared to higher-level frameworks.
- Is it possible to change the number of mappers to be created in a MapReduce job?
By default, you cannot change the number of mappers, because it is equal to the number of input splits. However, there are different ways in which you can either set a property or customize the code to change the number of mappers.
For example, if you have a 1 GB file that is split into eight blocks of 128 MB each, only eight mappers will run on the cluster by default.
- Name some Hadoop-specific data types that are used in a MapReduce program.
This is an important question, as you would need to know the different data types if you are getting into the field of Big Data.
For every data type in Java, you have an equivalent in Hadoop. Therefore, the following are some Hadoop-specific data types that you could use in your MapReduce program:
- IntWritable
- FloatWritable
- LongWritable
- DoubleWritable
- BooleanWritable
- ArrayWritable
- MapWritable
- ObjectWritable
- What is speculative execution in Hadoop?
If a node is executing a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted, and the other is killed. Speculative execution is therefore useful when you are working with an intensive workload.
For example, suppose node A is running a task slowly. The scheduler keeps track of the available resources, and with speculative execution turned on, a copy of the slower task is launched on node B. If node A’s task is still slower, the output is accepted from node B.
- How is identity mapper different from chain mapper?
Identity Mapper | Chain Mapper |
· This is the default mapper, chosen when no mapper is specified in the MapReduce driver class | · This class is used to run multiple mappers in a single map task |
· It implements the identity function, directly writing all its key-value pairs to the output | · The output of the first mapper becomes the input to the second mapper, the second to the third, and so on |
· It is defined in the old MapReduce API (MR1), in the org.apache.hadoop.mapred.lib package | · It is defined in the org.apache.hadoop.mapreduce.lib.chain package as the ChainMapper class |
- What are the major configuration parameters required in a MapReduce program?
We need to have the following configuration parameters:
- Input location of the job in HDFS
- Output location of the job in HDFS
- Input and output formats
- Classes containing a map and reduce functions
- JAR file for mapper, reducer and driver classes
- What do you mean by map-side join and reduce-side join in MapReduce?
Map-side join | Reduce-side join |
· The mapper performs the join | · The reducer performs the join |
· Each input dataset must be divided into the same number of partitions | · Easier to implement than the map-side join, as the sorting and shuffling phase sends the values with identical keys to the same reducer |
· The input to each map is a structured partition in sorted order | · No need to have the dataset in a structured (or partitioned) form |
- What is the role of the OutputCommitter class in a MapReduce job?
As the name indicates, OutputCommitter describes the commit of task output for a MapReduce job.
Example: the abstract class org.apache.hadoop.mapreduce.OutputCommitter (the old-API class org.apache.hadoop.mapred.OutputCommitter extends it); FileOutputCommitter is the default implementation used with file-based output formats.
MapReduce relies on the OutputCommitter for the following:
- Setting up the job during initialization
- Cleaning up the job after completion
- Setting up the task’s temporary output
- Checking whether a task needs a commit
- Committing the task output
- Discarding the task commit
- Explain the process of spilling in MapReduce.
Spilling is a process of copying the data from memory buffer to disk when the buffer usage reaches a specific threshold size. This happens when there is not enough memory to fit all of the mapper output. By default, a background thread starts spilling the content from memory to disk after 80 percent of the buffer size is filled.
For a 100 MB size buffer, the spilling will start after the content of the buffer reaches a size of 80 MB.
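Both the buffer size and the spill threshold are configurable. A hedged example using the Hadoop 2.x MapReduce property names (the values shown are only illustrative):
-D mapreduce.task.io.sort.mb=100            # in-memory sort buffer of 100 MB
-D mapreduce.map.sort.spill.percent=0.80    # start spilling to disk once the buffer is 80% full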
- How can you set the mappers and reducers for a MapReduce job?
The number of mappers and reducers can be set in the command line using:
-D mapred.map.tasks=5 -D mapred.reduce.tasks=2
In the code, one can configure JobConf variables:
job.setNumMapTasks(5); // 5 mappers
job.setNumReduceTasks(2); // 2 reducers
- What happens when a node running a map task fails before sending the output to the reducer?
If this ever happens, map tasks will be assigned to a new node, and the entire task will be rerun to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called application master, which takes care of the execution of the application. If a task on a particular node failed due to the unavailability of a node, it is the role of the application master to have this task scheduled on another node.
- Can we write the output of MapReduce in different formats?
Yes. Hadoop supports various input and output File formats, such as:
- TextOutputFormat – This is the default output format and it writes records as lines of text.
- SequenceFileOutputFormat – This is used to write sequence files when the output files need to be fed into another MapReduce job as input files.
- MapFileOutputFormat – This is used to write the output as map files.
- SequenceFileAsBinaryOutputFormat – This is a variant of SequenceFileOutputFormat. It writes keys and values to a sequence file in binary format.
- DBOutputFormat – This is used for writing to relational databases and HBase. This format also sends the reduce output to a SQL table.
- What benefits did YARN bring in Hadoop 2.0 and how did it solve the issues of MapReduce v1?
In Hadoop v1, MapReduce performed both data processing and resource management; there was only one master process for the processing layer known as JobTracker. JobTracker was responsible for resource tracking and job scheduling.
Managing jobs with a single JobTracker and utilizing computational resources was inefficient in MapReduce v1. As a result, the JobTracker was overburdened by handling job scheduling and resource management together, which caused issues with scalability, availability, and resource utilization. In addition, non-MapReduce jobs could not run in v1 at all.
To overcome this issue, Hadoop 2 introduced YARN as the processing layer. In YARN, there is a processing master called ResourceManager. In Hadoop v2, you have ResourceManager running in high availability mode. There are node managers running on multiple machines, and a temporary daemon called application master. Here, the ResourceManager is only handling the client connections and taking care of tracking the resources.
In Hadoop v2, the following features are available:
- Scalability – You can have a cluster size of more than 10,000 nodes and you can run more than 100,000 concurrent tasks.
- Compatibility – The applications developed for Hadoop v1 run on YARN without any disruption or availability issues.
- Resource utilization – YARN allows the dynamic allocation of cluster resources to improve resource utilization.
- Multitenancy – YARN can use open-source and proprietary data access engines, as well as perform real-time analysis and run ad-hoc queries.
- Explain how YARN allocates resources to an application with the help of its architecture.
There is a client/application/API which talks to ResourceManager. The ResourceManager manages the resource allocation in the cluster. It has two internal components, scheduler, and application manager. The ResourceManager is aware of the resources that are available with every node manager. The scheduler allocates resources to various running applications when they are running in parallel. It schedules resources based on the requirements of the applications. It does not monitor or track the status of the applications.
The Applications Manager accepts job submissions and monitors and restarts the ApplicationMasters in case of failure. The ApplicationMaster manages the resource needs of an individual application: it interacts with the scheduler to acquire the required resources and with the NodeManagers to execute and monitor tasks. The NodeManager tracks the jobs running on its machine and monitors each container’s resource utilization.
A container is a collection of resources, such as RAM, CPU, or network bandwidth. It provides the rights to an application to use a specific amount of resources.
Whenever a job submission happens, ResourceManager requests the NodeManager to hold some resources for processing. NodeManager then guarantees the container that would be available for processing. Next, the ResourceManager starts a temporary daemon called application master to take care of the execution. The App Master, which the applications manager launches, will run in one of the containers. The other containers will be utilized for execution. This is briefly how YARN takes care of the allocation.
- Which of the following has replaced JobTracker from MapReduce v1?
- NodeManager
- ApplicationManager
- ResourceManager
- Scheduler
The answer is ResourceManager. It is the name of the master process in Hadoop v2.
- Write the YARN commands to check the status of an application and kill an application.
The commands are as follows:
- a) To check the status of an application:
yarn application -status ApplicationID
- b) To kill or terminate an application:
yarn application -kill ApplicationID
- Can we have more than one ResourceManager in a YARN-based cluster?
Yes, Hadoop v2 allows us to have more than one ResourceManager. You can have a high availability YARN cluster where you can have an active ResourceManager and a standby ResourceManager, where the ZooKeeper handles the coordination.
There can only be one active ResourceManager at a time. If an active ResourceManager fails, then the standby ResourceManager comes to the rescue.
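To check which ResourceManager is currently active, the rmadmin tool can be queried per RM ID; rm1 and rm2 below stand for the IDs configured in yarn-site.xml and are only examples:
yarn rmadmin -getServiceState rm1    # prints "active" or "standby"
yarn rmadmin -getServiceState rm2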
- What are the different schedulers available in YARN?
The different schedulers available in YARN are:
- FIFO scheduler – This places applications in a queue and runs them in the order of submission (first in, first out). It is not desirable, as a long-running application might block the small running applications
- Capacity scheduler – A separate dedicated queue allows the small job to start as soon as it is submitted. The large job finishes later compared to using the FIFO scheduler
- Fair scheduler – There is no need to reserve a set amount of capacity since it will dynamically balance resources between all the running jobs
- What happens if a ResourceManager fails while executing an application in a high availability cluster?
In a high availability cluster, there are two ResourceManagers: one active and the other standby. If a ResourceManager fails in the case of a high availability cluster, the standby will be elected as active and instructs the ApplicationMaster to abort. The ResourceManager recovers its running state by taking advantage of the container statuses sent from all node managers.
- In a cluster of 10 DataNodes, each having 16 GB RAM and 10 cores, what would be the total processing capacity of the cluster?
Every node in a Hadoop cluster will have one or multiple processes running, which would need RAM. The machine itself, which has a Linux file system, would have its own processes that need a specific amount of RAM usage. Therefore, if you have 10 DataNodes, you need to allocate at least 20 to 30 percent towards the overheads, Cloudera-based services, etc. You could have 11 or 12 GB and six or seven cores available on every machine for processing. Multiply that by 10, and that’s your processing capacity.
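A rough worked figure under those assumptions (about 4 to 5 GB of RAM and roughly 3 cores reserved per node for the OS and Hadoop daemons): 10 x ~11 GB ≈ 110 GB of RAM and 10 x ~7 cores ≈ 70 cores usable for YARN containers across the cluster.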
- What happens if requested memory or CPU cores go beyond the size of container allocation?
If an application starts demanding more memory or more CPU cores that cannot fit into a container allocation, your application will fail. This happens because the requested memory is more than the maximum container size.
Now that you have learned about HDFS, MapReduce, and YARN, let us move to the next section. We’ll go over questions about Hive, Pig, HBase, and Sqoop.
- What are the different components of a Hive architecture?
The different components of the Hive are:
- User Interface: This calls the execute interface to the driver and creates a session for the query. Then, it sends the query to the compiler to generate an execution plan for it
- Metastore: This stores the metadata information and sends it to the compiler for the execution of a query
- Compiler: This generates the execution plan. The plan is a DAG of stages, where each stage is either a metadata operation, an operation on HDFS, or a map/reduce job
- Execution Engine: This acts as a bridge between the Hive and Hadoop to process the query. Execution Engine communicates bidirectionally with Metastore to perform operations, such as create or drop tables.
- What is the difference between an external table and a managed table in Hive?
External Table | Managed Table |
· External tables in Hive refer to data that is at an existing location outside the warehouse directory | · Also known as internal tables; these tables manage the data and move it into the warehouse directory by default |
· If an external table is dropped, Hive deletes only the metadata of the table and does not change the table data present in HDFS | · If a managed table is dropped, the metadata along with the table data is deleted from the Hive warehouse directory |
- What is a partition in Hive and why is partitioning required in Hive?
Partition is a process for grouping similar types of data together based on columns or partition keys. Each table can have one or more partition keys to identify a particular partition.
Partitioning provides granularity in a Hive table. It reduces the query latency by scanning only relevant partitioned data instead of the entire data set. We can partition the transaction data for a bank based on month — January, February, etc. Any operation regarding a particular month, say February, will only have to scan the February partition, rather than the entire table data.
- Why does Hive not store metadata information in HDFS?
We know that the Hive’s data is stored in HDFS. However, the metadata is either stored locally or it is stored in RDBMS. The metadata is not stored in HDFS, because HDFS read/write operations are time-consuming. As such, Hive stores metadata information in the metastore using RDBMS instead of HDFS. This allows us to achieve low latency and is faster.
- What are the components used in Hive query processors?
The components used in Hive query processors are:
- Parser
- Semantic Analyzer
- Execution Engine
- User-Defined Functions
- Logical Plan Generation
- Physical Plan Generation
- Optimizer
- Operators
- Type checking
- Suppose there are several small CSV files present in /user/input directory in HDFS and you want to create a single Hive table from these files. The data in these files have the following fields: {registration_no, name, email, address}. What will be your approach to solve this, and where will you create a single Hive table for multiple smaller files without degrading the performance of the system?
Using SequenceFile format and grouping these small files together to form a single sequence file can solve this problem.
- Write a query to insert a new column(new_col INT) into a hive table (h_table) at a position before an existing column (x_col).
Hive’s ALTER TABLE ... CHANGE COLUMN clause supports the FIRST and AFTER keywords (there is no BEFORE), so the new column is added first and then repositioned so that it ends up just before x_col:
ALTER TABLE h_table ADD COLUMNS (new_col INT);
ALTER TABLE h_table CHANGE COLUMN new_col new_col INT AFTER <column that precedes x_col>;
(Use FIRST instead of AFTER <column> if x_col is currently the first column.)
- What are the key differences between Hive and Pig?
Hive | Pig |
· It uses a declarative, SQL-like language called HiveQL, mainly for reporting | · Uses a high-level procedural language called Pig Latin for programming |
· Operates on the server side of the cluster and allows structured data | · Operates on the client side of the cluster and allows both structured and unstructured data |
· It does not support the Avro file format by default; this can be added using the org.apache.hadoop.hive.serde2.avro SerDe | · Supports the Avro file format by default |
· Facebook developed it, and it supports partitions | · Yahoo developed it, and it does not support partitions |
- How is Apache Pig different from MapReduce?
Pig | MapReduce |
· It has fewer lines of code compared to MapReduce | · Has more lines of code |
· A high-level language that can easily perform join operations | · A low-level language that cannot perform join operations easily |
· On execution, every Pig operator is converted internally into a MapReduce job | · MapReduce jobs take more time to compile |
· Works with all versions of Hadoop | · A MapReduce program written for one Hadoop version may not work with other versions |
- What are the different ways of executing a Pig script?
The different ways of executing a Pig script are as follows:
- Grunt shell
- Script file
- Embedded script
- What are the major components of a Pig execution environment?
The major components of a Pig execution environment are:
- Pig Scripts: They are written in Pig Latin using built-in operators and UDFs, and submitted to the execution environment.
- Parser: Completes type checking and checks the syntax of the script. The output of the parser is a Directed Acyclic Graph (DAG).
- Optimizer: Performs optimization using merge, transform, split, etc. Optimizer aims to reduce the amount of data in the pipeline.
- Compiler: Converts the optimized code into MapReduce jobs automatically.
- Execution Engine: MapReduce jobs are submitted to execution engines to generate the desired results.
- Explain the different complex data types in Pig.
Pig has three complex data types, which are primarily Tuple, Bag, and Map.
Tuple
A tuple is an ordered set of fields that can contain different data types for each field. It is enclosed in parentheses ().
Example: (1,3)
Bag
A bag is a set of tuples represented by curly braces {}.
Example: {(1,4), (3,5), (4,6)}
Map
A map is a set of key-value pairs used to represent data elements. It is represented in square brackets [ ].
Example: [key#value, key1#value1,….]
- What are the various diagnostic operators available in Apache Pig?
Pig has Dump, Describe, Explain, and Illustrate as the various diagnostic operators.
Dump
The dump operator runs the Pig Latin scripts and displays the results on the screen.
Load the data using the “load” operator into Pig.
Display the results using the “dump” operator.
Describe
Describe operator is used to view the schema of a relation.
Load the data using “load” operator into Pig
View the schema of a relation using “describe” operator
Explain
Explain operator displays the physical, logical and MapReduce execution plans.
Load the data using “load” operator into Pig
Display the logical, physical and MapReduce execution plans using “explain” operator
Illustrate
Illustrate operator gives the step-by-step execution of a sequence of statements.
Load the data using “load” operator into Pig
Show the step-by-step execution of a sequence of statements using “illustrate” operator
- State the usage of the group, order by, and distinct keywords in Pig scripts.
The group statement collects various records with the same key and groups the data in one or more relations.
Example: Group_data = GROUP Relation_name BY AGE
The order statement is used to display the contents of relation in sorted order based on one or more fields.
Example: Relation_2 = ORDER Relation_name1 BY field_name (ASC|DESC)
The distinct statement removes duplicate records; it works on entire records, not on individual fields.
Example: Relation_2 = DISTINCT Relation_name1
- What are the relational operators in Pig?
The relational operators in Pig are as follows:
COGROUP
It joins two or more tables and then performs GROUP operation on the joined table result.
CROSS
This is used to compute the cross product (cartesian product) of two or more relations.
FOREACH
This will iterate through the tuples of a relation, generating a data transformation.
JOIN
This is used to join two or more tables in a relation.
LIMIT
This will limit the number of output tuples.
SPLIT
This will split the relation into two or more relations.
UNION
It will merge the contents of two relations.
ORDER
This is used to sort a relation based on one or more fields.
- What is the use of having filters in Apache Pig?
Filter Operator is used to select the required tuples from a relation based on a condition. It also allows you to remove unwanted records from the data file.
Example: select the products whose quantity is greater than 1000
A = LOAD '/user/Hadoop/phone_sales' USING PigStorage(',') AS (year:int, product:chararray, quantity:int);
B = FILTER A BY quantity > 1000;
- Suppose there’s a file called “test.txt” having 150 records in HDFS. Write the PIG command to retrieve the first 10 records of the file.
To do this, we need to use the limit operator to retrieve the first 10 records from a file.
Load the data in Pig:
test_data = LOAD '/user/test.txt' USING PigStorage(',') AS (field1, field2,….);
Limit the data to first 10 records:
Limit_test_data = LIMIT test_data 10;
- What are the key components of HBase?
This is one of the most common interview questions.
Region Server
A region server contains HBase tables that are divided horizontally into “Regions” based on their key values. It runs on every node and decides the size of the regions. Each region server is a worker node that handles read, write, update, and delete requests from clients.
HMaster
This assigns regions to RegionServers for load balancing, and monitors and manages the Hadoop cluster. Whenever a client wants to change the schema and any of the metadata operations, HMaster is used.
ZooKeeper
This provides a distributed coordination service to maintain server state in the cluster. It tracks which servers are alive and available and provides server failure notifications. Region servers send their statuses to ZooKeeper, indicating whether they are ready for read and write operations.
- Explain what row keys and column families in HBase are.
The row key is a primary key for an HBase table. It also allows logical grouping of cells and ensures that all cells with the same row key are located on the same server.
Column families consist of a group of columns that are defined during table creation, and each column family has certain column qualifiers that a delimiter separates.
- Why do we need to disable a table in HBase and how do you do it?
The HBase table is disabled to allow modifications to its settings. When a table is disabled, it cannot be accessed through the scan command.
To disable the employee table, use the command:
disable 'employee_table'
To check if the table is disabled, use the command:
is_disabled 'employee_table'
- Write the code needed to open a connection in HBase.
The following code is used to open a connection in HBase:
Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");
- What does replication mean in terms of HBase?
The replication feature in HBase provides a mechanism to copy data between clusters. This feature can be used as a disaster recovery solution that provides high availability for HBase.
The following commands alter the hbase1 table and set the replication_scope to 1. A replication_scope of 0 indicates that the table is not replicated.
disable 'hbase1'
alter 'hbase1', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
enable 'hbase1'
- Can you import/export in an HBase table?
Yes, it is possible to import and export tables from one HBase cluster to another.
HBase export utility:
hbase org.apache.hadoop.hbase.mapreduce.Export "table name" "target export location"
Example: hbase org.apache.hadoop.hbase.mapreduce.Export "employee_table" "/export/employee_table"
HBase import utility:
create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}
hbase org.apache.hadoop.hbase.mapreduce.Import "table name" "source location of the exported data"
Example: create 'emp_table_import', {NAME => 'myfam', VERSIONS => 10}
hbase org.apache.hadoop.hbase.mapreduce.Import "emp_table_import" "/export/employee_table"
- What is compaction in HBase?
Compaction is the process of merging HBase store files (HFiles) into a single file. This is done to reduce the number of files that must be searched and the disk seeks needed during reads. Once the files are merged, the original files are deleted.
- How does Bloom filter work?
The HBase Bloom filter is a mechanism to test whether an HFile contains a specific row or row-col cell. The Bloom filter is named after its creator, Burton Howard Bloom. It is a data structure that predicts whether a given element is a member of a set of data. These filters provide an in-memory index structure that reduces disk reads and determines the probability of finding a row in a particular file.
- Does HBase have any concept of a namespace?
A namespace is a logical grouping of tables, analogous to a database in an RDBMS. You can create an HBase namespace to mirror the database/schema structure of an RDBMS.
To create a namespace, use the command:
create_namespace 'namespace name'
To list all the tables that are members of a namespace, use the command: list_namespace_tables 'default'
To list all the namespaces, use the command:
list_namespace
- How does the Write Ahead Log (WAL) help when a RegionServer crashes?
If a RegionServer hosting a MemStore crashes, the data that existed in memory but was not yet persisted is lost. HBase protects against this by writing to the WAL before the write is acknowledged as complete. The HBase cluster keeps a WAL to record changes as they happen. If a RegionServer goes down, replaying the WAL recovers the data that had not yet been flushed from the MemStore to an HFile.
- Write the HBase command to list the contents and update the column families of a table.
The following code is used to list the contents of an HBase table:
scan 'table_name'
Example: scan 'employee_table'
To update column families in the table, use the following command:
alter 'table_name', 'column_family_name'
Example: alter 'employee_table', 'emp_address'
- What are catalog tables in HBase?
The catalog has two tables: hbase:meta and -ROOT-.
The catalog table hbase:meta exists as an HBase table, although it is filtered out of the HBase shell’s list command. It keeps a list of all the regions in the system, and the location of hbase:meta is stored in ZooKeeper. The -ROOT- table (used by older HBase versions) keeps track of the location of the .META. table.
- What is hotspotting in HBase and how can it be avoided?
In HBase, all read and write requests should be uniformly distributed across all of the regions in the RegionServers. Hotspotting occurs when a given region serviced by a single RegionServer receives most or all of the read or write requests.
Hotspotting can be avoided by designing the row key in such a way that the data being written is spread across multiple regions in the cluster. The techniques to do so are listed below:
- Salting
- Hashing
- Reversing the key
- How is Sqoop different from Flume?
Sqoop | Flume |
· Sqoop works with RDBMSs and NoSQL databases to import and export data | · Flume works with streaming data that is generated continuously in the Hadoop environment, e.g., log files |
· Loading data in Sqoop is not event-driven | · Loading data in Flume is completely event-driven |
· Works with structured data sources; Sqoop connectors are used to fetch data from them | · Fetches streaming data, like tweets or log files, from web servers or application servers |
· It imports data from an RDBMS into HDFS and exports it back to the RDBMS | · Data flows through multiple channels into HDFS |
- What are the default file formats to import data using Sqoop?
The default file formats for importing data using Sqoop are the Delimited Text File Format and the SequenceFile Format. Let us understand each of them individually:
Delimited Text File Format
This is the default import format and can be specified explicitly using the --as-textfile argument. This argument writes a string-based representation of each record to the output files, with delimiter characters between individual columns and rows.
1,here is a message,2010-05-01
2,strive to learn,2010-01-01
3,third message,2009-11-12
SequenceFile Format
SequenceFile is a binary format that stores individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate
these data types for you. This format supports the exact storage of all data in binary representations and is appropriate for storing binary data.
- What is the importance of the eval tool in Sqoop?
The Sqoop eval tool allows users to execute user-defined queries against respective database servers and preview the result in the console.
- Explain how Sqoop imports and exports data between an RDBMS and HDFS, with its architecture.
Sqoop import:
- Introspect the database to gather metadata (primary key information)
- Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to HDFS
Sqoop export:
- Introspect the database to gather metadata (primary key information)
- Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to the RDBMS; Sqoop exports the Hadoop files back to the RDBMS tables
- Suppose you have a database “test_db” in MySQL. Write the command to connect this database and import tables to Sqoop.
The commands connect Sqoop to the test_db database and import the test_demo table into HDFS.
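The original commands are not reproduced here; a minimal sketch of what they would look like, assuming MySQL is running on localhost and using placeholder credentials and paths:
# list the databases visible to the given user
sqoop list-databases --connect jdbc:mysql://localhost --username root --password <password>
# import the test_demo table from test_db into HDFS using a single map task
sqoop import --connect jdbc:mysql://localhost/test_db --username root --password <password> --table test_demo --target-dir /user/test/test_demo -m 1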
- Explain how to export a table back to RDBMS with an example.
Suppose there is a “departments” table in “retail_db” that is already imported into Sqoop and you need to export this table back to RDBMS.
- i) Create a new "dept" table in the RDBMS (MySQL) to export into
- ii) Export the "departments" table to the "dept" table (a sketch of the export command follows below)
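A hedged sketch of the export step (the table names follow the question; the HDFS directory and credentials are placeholders, and the dept table must already exist in MySQL with a matching schema):
sqoop export --connect jdbc:mysql://localhost/retail_db --username root --password <password> --table dept --export-dir <hdfs directory holding the departments data>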
- What is the role of the JDBC driver in the Sqoop setup? Is the JDBC driver enough to connect Sqoop to the database?
JDBC driver is a standard Java API used for accessing different databases in RDBMS using Sqoop. Each database vendor is responsible for writing their own implementation that will communicate with the corresponding database with its native protocol. Each user needs to download the drivers separately and install them onto Sqoop prior to its use.
JDBC driver alone is not enough to connect Sqoop to the database. We also need connectors to interact with different databases. A connector is a pluggable piece that is used to fetch metadata and allows Sqoop to overcome the differences in SQL dialects supported by various databases, along with providing optimized data transfer.
- How will you update the columns that are already exported? Write a Sqoop command to show all the databases in the MySQL server?
To update a column of a table which is already exported, we use the command:
--update-key parameter
The following is an example:
sqoop export --connect
jdbc:mysql://localhost/dbname --username root
--password cloudera --export-dir /input/dir
--table test_demo --fields-terminated-by ","
--update-key column_name
- What is Codegen in Sqoop?
The Codegen tool in Sqoop generates the Data Access Object (DAO) Java classes that encapsulate and interpret imported records.
The following example generates Java code for an “employee” table in the “testdb” database.
$ sqoop codegen \
--connect jdbc:mysql://localhost/testdb \
--username root \
--table employee
- Can Sqoop be used to convert data in different formats? If not, which tools can be used for this purpose?
Yes, Sqoop can be used to convert data into different formats. This depends on the different arguments that are used for importing.
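For example, the on-disk format of an import can be selected with the --as-* arguments (a sketch; Parquet support depends on the Sqoop version):
sqoop import --connect jdbc:mysql://localhost/test_db --username root --password <password> --table test_demo --as-sequencefile    # alternatives: --as-textfile, --as-avrodatafile, --as-parquetfile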