Hive Interview Q & A

Part 1:
1) Explain what is Hive?
Hive is an ETL and data warehousing tool developed on top of the Hadoop Distributed File System (HDFS). It is a data warehouse framework for querying and analyzing data stored in HDFS. Hive is open-source software that lets programmers analyze large data sets on Hadoop.
2) When to use Hive?
- Hive is useful when making data warehouse applications
- When you are dealing with static data instead of dynamic data
- When the application can tolerate high latency (high response time)
- When a large data set is maintained
- When we are using queries instead of scripting
3) Mention what are the different modes of Hive?
Depending on the size of data nodes in Hadoop, Hive can operate in two modes.
These modes are,
- Local mode
- Map reduce mode
4) Mention when to use Map reduce mode?
Map reduce mode is used when,
- It has to perform on large data sets and the query executes in a parallel way
- Hadoop has multiple data nodes and the data is distributed across the different nodes
- Processing large data sets with better performance needs to be achieved
5) Mention key components of Hive Architecture?
Key components of Hive Architecture includes,
- User Interface
- Compiler
- Metastore
- Driver
- Execution Engine
6) Mention what are the different types of tables available in Hive?
There are two types of tables available in Hive.
- Managed table: In a managed table, both the data and the schema are under the control of Hive
- External table: In an external table, only the schema is under the control of Hive.
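As a minimal sketch (table and column names are hypothetical), the two kinds of tables are declared as follows; dropping the managed table deletes its data, while dropping the external table leaves the files at the given path:
CREATE TABLE managed_logs (id INT, msg STRING); -- data and schema both managed by Hive
CREATE EXTERNAL TABLE external_logs (id INT, msg STRING)
LOCATION '/data/logs'; -- Hive manages only the schema; the data stays in HDFS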
7) Explain what is Metastore in Hive?
Metastore is a central repository in Hive. It is used for storing schema information, or metadata, in an external database.
8) Mention what Hive is composed of?
Hive consists of 3 main parts,
- Hive Clients
- Hive Services
- Hive Storage and Computing
9) Mention what types of databases Hive supports?
For single-user metadata storage, Hive uses the Derby database; for multi-user or shared metadata, Hive uses MySQL.
10) Mention Hive default read and write classes?
Hive default read and write classes are
- TextInputFormat/HiveIgnoreKeyTextOutputFormat
- SequenceFileInputFormat/SequenceFileOutputFormat
11) Mention what are the different modes of Hive?
The modes of Hive depend on the size of the data nodes in Hadoop.
These modes are,
- Local mode
- Map reduce mode
12) Why is Hive not suitable for OLTP systems?
Hive is not suitable for OLTP systems because it does not provide insert and update operations at the row level.
13) Mention what is the difference between HBase and Hive?
The differences between HBase and Hive are:
- Hive enables most SQL queries, but HBase does not allow SQL queries
- Hive does not support record-level insert, update, and delete operations on a table, whereas HBase does
- Hive is a data warehouse framework, whereas HBase is a NoSQL database
- Hive runs on top of MapReduce, whereas HBase runs on top of HDFS
14) Explain what is a Hive variable? What do we use it for?
A Hive variable is created in the Hive environment and can be referenced by Hive scripts. It is used to pass values to Hive queries when the query starts executing.
15) Mention what is ObjectInspector functionality in Hive?
ObjectInspector functionality in Hive is used to analyze the internal structure of columns, rows, and complex objects. It also provides ways to access the internal fields inside the objects.
16) Mention what is (HS2) HiveServer2?
It is a server interface that performs the following functions:
- It allows remote clients to execute queries against Hive
- Retrieve the results of mentioned queries
Some advanced features in its latest version, based on Thrift RPC, include:
- Multi-client concurrency
- Authentication
17) Mention what Hive query processor does?
The Hive query processor converts SQL into a graph of MapReduce jobs, along with the execution-time framework, so that the jobs can be executed in the order of their dependencies.
18) Mention what are the components of a Hive query processor?
The components of a Hive query processor include,
- Logical Plan Generation
- Physical Plan Generation
- Execution Engine
- Operators
- UDFs and UDAFs
- Optimizer
- Parser
- Semantic Analyzer
- Type Checking
19) Mention what are Partitions in Hive?
Hive organizes tables into partitions.
- It is one of the ways of dividing tables into different parts based on partition keys.
- Partition is helpful when the table has one or more Partition keys.
- Partition keys are basic elements for determining how the data is stored in the table.
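For illustration (the sales table and its columns are hypothetical), a table partitioned on a date key looks like this; each distinct sale_date value becomes its own subdirectory:
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING);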
20) Mention when to choose “Internal Table” and “External Table” in Hive?
In Hive you can choose the internal table,
- If the data to be processed is available in the local file system
- If we want Hive to manage the complete lifecycle of the data, including deletion
You can choose the external table,
- If the data to be processed is already available in HDFS
- When the files are being used outside of Hive
21) Mention if we can name view same as the name of a Hive table?
No. The name of a view must be unique among all the tables and views present in the same database.
22) Mention what are views in Hive?
In Hive, views are similar to tables. They are generated based on requirements.
- We can save any result-set data as a view in Hive
- Usage is similar to views in SQL
- Views are read-only: they cannot be the target of INSERT or LOAD statements (see below)
23) Explain how Hive Deserialize and serialize the data?
Usually, while reading data, Hive first communicates with the InputFormat, which uses a RecordReader to read the records. The SerDe then deserializes each record into a row object, using an ObjectInspector to expose the fields; serializing is the reverse process when data is written.
24) What are Buckets in Hive?
- The data present in partitions can be divided further into buckets
- The division is performed based on the hash of a particular column selected in the table.
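A minimal sketch (names hypothetical) of a partitioned table whose partitions are further divided into buckets:
CREATE TABLE user_events (user_id INT, event STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS; -- rows are hashed on user_id into 32 files per partition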
25) In Hive, how can you enable buckets?
In Hive, you can enable buckets by using the following command,
set hive.enforce.bucketing=true;
26) In Hive, can you overwrite Hadoop MapReduce configuration in Hive?
Yes, you can overwrite Hadoop MapReduce configuration in Hive.
27) Explain how can you change a column data type in Hive?
You can change a column data type in Hive by using command,
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
28) Mention what the difference is between ORDER BY and SORT BY in Hive?
- SORT BY will sort the data within each reducer. You can use any number of reducers for a SORT BY operation.
- ORDER BY will sort all of the data together, which has to pass through one reducer. Thus, ORDER BY in Hive uses a single reducer.
29) Explain when to use explode in Hive?
Hadoop developers sometimes take an array as input and need to convert it into separate table rows. To convert complex data types into the desired table format, Hive uses explode.
30) Mention how can you stop a partition from being queried?
You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.
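For example, on Hive versions that support the OFFLINE clause (the table and partition value here are hypothetical):
ALTER TABLE sales PARTITION (sale_date='2014-01-01') ENABLE OFFLINE;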
Part 2:
- WHAT APPLICATIONS ARE SUPPORTED BY HIVE?
Hive supports client applications based on Java, PHP, Python, C, and Ruby coding languages.
- WHAT ARE THE DIFFERENT TABLES AVAILABLE IN HIVE?
There are two types of tables available in Hive – managed and external.
- WHAT IS THE DIFFERENCE BETWEEN EXTERNAL AND MANAGED TABLES?
While external tables give Hive control of the schema but not of the data, managed tables give both schema and data control.
- WHERE DOES THE DATA OF A HIVE TABLE GET STORED?
The Hive table gets stored in an HDFS directory – /user/hive/warehouse, by default. You can adjust it by setting the desired directory in the configuration parameter hive.metastore.warehouse.dir in hive-site.xml.
- CAN HIVE BE USED IN OLTP SYSTEMS?
Since Hive does not support row-level data insertion, it is not suitable for use in OLTP systems.
- CAN A TABLE NAME BE CHANGED IN HIVE?
Yes, you can change a table name in Hive. You can rename a table name by using: Alter Table table_name RENAME TO new_name.
- WHERE IS HIVE TABLE DATA STORED?
Hive table data is stored in an HDFS directory by default – /user/hive/warehouse. This can be altered.
- CAN THE DEFAULT LOCATION OF A MANAGED TABLE BE CHANGED IN HIVE?
Yes, the default managed table location can be changed in Hive by using the LOCATION ‘<hdfs_path>’ clause.
- WHAT IS A HIVE METASTORE?
A Metastore is a relational database that stores the metadata of Hive partitions, tables, databases, and so on.
- WHAT ARE THE TYPES OF META STORES?
Local and Remote Meta stores are the two types of Hive Meta stores.
- WHAT IS THE DIFFERENCE BETWEEN LOCAL AND REMOTE META STORES?
Local Meta stores run on the same Java Virtual Machine (JVM) as the Hive service whereas remote Meta stores run on a separate, distinct JVM.
- WHAT IS THE DEFAULT APACHE HIVE METASTORE DATABASE?
The default database for Meta store is the embedded Derby database provided by Hive which is backed by the local disk.
- CAN MULTIPLE USERS USE ONE METASTORE?
Metastore sharing is not supported by Hive in embedded mode; for multiple users, a standalone database such as MySQL is required.
- WHAT ARE THE THREE DIFFERENT MODES IN WHICH HIVE CAN BE OPERATED?
The three modes in which Hive can be operated are Local mode, distributed mode, and pseudo-distributed mode.
- IS THERE A DATA TYPE IN HIVE TO STORE DATE INFORMATION?
The TIMESTAMP data type in Hive stores date information in the java.sql.Timestamp format.
- WHY IS PARTITIONING USED IN HIVE?
Partitioning is used in Hive as it allows for the reduction of query latency. Instead of scanning entire tables, only relevant partitions and corresponding datasets are scanned.
- WHAT ARE THE HIVE COLLECTION DATA TYPES?
ARRAY, MAP, AND STRUCT are the three Hive collection data types.
- IS IT POSSIBLE TO RUN UNIX SHELL COMMANDS IN HIVE?
Yes, one can run shell commands in Hive by adding a ‘!’ before the command.
- IS IT POSSIBLE TO EXECUTE HIVE QUERIES FROM A SCRIPT FILE?
Yes, one can do so with the help of a source command. For example – Hive> source /path/queryfile.hql
- WHAT IS A .HIVERC FILE?
It is a file that consists of a list of commands that need to be run when the Command Line Interface (CLI) is initiated.
- HOW CAN YOU CHECK IF A SPECIFIC PARTITION EXISTS?
Use the following command: SHOW PARTITIONS table_name PARTITION (partitioned_column='partition_value')
- IF YOU HAD TO LIST ALL DATABASES THAT BEGAN WITH THE LETTER ‘C’, HOW WOULD YOU DO IT?
By using the following command: SHOW DATABASES LIKE 'c.*'
- IS IT POSSIBLE TO DELETE DBPROPERTY IN HIVE?
No, there is no way to delete the DBPROPERTY.
- WHICH JAVA CLASS HANDLES THE INPUT RECORD ENCODING INTO FILES THAT STORE HIVE TABLES?
The ‘org.apache.hadoop.mapred.TextInputFormat’ class.
- WHICH JAVA CLASS HANDLES OUTPUT RECORD ENCODING INTO HIVE QUERY FILES?
The ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’ class.
- WHEN A HIVE TABLE PARTITION IS POINTED TO A NEW DIRECTORY, WHAT HAPPENS TO THE DATA?
The data remains in the old directory and needs to be transferred manually.
- DO YOU SAVE SPACE IN HDFS BY ARCHIVING HIVE TABLES?
No, archiving Hive tables only reduces the number of files, which makes for easier management of data.
- HOW CAN YOU STOP A PARTITION FROM BEING ACCESSED IN A QUERY?
Use the ENABLE OFFLINE clause along with the ALTER TABLE command.
- WHAT IS A TABLE GENERATING FUNCTION ON HIVE?
A table-generating function (UDTF), such as explode, transforms a single input row into multiple output rows.
- CAN YOU AVOID MAPREDUCE ON HIVE?
You can make Hive avoid MapReduce to return query results by setting the hive.exec.mode.local.auto property to ‘true’.
- CAN A CARTESIAN JOIN BE CREATED BETWEEN TWO HIVE TABLES?
This is not possible as it cannot be implemented in MapReduce programming.
- WHAT IS A VIEW IN HIVE?
A view is a logical construct that allows search queries to be treated as tables.
- CAN THE NAME OF A VIEW BE THE SAME AS A HIVE TABLE NAME?
No, the name of the view must always be unique in the database.
- CAN WE USE THE LOAD OR INSERT COMMAND ON A VIEW?
No, these commands cannot be used with respect to a view in Hive.
- WHAT IS INDEXING IN HIVE?
Hive indexing is a query optimization technique to reduce the time needed to access a column or a set of columns within a Hive database.
- ARE MULTI-LINE COMMENTS SUPPORTED BY HIVE?
No, multi-line comments are not supported by Hive.
- HOW CAN YOU VIEW THE INDEXES OF A HIVE TABLE?
By using the following command: SHOW INDEX ON table_name
- WHAT IS THE HIVE OBJECTINSPECTOR FUNCTION?
It helps to analyze the structure of individual columns and rows and provides access to the complex objects that are stored within the database.
- WHAT IS BUCKETING?
Bucketing is the process of hashing the values in a column into several user-defined buckets which helps avoid over-partitioning.
- HOW IS BUCKETING HELPFUL?
Bucketing helps optimize the sampling process and shortens the query response time.
- CAN YOU SPECIFY THE NAME OF THE TABLE CREATOR IN HIVE?
Yes, by using the TBLPROPERTIES clause. For example – TBLPROPERTIES ('creator'='john')
- WHAT IS HCATALOG?
HCatalog is a tool that helps to share data structures with other external systems in the Hadoop ecosystem.
- WHAT IS UDF IN HIVE?
UDF is a user-designed function created with a Java program to address a specific function that is not part of the existing Hive functions.
- WHAT DOES /*+ STREAMTABLE(table_name) */ DO?
It is a query hint that specifies which table should be streamed during a join, rather than buffered in memory, before the query is executed.
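As a hedged sketch (tables a and b are hypothetical), the hint appears inside the SELECT clause:
SELECT /*+ STREAMTABLE(a) */ a.key, b.val
FROM a JOIN b ON a.key = b.key; -- table 'a' is streamed; the other side is buffered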
- WHAT ARE THE LIMITATIONS OF HIVE?
Hive has the following limitations:
- Real-time queries cannot be executed and there is no support for row-level updates.
- Hive cannot be used for online transaction processing.
- WHY DO YOU NEED HCATALOG?
For sharing data structures with external systems, HCatalog is a necessary tool. It offers access to the Hive metastore for reading and writing data in a Hive data warehouse.
- NAME THE COMPONENTS OF A HIVE QUERY PROCESSOR?
Following are the components of a Hive query processor:
- Logical Plan Generation.
- Physical Plan Generation.
- Execution Engine.
- UDFs and UDAFs.
- Semantic Analyzer.
- Type Checking.
- HOW DOES HIVE DISTRIBUTE THE ROWS INTO BUCKETS?
Hive uses the formula hash_function(bucketing_column) modulo (num_of_buckets) to calculate a row's bucket number. Here, hash_function depends on the data type of the column. For an integer column it is simply:
hash_function(int_type_column) = value of int_type_column
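As a worked illustration (values hypothetical): with num_of_buckets = 32 and an integer column value of 100, the row is placed in bucket 100 mod 32 = 4.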
- HOW DO ORC FORMAT TABLES HELP HIVE TO ENHANCE PERFORMANCE?
Storing Hive data in the ORC (Optimized Row Columnar) format improves performance, as ORC addresses several limitations of the other file formats.
- WHAT ARE THE DIFFERENT COMPONENTS OF A HIVE ARCHITECTURE?
Following are the five components of a Hive architecture:
- User Interface: It helps the user send queries to the Hive system and perform other operations. The user interface provides the Hive Web UI, the Hive command line, and Hive HDInsight.
- Driver: It creates a session handle for the query and sends the query to the compiler to get an execution plan.
- Metastore: It contains the metadata on the various warehouse tables and partitions.
- Compiler: It creates the execution plan for the queries, performs semantic analysis on the different query blocks, and generates query expressions.
- Execution Engine: It executes the plans created by the compiler.
Part 3:
- What kind of data warehouse application is suitable for Hive? What are the types of tables in Hive?
Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS place restrictions on what Hive can do. However, Hive is most suitable for data warehouse applications because it:
- Analyzes relatively static data.
- Does not require fast response times.
- Does not make rapid changes in data.
Although Hive doesn't provide the fundamental features required for Online Transaction Processing (OLTP), it is suitable for data warehouse applications over large datasets.
There are two types of tables in Hive:
- Managed tables
- External tables
- Explain the SMB Join in Hive?
In an SMB (Sort Merge Bucket) join in Hive, every mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. SMB join is mostly used when there are no limits on file, partition, or table size, and it works best when the tables are large. In an SMB join the data is bucketed and sorted on the join columns, and all tables must have the same number of buckets.
- How is Hive different from an RDBMS?
- An RDBMS supports schema on write, whereas Hive provides schema on read.
- In Hive, we can write once, but in an RDBMS we can write as many times as we want.
- Hive can handle big datasets, whereas an RDBMS can't handle beyond 10 TB.
- Hive is highly scalable, but scalability in an RDBMS costs a lot.
- Hive has the feature of bucketing, which is not there in an RDBMS.
- What types of databases does Hive support?
For single-user metadata storage, Hive uses the Derby database; for multi-user or shared metadata, Hive uses MySQL.
- In Hive, how can you enable buckets?
In Hive, you can enable buckets by using the following command: set hive.enforce.bucketing=true;
- Is Hive suitable to be used for OLTP systems? Why?
No, it is not suitable for OLTP systems, since it does not offer insert and update at the row level.
- What is the ObjectInspector functionality in Hive?
In Hive, the analysis of the internal structure of columns, rows, and complex objects is done using the ObjectInspector functionality. ObjectInspector also makes the internal fields present inside the objects accessible.
- What are the limitations of Hive?
- We cannot perform real-time queries with Hive. Also, it does not offer row-level updates.
- Hive does not offer acceptable latency for interactive data browsing.
- Hive is not the right choice for online transaction processing.
- What are the different Modes in the Hive?
Sometimes interviewers like to ask these basic questions to see how confident you are when it comes to your Hive knowledge. Answer by saying that Hive can sometimes operate in two modes, which are MapReduce mode and local mode. Explain that this depends on the size of the Data Nodes in Hadoop.
- What is Hive Bucketing?
When performing queries on large datasets in Hive, bucketing can offer better structure to Hive tables. You’ll also want to take your answer a step further by explaining some of the specific bucketing features, as well as some of the advantages of bucketing in Hive. For example, bucketing can give programmers more flexibility when it comes to record-keeping and can make it easier to debug large datasets when needed.
- What is the difference between partitioning and bucketing?
The main aim of both partitioning and bucketing is to execute queries more efficiently. Partitioning splits a table into parts based on the values of the partition columns, and new partitions can be added as data arrives. With bucketing, the number of slices (buckets) is fixed when the table is created, and rows are assigned to buckets by hashing a column.
- Where does the data of a Hive table get stored?
By default, the Hive table is stored in the HDFS directory /user/hive/warehouse. Moreover, one can change it by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
- How does data transfer happen from HDFS to Hive?
If the data is already present in HDFS, the user need not use LOAD DATA, which moves the files to /user/hive/warehouse/. The user can simply define the table using the EXTERNAL keyword, which creates the table definition in the Hive metastore.
- What does the Hive query processor do?
The Hive query processor converts SQL into a graph of MapReduce jobs, along with the execution-time framework, so that the jobs can be executed in the order of their dependencies.
- Explain SORT BY, ORDER BY, DISTRIBUTE BY, and CLUSTER BY in Hive.
- SORT BY – Data is ordered at each of the 'N' reducers, where the reducers can have overlapping ranges of data.
- ORDER BY – This is similar to ORDER BY in SQL, where total ordering of the data takes place by passing it through a single reducer.
- DISTRIBUTE BY – It is used to distribute the rows among the reducers. Rows that have the same DISTRIBUTE BY columns will go to the same reducer.
- CLUSTER BY – It is a combination of DISTRIBUTE BY and SORT BY, where each of the N reducers gets a non-overlapping range of data which is then sorted at the respective reducers.
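A hedged sketch (the sales table and its columns are hypothetical) contrasting the four clauses:
SELECT * FROM sales SORT BY amount; -- sorted within each reducer
SELECT * FROM sales ORDER BY amount; -- total order through a single reducer
SELECT * FROM sales DISTRIBUTE BY region; -- rows with the same region go to the same reducer
SELECT * FROM sales CLUSTER BY region; -- shorthand for DISTRIBUTE BY region SORT BY region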
- What is the difference between local and remote metastore?
Local metastore: The metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote metastore: In this configuration, the metastore service runs in its own separate JVM, not in the Hive service JVM.
- Which classes are used in Hive to Read and Write HDFS Files?
Following classes are used by Hive to read and write HDFS files:
- TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
- SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.
- Explain the functionality of ObjectInspector.
ObjectInspector helps analyze the internal structure of a row object and the individual structure of columns in Hive. It also provides a uniform way to access complex objects that can be stored in multiple formats in memory:
- An instance of a Java class.
- A standard Java object.
- A lazily initialized object.
ObjectInspector tells the structure of the object and also the ways to access the internal fields inside the object.
- What is ObjectInspector functionality in Hive?
Hive uses ObjectInspector to analyze the internal structure of rows, columns, and complex objects. Additionally, it gives us ways to access the internal fields inside the object. It processes not only common data types like int, bigint, and STRING, but also complex data types like arrays, maps, structs, and unions.
- How does bucketing help in the faster execution of queries?
If you have to join two large tables, you can go for a reduce-side join. But if both tables have the same number of buckets (or multiples thereof) and are also sorted on the same column, there is a possibility of an SMB map join, in which all the joins take place in the map phase itself by matching the corresponding buckets.
Buckets are basically files that are created inside the HDFS directory.
There are different properties which you need to set for bucket map joins, and they are as follows:
set hive.enforce.sortmergebucketmapjoin=false;
set hive.auto.convert.sortmerge.join=false;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
- Why will MapReduce not run if you run select * from table in Hive?
When you perform a "select * from <table>", Hive fetches the whole data from the file as a FetchTask rather than a MapReduce task, simply dumping the data as it is without doing anything to it. This is similar to "hadoop dfs -text <file>". However, when using "select <column> from <table>", Hive requires a MapReduce job, since it needs to extract the column from each row by parsing it from the file it loads.
- What is the Hive metastore?
The metastore is a central repository of Hive that stores metadata in an external database. By default, Hive stores metadata in the Derby database, but it can be stored in MySQL or Oracle depending on the project.
- What are the three different modes in which Hive can be run?
- Local mode
- Distributed mode
- Pseudo distributed mode
- How can you prevent a large job from running for a long time?
This can be achieved by setting MapReduce jobs to execute in strict mode: set hive.mapred.mode=strict; Strict mode ensures that queries on partitioned tables cannot execute without defining a WHERE clause on the partition column.
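For instance (the sales table and its sale_date partition column are hypothetical):
set hive.mapred.mode=strict;
SELECT * FROM sales; -- rejected: no partition predicate in the WHERE clause
SELECT * FROM sales WHERE sale_date = '2014-01-01'; -- allowed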
- When do we use explode in Hive?
Hadoop developers sometimes take an array as input and need to convert it into separate table rows. To convert complex data types into the desired table format, Hive uses explode.
- What are the different components of a Hive architecture?
Hive architecture consists of:
- User Interface – The UI component of the Hive architecture calls the execute interface of the driver.
- Driver – The driver creates a session handle for the query and sends the query to the compiler to generate an execution plan for it.
- Metastore – Sends the metadata to the compiler for the execution of the query on receiving the metadata request.
- Compiler – The compiler generates the execution plan, which is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
- Execution Engine – The execution engine submits each of these stages to the relevant components, managing the dependencies between the various stages in the execution plan generated by the compiler.
- How can you connect an application, if you run Hive as a server?
When running Hive as a server, an application can be connected in one of three ways:
- ODBC Driver – This supports the ODBC protocol
- JDBC Driver – This supports the JDBC protocol
- Thrift Client – This client can be used to make calls to all Hive commands using different programming languages such as PHP, Python, Java, C++, and Ruby.
- Can we LOAD data into a view?
No. A view cannot be the target of an INSERT or LOAD statement.
- Is it possible to add 100 nodes when we already have 100 nodes in Hive? If yes, how?
Yes, we can add the nodes by following the below steps:
- Step 1: Take a new system; create a new username and password
- Step 2: Install SSH and with the master node setup SSH connections
- Step 3: Add ssh public_rsa id key to the authorized keys file
- Step 4: Add the new DataNode hostname, IP address, and other details in /etc/hosts slaves file: 192.168.1.102 slave3.in slave3
- Step 5: Start the DataNode on a new node
- Step 6: Log in to the new node, e.g.: su hadoop or ssh -X hadoop@192.168.1.103
- Step 7: Start HDFS on the newly added slave node by using the following command: ./bin/hadoop-daemon.sh start datanode
- Step 8: Check the output of the jps command on the new node
- Can Hive process any type of data format?
Yes, Hive uses the SerDe interface for IO operations, and different SerDe interfaces can read and write any type of data. Normal text data is processed directly, whereas for other types of data stored in Hadoop, Hive uses the appropriate SerDe interface to process it.
- How can you stop a partition from being queried?
You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.
- What is a Hive variable? What do we use it for?
Hive variables are basically created in the Hive environment and referenced by Hive scripting languages. They allow passing values to a Hive query when the query starts executing, for example from a script run with the source command.
- What is SerDe in Apache Hive?
SerDe is a short name for Serializer/Deserializer. Hive uses SerDe to read and write data from tables. An important concept behind Hive is that it DOES NOT own the Hadoop File System format that data is stored in. Users are able to write files to HDFS with whatever tools/mechanism takes their fancy ("CREATE EXTERNAL TABLE" or "LOAD DATA INPATH") and use Hive to correctly "parse" that file format in a way that can be used by Hive. A SerDe is a powerful (and customizable) mechanism that Hive uses to "parse" data stored in HDFS to be used by Hive.
- Whenever we run a Hive query, a new metastore_db is created. Why?
A local metastore is created when we run Hive in embedded mode. Before creating it, Hive checks whether the metastore already exists; this metastore property is defined in the configuration file, hive-site.xml. The property is javax.jdo.option.ConnectionURL, with the default value jdbc:derby:;databaseName=metastore_db;create=true. Therefore, to use the metastore from a single location, we have to change this value to an absolute path.
- Can we change the data type of a column in a Hive table?
Yes, using the REPLACE COLUMNS option: ALTER TABLE table_name REPLACE COLUMNS (column_name new_datatype, …);
- Why does Hive not store metadata information in HDFS?
Hive stores metadata information in the metastore using RDBMS instead of HDFS. The main reason for choosing RDBMS is to achieve low latency because HDFS read/write operations are time consuming processes.
- How does Hive deserialize and serialize the data?
Usually, while reading data, Hive first communicates with the InputFormat, which uses a RecordReader to read the records. The SerDe then deserializes each record into a row object, using an ObjectInspector to expose the fields; serializing is the reverse process when data is written.
- What is RegexSerDe?
Regex stands for regular expression. Whenever you want to do pattern matching, and to store fields based on the matched pattern, you can use RegexSerDe, which is present in org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
In the SERDEPROPERTIES, you have to define your input pattern and output fields. For example, from a line like xyz/pq@def, you would use it to extract xyz, pq, and def separately.
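A minimal sketch (table name, columns, and regex are illustrative) of a table backed by RegexSerDe, matching the xyz/pq@def example above:
CREATE TABLE parsed (col1 STRING, col2 STRING, col3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*)/(.*)@(.*)",
"output.format.string" = "%1$s %2$s %3$s"
);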
- While loading data into a Hive table using the LOAD DATA clause, how do you specify that it is an HDFS file and not a local file?
By omitting the LOCAL clause in the LOAD DATA statement; examples follow.
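For example (paths and table name hypothetical):
LOAD DATA INPATH '/data/input/file.txt' INTO TABLE employee; -- file is in HDFS
LOAD DATA LOCAL INPATH '/home/user/file.txt' INTO TABLE employee; -- file is on the local file system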
- Explain about the different types of partitioning in Hive?
Partitioning in Hive helps prune the data when executing the queries to speed up processing. Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.
Based on how data is loaded into the table, the requirements for the data, and the format in which data is produced at the source, a static or dynamic partition can be chosen. In dynamic partitioning, the complete data in the file is read and partitioned into the tables through a MapReduce job based on a particular field in the file. Dynamic partitions are usually helpful during ETL flows in the data pipeline.
When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.
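A hedged sketch (table and column names hypothetical) of the two insert styles:
-- Static: the partition value is hardcoded in the statement
INSERT OVERWRITE TABLE sales PARTITION (sale_date='2014-01-01')
SELECT id, amount FROM staging WHERE dt = '2014-01-01';
-- Dynamic: Hive derives the partition from the last selected column
INSERT OVERWRITE TABLE sales PARTITION (sale_date)
SELECT id, amount, dt FROM staging;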
- What is the significance of the 'IF EXISTS' clause while dropping a table?
When we issue the command DROP TABLE IF EXISTS table_name, Hive does not throw an error if the table being dropped does not exist in the first place.
- How can Hive avoid MapReduce?
If we set the property hive.exec.mode.local.auto to true, then Hive will avoid MapReduce to fetch query results.
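For example:
set hive.exec.mode.local.auto=true; -- let Hive run small jobs locally instead of via MapReduce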
- What is the relationship between MapReduce and Hive? Or how are MapReduce jobs submitted to the cluster?
Hive provides no additional capabilities beyond MapReduce. The programs are executed as MapReduce jobs via the interpreter. The interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs; the framework then submits those jobs to the cluster.
- What is ObjectInspector functionality?
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns. ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:
- An instance of a Java class (Thrift or native Java).
- A standard Java object (we use java.util.List to represent Struct and Array, and java.util.Map to represent Map).
- A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with a starting offset for each field).
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
- Suppose that I want to monitor all the open and aborted transactions in the system along with the transaction id and the transaction state. Can this be achieved using Apache Hive?
Hive 0.13.0 and above support the SHOW TRANSACTIONS command, which helps administrators monitor various Hive transactions.
- Can a partition be archived? What are the advantages and disadvantages?
Yes, a partition can be archived. The advantage is that it decreases the number of files stored in the NameNode, and the archived files can still be queried using Hive. The disadvantage is that it makes queries less efficient and does not offer any space savings.
- Does the archiving of Hive tables save space in HDFS?
No. It only reduces the number of files, which makes them easier for the NameNode to manage.
- Does Hive support record-level insert, delete, or update?
Hive does not provide record-level update, insert, or delete. Hence, Hive does not provide transactions either. However, users can use CASE statements and built-in functions of Hive to satisfy the above DML operations. Thus, a complex update query in an RDBMS may need many lines of code in Hive.
- What are the default record and field delimiters used for Hive text files?
The default record delimiter is \n, and the field delimiters are \001, \002, and \003.
- What is the difference between static and dynamic partitioning of a table?
Partitioning prunes data during a query and can minimize the query time. Partitions are created when data is inserted into the table. Static partitioning inserts individual rows into a partition named in the statement, whereas dynamic partitioning processes an entire table based on a particular column. At least one static partition is required before creating any (static or dynamic) partitions. If you are partitioning large datasets, as in an ETL flow, dynamic partitioning is recommended.
- Why do we perform partitioning in Hive?
Partitioning provides granularity in a Hive table. Hence, by scanning only the relevant partitioned data instead of the whole dataset, it reduces query latency.
- How does partitioning help in the faster execution of queries?
With the help of partitioning, a subdirectory will be created with the name of the partitioned column and when you perform a query using the WHERE clause, only the particular sub-directory will be scanned instead of scanning the whole table. This gives you faster execution of queries.
- Can you list a few commonly used Hive services?
- Command Line Interface (cli)
- Hive Web Interface (hwi)
- HiveServer (hiveserver)
- Printing the contents of an RC file using the tool rcfilecat.
- Jar
- Metastore
- What is the default maximum number of dynamic partitions that can be created by a mapper/reducer? How can you change it?
By default, the maximum number of partitions that can be created by a mapper or reducer is set to 100. One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode=<value>;
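A few related properties (values illustrative) are often set together when using dynamic partitions:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=500;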
- Why do we need buckets?
Basically, there are two main reasons for performing bucketing on a partition:
- A map-side join requires the data belonging to a unique join key to be present in the same partition.
- It allows us to decrease the query time and makes the sampling process more efficient.
- Can we name a view the same as the name of a Hive table?
No. The name of a view must be unique among all the tables and views present in the same database.
- What Options are Available When It Comes to Attaching Applications to the Hive Server?
Explain the three different ways (Thrift Client, JDBC Driver, and ODBC Driver) you can connect applications to the Hive Server. You’ll also want to explain the purpose for each option: for example, using JDBC will support the JDBC protocol.
- When should we use SORT BY instead of ORDER BY?
We should use SORT BY instead of ORDER BY, especially when we have to sort huge datasets. The reason is that the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together using a single reducer. Hence, ORDER BY will take a long time to execute on a large number of inputs.
- What are the uses of Hive explode?
Hadoop developers consider an array as their input and convert it into separate table rows. To convert complicated data types into the desired table format, Hive uses explode.
- Can we run UNIX shell commands from Hive? Can Hive queries be executed from script files? If yes, how?
Yes, we can run UNIX shell commands from Hive using a '!' mark before the command. For example, !pwd at the Hive prompt will display the current directory. We can execute Hive queries from script files using the source command.
- How is the ORC file format optimized for data storage and analysis?
ORC stores collections of rows in one file, and within the collection the row data is stored in a columnar format. The columnar format is very easy to compress, thus reducing a lot of storage cost. While querying, it reads only the particular columns instead of the whole row, as the records are stored in columnar format. ORC also has indexing on every block based on the min, max, sum, and count statistics of the columns, so when you query, it can skip blocks based on the indexes.
- What is the difference between an Internal and an External Table?
- Internal table: Hive manages both the metadata and the actual data. If the table is dropped, both the actual data and the metadata are lost.
- External table: Only the schema is stored in the metastore database, while the actual data stays at its location in HDFS. If an external table is dropped, only the metadata is lost, not the actual data.
- Explain the different types of joins in Hive.
HiveQL has 4 different types of joins –
- JOIN – Similar to an inner join in SQL.
- FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
- LEFT OUTER JOIN – All the rows from the left table are returned even if there are no matches in the right table.
- RIGHT OUTER JOIN – All the rows from the right table are returned even if there are no matches in the left table.
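A hedged sketch (tables a and b hypothetical) of the four join types:
SELECT a.key, b.val FROM a JOIN b ON a.key = b.key; -- inner join
SELECT a.key, b.val FROM a FULL OUTER JOIN b ON a.key = b.key;
SELECT a.key, b.val FROM a LEFT OUTER JOIN b ON a.key = b.key;
SELECT a.key, b.val FROM a RIGHT OUTER JOIN b ON a.key = b.key;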
- What is a metastore in Hive?
It is a relational database storing the metadata of Hive tables, partitions, Hive databases, etc.
- What is the functionality of the Query Processor in Apache Hive?
This component implements the processing framework for converting SQL into a graph of map/reduce jobs, as well as the execution-time framework to run those jobs in the order of their dependencies.
- What is the utilization of HCatalog?
HCatalog can be utilized to share data structures with external systems. HCatalog gives clients of other tools on Hadoop access to the Hive metastore, so that they can read and write data to Hive's data warehouse.
- How will you optimize Hive performance?
There are various ways to run Hive queries faster:
- Using the Apache Tez execution engine
- Using vectorization
- Using ORCFILE
- Doing cost-based query optimization
- In the case of embedded Hive, can the same metastore be used by multiple users?
We cannot use the metastore in sharing mode. It is suggested to use a standalone, real database like PostgreSQL or MySQL.
- When to use Map reduce mode?
Map reduce mode is used when:
- It has to perform on large data sets and the query executes in a parallel way
- Hadoop has multiple data nodes and the data is distributed across the different nodes
- Processing large data sets with better performance needs to be achieved
- What is the importance of the Thrift server & client, JDBC, and ODBC driver in Hive?
Thrift is a cross-language RPC framework that generates code and combines a software stack to finally execute the Thrift code on a remote server. The Thrift compiler acts as an interpreter between server and client. The Thrift server allows a remote client to submit requests to Hive, using different programming languages like Python, Ruby, and Scala.
JDBC driver: A JDBC driver is a software component enabling a Java application to interact with a database.
ODBC driver: ODBC accomplishes DBMS independence by using an ODBC driver as a translation layer between the application and the DBMS.
Part 4:
- Differentiate between Hive and HBase

| Hive | HBase |
| --- | --- |
| Enables most of the SQL queries | Doesn't allow SQL queries |
| Doesn't support record-level insert, update, and delete operations on a table | Supports them |
| It is a data warehouse framework | It is a NoSQL database |
| Hive runs on top of MapReduce | HBase runs on top of HDFS |
31) Compare Pig and Hive
| Criteria | Pig | Hive |
| --- | --- | --- |
| Architecture | Procedural data flow language | SQL-type declarative language |
| Application | Programming purposes | Report creation |
| Operational field | Client side | Server side |
| Support for Avro files | Yes | No |
32) What is the definition of Hive? What is the present version of Hive and explain about ACID transactions in Hive?
Hive is an open-source data warehouse system. We can use Hive for analyzing and querying large data sets stored in Hadoop files. It is similar to SQL. The present version of Hive is 0.13.1. Hive supports ACID (Atomicity, Consistency, Isolation, and Durability) transactions. They are provided at the row level, with the following options:
- Insert
- Delete
- Update
33) Explain what a Hive variable is. What do we use it for?
A Hive variable is basically created in the Hive environment and referenced by Hive scripting languages. It allows passing values to Hive queries when the query starts executing. It uses the source command.
34) What kind of data warehouse application is suitable for Hive? What are the types of tables in Hive?
Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS put restrictions on what Hive can do. Hive is most suitable for data warehouse applications.
Where:
- The data analyzed is relatively static.
- Fast response times are not required.
- There are no rapid changes in the data.
Hive doesn't provide the fundamental features required for OLTP (Online Transaction Processing). Hive is suitable for data warehouse applications on large data sets. There are two types of tables in Hive:
- Managed table.
- External table.
35) Can We Change settings within Hive Session? If Yes, How?
Yes, we can change the settings within a Hive session using the SET command. It helps change the Hive job settings for an exact query.
Example: The following command shows that buckets are occupied according to the table definition.
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with the property name. SET by itself will list all the properties with their values set by Hive.
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
This list will not include the Hadoop defaults, so we should use:
SET -v
It will list all the properties including the Hadoop defaults in the system.
36) Is it possible to add 100 nodes when we have 100 nodes already in Hive? How?
Yes, we can add the nodes by following the below steps.
- Take a new system; create a new username and password.
- Install SSH and set up SSH connections with the master node.
- Add the ssh public_rsa id key to the authorized keys file.
- Add the new DataNode host name, IP address, and other details in the /etc/hosts slaves file: 192.168.1.102 slave3.in slave3.
- Start the DataNode on the new node.
- Log in to the new node: su hadoop or ssh -X hadoop@192.168.1.103.
- Start HDFS of the newly added slave node by using the following command: ./bin/hadoop-daemon.sh start datanode.
- Check the output of the jps command on the new node.
37) Explain the concatenation function in Hive with an example.
The CONCAT function will join the input strings. We can specify any number of strings separated by commas.
Example:
CONCAT('Intellipaat','-','is','-','a','-','eLearning','-','provider');
Output:
Intellipaat-is-a-eLearning-provider
Here, the separator '-' is specified between every pair of strings. If the separator is common to all the strings, Hive provides another command, CONCAT_WS. In this case, we have to specify the separator first.
CONCAT_WS('-','Intellipaat','is','a','eLearning','provider');
Output: Intellipaat-is-a-eLearning-provider.
38) Trim and Reverse functions in Hive with examples.
The TRIM function will remove the spaces associated with a string.
Example:
TRIM(' INTELLIPAAT ');
Output:
INTELLIPAAT
To remove the leading spaces:
LTRIM(' INTELLIPAAT');
To remove the trailing spaces:
RTRIM('INTELLIPAAT ');
The REVERSE function reverses the characters in a string.
Example:
REVERSE('INTELLIPAAT');
Output:
TAAPILLETNI
39) How to change the column data type in Hive? Explain RLIKE in Hive.
We can change the column data type by using ALTER and CHANGE.
The syntax is:
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
Example: If we want to change the data type of the salary column from integer to bigint in the employee table:
ALTER TABLE employee CHANGE salary salary BIGINT;
RLIKE: A special function in Hive that matches a string against a Java regular expression: 'A RLIKE B' evaluates to true if any substring of A matches B.
Example:
'Intellipaat' RLIKE 'tell' → true
'Intellipaat' RLIKE '^I.*' → true (this is a regular expression)
40) What are the components used in Hive query processor?
The components of a Hive query processor include
- Logical Plan Generation.
- Physical Plan Generation.
- Execution Engine.
- UDFs and UDAFs.
- Semantic Analyzer.
- Type Checking
41) What are Buckets in Hive?
The data present in partitions is further divided into buckets. The division is performed based on the hash of particular columns of the table.
42) Explain the process to access subdirectories recursively in Hive queries.
By using the below commands, we can access subdirectories recursively in Hive:
hive> Set mapred.input.dir.recursive=true;
hive> Set hive.mapred.supports.subdirectories=true;
Hive tables can be pointed at a higher-level directory; this is suitable for a directory structure like /data/country/state/city/
43) What are the components used in Hive query processor?
The components of a Hive query processor include
- Logical Plan Generation
- Physical Plan Generation
- Execution Engine
- Operators
- UDFs and UDAFs
- Optimizer
- Parser
- Semantic Analyzer
- Type Checking
44) How to skip header rows from a table in Hive?
Consider header records in log files such as:
System=….
Version=…
Sub-version=….
These are three header lines that we do not want to include in our Hive query. To skip the header lines from our tables in Hive, set a table property that will allow us to skip them.
CREATE EXTERNAL TABLE employee (
name STRING,
job STRING,
dob STRING,
id INT,
salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES("skip.header.line.count"="3");
45) What is the maximum size of string data type supported by hive? Mention the Hive support binary formats.
The maximum size of the string data type supported by Hive is 2 GB.
Hive supports the text file format by default, and it supports the binary formats Sequence files, ORC files, Avro data files, and Parquet files.
Sequence files: A general binary format that is splittable, compressible, and row oriented.
ORC files: ORC stands for Optimized Row Columnar. It is a record-columnar, column-oriented storage file format. It divides the table into row splits, and within each split the values of each column are stored together.
AVRO data files: Like sequence files, they are splittable, compressible, and row oriented, but with the addition of schema evolution and multilingual binding support.
46) What is the precedence order of HIVE configuration?
We use the following precedence hierarchy for setting properties:
- The SET command in Hive
- The command-line -hiveconf option
- hive-site.xml
- hive-default.xml
- hadoop-site.xml
- hadoop-default.xml
47) If you run a select * query in Hive, why does it not run MapReduce?
The hive.fetch.task.conversion property of Hive lowers the latency of MapReduce overhead; in effect, when executing simple queries such as SELECT, FILTER, and LIMIT, it skips the MapReduce step.
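For example, the property (its documented values are none, minimal, and more) can be widened so that more simple queries bypass MapReduce (the employee table is hypothetical):
SET hive.fetch.task.conversion=more;
SELECT * FROM employee LIMIT 10; -- served as a fetch task, without a MapReduce job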
48) How can Hive improve performance with ORC format tables?
We can store Hive data in a highly efficient manner using the Optimized Row Columnar (ORC) file format, which overcomes many Hive file format limitations. Performance improves when ORC files are used for reading, writing, and processing data.
SET hive.compute.query.using.stats=true;
SET hive.stats.dbclass=fs;
CREATE TABLE orc_table (
id INT,
name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ':'
LINES TERMINATED BY '\n'
STORED AS ORC;
49) Explain the functionality of Object-Inspector.
It helps analyze the internal structure of a row object and the individual structure of columns in Hive. It also provides a uniform way to access complex objects that can be stored in multiple formats in memory:
- An instance of a Java class
- A standard Java object
- A lazily initialized object
The Object-Inspector tells the structure of the object and also the ways to access the internal fields inside the object.
50) Whenever we run a Hive query, a new metastore_db is created. Why?
A local metastore is created when we run Hive in embedded mode. Before creating it, Hive checks whether the metastore already exists; this metastore property is defined in the configuration file hive-site.xml. The property is javax.jdo.option.ConnectionURL, with the default value jdbc:derby:;databaseName=metastore_db;create=true. Therefore, to use the metastore from a single location, change this value to an absolute path.
51) How can we access subdirectories recursively?
By using the below commands, we can access subdirectories recursively in Hive:
hive> Set mapred.input.dir.recursive=true;
hive> Set hive.mapred.supports.subdirectories=true;
Hive tables can be pointed at a higher-level directory; this is suitable for a directory structure like /data/country/state/city/
52) What are the uses of explode in Hive?
Hadoop developers consider an array as their input and convert it into separate table rows. To convert complicated data types into the desired table format, Hive essentially uses explode.
53) What mechanisms are available for connecting from applications, when we run Hive as a server?
- Thrift Client: Using Thrift you can call Hive commands from various programming languages, e.g. C++, PHP, Java, Python, and Ruby.
- JDBC Driver: JDBC Driver supports the Type 4 (pure Java) JDBC Driver
- ODBC Driver: ODBC Driver supports the ODBC protocol.
54) How do we write our own custom SerDe?
End users usually want to read their own data format rather than write it, so it is often enough to implement a Deserializer instead of a full SerDe.
Example: The RegexDeserializer will deserialize the data using the configuration parameter 'regex' and a list of column names.
If our SerDe supports DDL, we probably want to implement a protocol based on DynamicSerDe; however, it is non-trivial to write a "thrift DDL" parser.
55) Mention the date data type in Hive. Name the Hive data type collection.
The TIMESTAMP data type stores dates in the java.sql.Timestamp format.
Three collection data types in Hive
- ARRAY
- MAP
- STRUCT
56) Can we run UNIX shell commands from Hive? Can Hive queries be executed from script files? How? Give an example.
Yes, we can run UNIX shell commands from Hive using a '!' mark before the command. For example, !pwd at the Hive prompt will list the current directory.
We can execute Hive queries from script files by using the source command.