Apache Spark Interview Q & A

Apache Spark Interview

Part 1:

1) What is Apache Spark?

Apache Spark is an easy-to-use and flexible data processing framework. Spark can run on Hadoop, standalone, or in the cloud. It is capable of accessing diverse data sources, including HDFS, Cassandra, and others.

2) Explain DStream with reference to Apache Spark

A DStream is a sequence of resilient distributed datasets (RDDs) that represents a stream of data. You can create a DStream from various sources such as HDFS, Apache Flume, Apache Kafka, etc.

3) Name three data sources available in SparkSQL

Three data sources available in SparkSQL are:

  • JSON Datasets
  • Hive tables
  • Parquet files

4) Name some internal daemons used in Spark.

Important daemons used in Spark are BlockManager, MemoryStore, DAGScheduler, Driver, Worker, Executor, Tasks, etc.

5) Define the term ‘Sparse Vector.’

A sparse vector is a vector that has two parallel arrays, one for indices and one for values, used for storing non-zero entries to save space.

6) Name the languages supported by Apache Spark for developing big data applications

Important languages used for developing big data applications are:

  • Java
  • Python
  • R
  • Clojure
  • Scala

7) What is the method to create a DataFrame?

In Apache Spark, a DataFrame can be created using Hive tables and structured data files.

8) Explain SchemaRDD

An RDD which consists of row objects with schema information about the type of data in each column is called a SchemaRDD.

9) What are accumulators?

Accumulators are write-only variables (from the workers' point of view). They are initialized once and sent to the workers. The workers update them based on the logic written, and the updated values are sent back to the driver.
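A minimal sketch of how an accumulator is typically used, assuming an existing SparkContext named sc and a hypothetical RDD of strings named records:

val badRecords = sc.longAccumulator("badRecords")   // created on the driver
records.foreach { r =>
  if (r.isEmpty) badRecords.add(1)                  // workers only add to it
}
println(badRecords.value)                           // only the driver reads the merged value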

10) What are the components of Spark Ecosystem?

The important components of Spark are:

  • Spark Core: The base engine for large-scale parallel and distributed data processing
  • Spark Streaming: This component is used for real-time data streaming.
  • Spark SQL: Integrates relational processing with Spark's functional programming API
  • GraphX: Allows graphs and graph-parallel computation
  • MLlib: Allows you to perform machine learning in Apache Spark

11) Name three features of using Apache Spark

The three most important features of Apache Spark are:

  1. Support for Sophisticated Analytics
  2. Integration with Hadoop and Existing Hadoop Data
  3. The ability to run applications on a Hadoop cluster, up to 100 times faster in memory and ten times faster on disk.

12) Explain the default level of parallelism in Apache Spark

If the user does not explicitly specify it, then the number of partitions is considered the default level of parallelism in Apache Spark.

13) Name three companies which use Spark Streaming services

Three known companies using Spark Streaming services are:

  • Uber
  • Netflix
  • Pinterest

14) What is Spark SQL?

Spark SQL is a module for structured data processing in which we take advantage of SQL queries running on structured data.

15) Explain Parquet file

Parquet is a columnar format file supported by many other data processing systems. Spark SQL allows you to perform both read and write operations with Parquet files.

16) Explain Spark Driver?

Spark Driver is the program which runs on the master node of the machine and declares transformations and actions on data RDDs.

17) How can you store the data in spark?

Spark is a processing engine which doesn't have its own storage engine. It can retrieve data from other storage engines like HDFS and S3.

18) Explain the use of File system API in Apache Spark

The File system API allows you to read data from various storage systems like HDFS, S3, or the local file system.

19) What is the task of the Spark Engine?

The Spark Engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.

20) What is the use of SparkContext?
SparkContext is the entry point to Spark. SparkContext allows you to create RDDs, which provide various ways of churning data.

21) How can you implement machine learning in Spark?

MLlib is a versatile machine learning library provided by Spark.

22) Can you do real-time processing with Spark SQL?

Real-time data processing is not possible directly. However, it is possible by registering an existing RDD as a SQL table and triggering SQL queries on it.

23) What are the important differences between Apache Spark and Hadoop?

| Parameter | Apache Spark | Hadoop |
| --- | --- | --- |
| Speed | Up to 100 times faster compared to Hadoop. | It has moderate speed. |
| Processing | Real-time and batch processing functionality. | It offers batch processing only. |
| Learning curve | Easy | Hard |
| Interactivity | It has interactive modes. | Apart from Pig and Hive, it has no interactive mode. |

24) Can you run Apache Spark on Apache Mesos?

Yes, you can run Apache Spark on the hardware clusters managed by Mesos.

25) Explain partitions

A partition is a smaller, logical division of data. It is the method of deriving logical units of data to speed up processing.

26) Define the term ‘Lazy Evaluation’ with reference to Apache Spark

Apache Spark delays its evaluation until it is needed. For transformations, Spark adds them to a DAG of computation and evaluates them only when the driver requests some data.
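A small sketch of lazy evaluation, assuming an existing SparkContext sc and a hypothetical input path:

val lines  = sc.textFile("hdfs:///data/input.txt")   // transformation: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))       // transformation: only recorded in the DAG
val n      = errors.count()                          // action: triggers the actual evaluation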

27) Explain the use of broadcast variables

The most common uses of broadcast variables are:

  • Broadcast variables help the programmer to keep a read-only variable cached on each machine instead of shipping a copy of it with tasks.
  • You can also use them to give every node a copy of a large input dataset in an efficient manner.
  • Efficient broadcast algorithms also help you to reduce communication costs.

28) How can you use Akka with Spark?

Spark uses Akka for scheduling. It also uses Akka for messaging between the workers and masters.

29) Which is the fundamental data structure of Spark?

The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark.

30) Can you use Spark for ETL process?

Yes, you can use Spark for the ETL process.

31) What is the use of the map transformation?

A map transformation on an RDD produces another RDD by translating each element. It translates every element by executing the function provided by the user.
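For illustration, a minimal map example, assuming an existing SparkContext sc:

val numbers = sc.parallelize(Seq(1, 2, 3, 4))
val doubled = numbers.map(_ * 2)   // new RDD: 2, 4, 6, 8 (exactly one output element per input)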

32) What are the disadvantages of using Spark?

The following are some of the disadvantages of using Spark:

  • Spark consumes a huge amount of memory compared with Hadoop.
  • You can't run everything on a single node, as the work must be distributed over multiple clusters.
  • Developers need extra care while running their applications in Spark.
  • Spark Streaming does not provide support for record-based window criteria.

33) What are common uses of Apache Spark?

Apache Spark is used for:

  • Interactive machine learning
  • Stream processing
  • Data analytics and processing
  • Sensor data processing

34) State the difference between persist() and cache() functions.

The persist() function allows the user to specify the storage level, whereas cache() uses the default storage level.
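A short sketch of the difference, using two hypothetical RDDs rddA and rddB:

import org.apache.spark.storage.StorageLevel

rddA.cache()                                // same as rddA.persist(StorageLevel.MEMORY_ONLY)
rddB.persist(StorageLevel.MEMORY_AND_DISK)  // persist() lets the caller choose the storage level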

35) Name the Spark Library which allows reliable file sharing at memory speed across different cluster frameworks.

Tachyon is a Spark library which allows reliable file sharing at memory speed across various cluster frameworks.

36) Apache Spark is a good fit for which type of machine learning techniques?

Apache Spark is ideal for simple machine learning algorithms like clustering, regression, and classification.

37) How can you remove the elements with a key present in any other RDD in Apache Spark?

In order to remove the elements with a key present in another RDD, you need to use the subtractByKey() function.
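A minimal sketch, assuming an existing SparkContext sc:

val left   = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right  = sc.parallelize(Seq(("b", 99)))
val result = left.subtractByKey(right)   // keeps only keys absent from right: ("a",1), ("c",3)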

38) What is the use of checkpoints in Spark?

Checkpoints allow the program to run around the clock. Moreover, they help to make it resilient towards failure irrespective of the application logic.

39) Explain lineage graph

A lineage graph holds the information needed to compute each RDD on demand. Therefore, whenever a part of a persistent RDD is lost, you can recover that data using the lineage graph information.

40) What are the file formats supported by spark?

Spark supports file formats such as JSON, TSV, ORC, RC, Parquet, etc., as well as compression codecs like Snappy.

41) What are Actions?

Actions help you to bring back the data from an RDD to the local machine. Their execution is the result of all previously created transformations.

42) What is YARN?

YARN is a central resource management platform on which Apache Spark can run, delivering scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

43) Explain Spark Executor

An executor is a Spark process which runs computations and stores the data on the worker node. The final tasks from SparkContext are transferred to the executor for execution.

44) Is it necessary to install Spark on all nodes while running a Spark application on YARN?

No, you don't necessarily need to install Spark on all nodes, as Spark runs on top of YARN.

45) What is a worker node in Apache Spark?

A worker node is any node which can run the application code in a cluster.

46) How can you launch Spark jobs inside Hadoop MapReduce?

Spark in MapReduce (SIMR) allows users to run all kinds of Spark jobs inside MapReduce without needing to obtain admin rights for that application.

47) Explain the process to trigger automatic clean-up in Spark to manage accumulated metadata.

You can trigger automatic clean-ups by setting the parameter spark.cleaner.ttl, or by separating long-running jobs into various batches and writing the intermediate results to disk.

48) Explain the use of Blinkdb

BlinkDB is a query engine tool which allows you to execute SQL queries on huge volumes of data and renders query results with meaningful error bars.

49) Does Spark handle monitoring and logging in Standalone mode?

Yes, Spark can handle monitoring and logging in standalone mode as it has a web-based user interface.

50) How can you identify whether a given operation is Transformation or Action?

You can identify the operation based on the return type. If the return type is not an RDD, then the operation is an action. However, if the return type is an RDD, then the operation is a transformation.
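For example, with a hypothetical RDD of integers named rdd:

val filtered = rdd.filter(_ > 10)   // returns an RDD  -> transformation (lazy)
val total    = rdd.count()          // returns a Long  -> action (triggers execution)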

51) Can You Use Apache Spark To Analyze and Access Data Stored In Cassandra Databases?

Yes, you can use the Spark Cassandra Connector, which allows you to access and analyze data stored in a Cassandra database.

52) State the difference between Spark SQL and HQL

SparkSQL is an essential component of the Spark Core engine. It supports both SQL and Hive Query Language (HQL) without altering their syntax.

Part 2:

  1. Can you tell me what Apache Spark is about?

Apache Spark is an open-source framework engine known for its speed and ease of use in the field of big data processing and analysis. It also has built-in modules for graph processing, machine learning, streaming, SQL, etc. The Spark execution engine supports in-memory computation and cyclic data flow. It can run either in cluster mode or standalone mode, and it can access diverse data sources like HBase, HDFS, Cassandra, etc.

  1. What are the features of Apache Spark?
  • High Processing Speed: Apache Spark helps in the achievement of a very high processing speed of data by reducing read-write operations to disk. The speed is almost 100x faster while performing in-memory computation and 10x faster while performing disk computation.
  • Dynamic Nature: Spark provides 80 high-level operators which help in the easy development of parallel applications.
  • In-Memory Computation: The in-memory computation feature of Spark due to its DAG execution engine increases the speed of data processing. This also supports data caching and reduces the time required to fetch data from the disk.
  • Reusability: Spark codes can be reused for batch-processing, data streaming, running ad-hoc queries, etc.
  • Fault Tolerance: Spark supports fault tolerance using RDD. Spark RDDs are the abstractions designed to handle failures of worker nodes which ensures zero data loss.
  • Stream Processing: Spark supports stream processing in real-time. The problem in the earlier MapReduce framework was that it could process only already existing data.
  • Lazy Evaluation: Spark transformations done using Spark RDDs are lazy. Meaning, they do not generate results right away, but they create new RDDs from existing RDD. This lazy evaluation increases the system efficiency.
  • Support Multiple Languages: Spark supports multiple languages like R, Scala, Python, and Java which provides dynamicity and helps in overcoming the Hadoop limitation of application development only using Java.
  • Hadoop Integration: Spark also supports the Hadoop YARN cluster manager thereby making it flexible.
  • Graph and SQL Support: Supports Spark GraphX for graph-parallel execution, Spark SQL, libraries for machine learning, etc.
  • Cost Efficiency: Apache Spark is considered a more cost-efficient solution when compared to Hadoop, as Hadoop requires large storage and data centers for data processing and replication.
  • Active Developer’s Community: Apache Spark has a large developer’s base involved in continuous development. It is considered to be the most important project undertaken by the Apache community.
  1. What is RDD?

RDD stands for Resilient Distribution Datasets. It is a fault-tolerant collection of parallel running operational elements. The partitioned data of RDD is distributed and immutable. There are two types of datasets:

  • Parallelized collections: Meant for running in parallel.
  • Hadoop datasets: These perform operations on file record systems on HDFS or other storage systems.

 

  1. What does DAG refer to in Apache Spark?

DAG stands for Directed Acyclic Graph with no directed cycles. There would be finite vertices and edges. Each edge from one vertex is directed to another vertex in a sequential manner. The vertices refer to the RDDs of Spark and the edges represent the operations to be performed on those RDDs.

  1. List the types of Deploy Modes in Spark.

There are 2 deploy modes in Spark. They are:

  • Client Mode: The deploy mode is said to be in client mode when the spark driver component runs on the machine node from where the spark job is submitted.
    • The main disadvantage of this mode is if the machine node fails, then the entire job fails.
    • This mode supports both interactive shells and the job submission commands.
    • The performance of this mode is the worst, and it is not preferred in production environments.
  • Cluster Mode: If the spark job driver component does not run on the machine from which the spark job has been submitted, then the deploy mode is said to be in cluster mode.
    • The spark job launches the driver component within the cluster as a part of the sub-process of Application Master.
    • This mode supports deployment only using the spark-submit command (interactive shell mode is not supported).
    • Here, since the driver programs are run in Application Master, in case the program fails, the driver program is re-instantiated.
    • In this mode, there is a dedicated cluster manager (such as stand-alone, YARN, Apache Mesos, Kubernetes, etc) for allocating the resources required for the job to run as shown in the below architecture.

Apart from the above two modes, if we have to run the application on our local machines for unit testing and development, the deployment mode is called “Local Mode”. Here, the jobs run on a single JVM in a single machine which makes it highly inefficient as at some point or the other there would be a shortage of resources which results in the failure of jobs. It is also not possible to scale up resources in this mode due to the restricted memory and space.

  1. What are receivers in Apache Spark Streaming?

Receivers are those entities that consume data from different data sources and then move them to Spark for processing. They are created by using streaming contexts in the form of long-running tasks that are scheduled for operating in a round-robin fashion. Each receiver is configured to use up only a single core. The receivers are made to run on various executors to accomplish the task of data streaming. There are two types of receivers depending on how the data is sent to Spark:

  • Reliable receivers: Here, the receiver sends an acknowledgement to the data sources post successful reception of data and its replication on the Spark storage space.
  • Unreliable receiver: Here, there is no acknowledgement sent to the data sources.
  1. What is the difference between repartition and coalesce?

| Repartition | Coalesce |
| --- | --- |
| Repartition can increase or decrease the number of data partitions. | Coalesce can only reduce the number of data partitions. |
| Repartition creates new data partitions and performs a full shuffle of evenly distributed data. | Coalesce reuses existing partitions to reduce the amount of data shuffled, which can leave partitions unevenly sized. |
| Repartition internally calls coalesce with the shuffle parameter, making it slower than coalesce. | Coalesce is faster than repartition. However, if there are unequal-sized data partitions, the speed might be slightly slower. |
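A brief sketch of both calls on a hypothetical DataFrame df:

val wide   = df.repartition(200)        // full shuffle; can increase or decrease the partition count
val narrow = df.coalesce(10)            // avoids a full shuffle; can only reduce the partition count
println(narrow.rdd.getNumPartitions)    // 10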

  1. What are the data formats supported by Spark?

Spark supports both raw files and structured file formats for efficient reading and processing. File formats like Parquet, JSON, XML, CSV, RC, Avro, TSV, etc. are supported by Spark.

  1. What do you understand by Shuffling in Spark?

The process of redistribution of data across different partitions which might or might not cause data movement across the JVM processes or the executors on the separate machines is known as shuffling/repartitioning. Partition is nothing but a smaller logical division of data.

It is to be noted that Spark has no control over what partition the data gets distributed across.

  1. What is YARN in Spark?
  • Support for YARN is one of the key features of Spark; YARN provides a central resource management platform for delivering scalable operations throughout the cluster.
  • YARN is a cluster management technology, and Spark is a tool for data processing.
  1. How is Apache Spark different from MapReduce?

| MapReduce | Apache Spark |
| --- | --- |
| MapReduce does only batch-wise processing of data. | Apache Spark can process the data both in real-time and in batches. |
| MapReduce does slow processing of large data. | Apache Spark runs approximately 100 times faster than MapReduce for big data processing. |
| MapReduce stores data in HDFS (Hadoop Distributed File System), which makes it take a long time to get the data. | Spark stores data in memory (RAM), which makes it easier and faster to retrieve data when needed. |
| MapReduce highly depends on disk, which makes it a high-latency framework. | Spark supports in-memory data storage and caching, which makes it a low-latency computation framework. |
| MapReduce requires an external scheduler for jobs. | Spark has its own job scheduler due to the in-memory data computation. |

  1. Explain the working of Spark with the help of its architecture.

Spark applications run in the form of independent processes that are coordinated by the Driver program by means of a SparkSession object. The cluster manager or the resource manager entity of Spark assigns the tasks of running the Spark jobs to the worker nodes as per the one-task-per-partition principle. Iterative algorithms repeatedly apply operations to the data and benefit from caching datasets across iterations. Every task applies its unit of operations to the dataset within its partition and produces a new partitioned dataset. These results are sent back to the main driver application for further processing or to store the data on the disk.

  1. What is the working of DAG in Spark?

DAG stands for Directed Acyclic Graph, which has a set of finite vertices and edges. The vertices represent RDDs and the edges represent the operations to be performed on RDDs sequentially. The DAG created is submitted to the DAG Scheduler which splits the graphs into stages of tasks based on the transformations applied to the data. The stage view has the details of the RDDs of that stage.

The working of DAG in Spark is defined as per the workflow below:

  • The first task is to interpret the code with the help of an interpreter. If you use the Scala code, then the Scala interpreter interprets the code.
  • Spark then creates an operator graph when the code is entered in the Spark console.
  • When the action is called on Spark RDD, the operator graph is submitted to the DAG Scheduler.
  • The operators are divided into stages of tasks by the DAG Scheduler. A stage consists of detailed step-by-step operations on the input data. The operators are then pipelined together.
  • The stages are then passed to the Task Scheduler, which launches the tasks via the cluster manager so that they run independently, without dependencies between the stages.
  • The worker nodes then execute the task.

Each RDD keeps track of the pointer to one/more parent RDD along with its relationship with the parent. For example, consider the operation val childB=parentA.map() on RDD, then we have the RDD childB that keeps track of its parentA which is called RDD lineage.
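A small sketch of inspecting the lineage that the DAG is built from, assuming an existing SparkContext sc:

val parentA = sc.parallelize(1 to 100)
val childB  = parentA.map(_ * 2)     // childB records parentA as its lineage parent
println(childB.toDebugString)        // prints the RDD lineage (the dependency chain)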

  1. Under what scenarios do you use Client and Cluster modes for deployment?
  • In case the client machines are not close to the cluster, then Cluster mode should be used for deployment. This is done to avoid the network latency caused by communication between the driver and the executors, which would occur in Client mode. Also, in Client mode, the entire process is lost if the machine goes offline.
  • If we have the client machine inside the cluster, then the Client mode can be used for deployment. Since the machine is inside the cluster, there won’t be issues of network latency and since the maintenance of the cluster is already handled, there is no cause of worry in cases of failure.
  1. What is Spark Streaming and how is it implemented in Spark?

Spark Streaming is one of the most important features provided by Spark. It is nothing but a Spark API extension for supporting stream processing of data from different sources.

  • Data from sources like Kafka, Kinesis, Flume, etc are processed and pushed to various destinations like databases, dashboards, machine learning APIs, or as simple as file systems. The data is divided into various streams (similar to batches) and is processed accordingly.
  • Spark streaming supports highly scalable, fault-tolerant continuous stream processing which is mostly used in cases like fraud detection, website monitoring, website click baits, IoT (Internet of Things) sensors, etc.
  • Spark Streaming first divides the data from the data stream into batches of X seconds, which are called DStreams or Discretized Streams. They are internally nothing but a sequence of multiple RDDs. The Spark application processes these RDDs using various Spark APIs, and the results of this processing are again returned as batches; a minimal streaming sketch follows below.
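A minimal Spark Streaming sketch, assuming a hypothetical socket source on localhost:9999 and local execution:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))       // DStreams built from 5-second batches

val lines  = ssc.socketTextStream("localhost", 9999)    // one receiver consuming a socket source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()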

 

  1. Write a spark program to check if a given keyword exists in a huge text file or not?

def keywordExists(line):
    # return 1 if the keyword appears anywhere in the line, else 0
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda x, y: x + y)
print("Found" if total > 0 else "Not Found")

  1. What can you say about Spark Datasets?

Spark Datasets are those data structures of SparkSQL that provide JVM objects with all the benefits (such as data manipulation using lambda functions) of RDDs alongside the Spark SQL-optimized execution engine. Datasets were introduced in Spark version 1.6.

  • Spark datasets are strongly typed structures that represent the structured queries along with their encoders.
  • They provide type safety to the data and also give an object-oriented programming interface.
  • The datasets are more structured and have the lazy query expression which helps in triggering the action. Datasets have the combined powers of both RDD and Dataframes. Internally, each dataset symbolizes a logical plan which informs the computational query about the need for data production. Once the logical plan is analyzed and resolved, then the physical query plan is formed that does the actual query execution.

Datasets have the following features:

  • Optimized Query feature: Spark datasets provide optimized queries using Tungsten and Catalyst Query Optimizer frameworks. The Catalyst Query Optimizer represents and manipulates a data flow graph (graph of expressions and relational operators). The Tungsten improves and optimizes the speed of execution of Spark job by emphasizing the hardware architecture of the Spark execution platform.
  • Compile-Time Analysis: Datasets have the flexibility of analyzing and checking the syntaxes at the compile-time which is not technically possible in RDDs or Dataframes or the regular SQL queries.
  • Interconvertible: Type-safe Datasets can be converted to "untyped" DataFrames by making use of the following methods provided by the DatasetHolder (a minimal typed-Dataset sketch follows this list):
    • toDS(): Dataset[T]
    • toDF(): DataFrame
    • toDF(colNames: String*): DataFrame
  • Faster Computation: Dataset implementations are much faster than RDDs, which helps in increasing system performance.
  • Persistent storage qualified: Since the datasets are both queryable and serializable, they can be easily stored in any persistent storages.
  • Less Memory Consumed: Spark uses the feature of caching to create a more optimal data layout. Hence, less memory is consumed.
  • Single Interface Multiple Languages: Single API is provided for both Java and Scala languages. These are widely used languages for using Apache Spark. This results in a lesser burden of using libraries for different types of inputs.
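A typed Dataset can be built from a case class, assuming an existing SparkSession named spark:

case class Person(name: String, age: Int)

import spark.implicits._
val ds     = Seq(Person("Ann", 31), Person("Bob", 25)).toDS()   // strongly typed Dataset[Person]
val adults = ds.filter(_.age >= 18)                             // compile-time checked lambda
val asDf   = adults.toDF()                                      // drop to an "untyped" DataFrame when needed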
  1. Define Spark DataFrames.

Spark DataFrames are distributed collections of data organized into columns, similar to SQL tables. A DataFrame is equivalent to a table in a relational database and is mainly optimized for big data operations.
DataFrames can be created from an array of data sources such as external databases, existing RDDs, Hive tables, etc. Following are the features of Spark DataFrames (a short creation sketch follows this list):

  • Spark DataFrames have the ability to process data in sizes ranging from kilobytes to petabytes, on a single node or across large clusters.
  • They support different data formats like CSV, Avro, Elasticsearch, etc., and various storage systems like HDFS, Cassandra, MySQL, etc.
  • By making use of the SparkSQL Catalyst optimizer, state-of-the-art optimization is achieved.
  • It is possible to easily integrate Spark DataFrames with major Big Data tools using SparkCore.
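As a short creation sketch, assuming an existing SparkSession named spark and a hypothetical people.json file:

val df = spark.read.json("people.json")   // DataFrame from a structured data source
df.printSchema()
df.select("name").show()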
  1. Define Executor Memory in Spark

The applications developed in Spark have a fixed core count and fixed heap size defined for the Spark executors. The heap size refers to the memory of the Spark executor, controlled by the spark.executor.memory property, which corresponds to the --executor-memory flag. Every Spark application has one executor allocated on each worker node it runs on. The executor memory is a measure of the memory consumed by the worker node that the application utilizes.

  1. What are the functions of SparkCore?

SparkCore is the main engine that is meant for large-scale distributed and parallel data processing. The Spark core consists of the distributed execution engine that offers various APIs in Java, Python, and Scala for developing distributed ETL applications.
Spark Core does important functions such as memory management, job monitoring, fault-tolerance, storage system interactions, job scheduling, and providing support for all the basic I/O functionalities. There are various additional libraries built on top of Spark Core which allows diverse workloads for SQL, streaming, and machine learning. They are responsible for:

  • Fault recovery
  • Memory management and Storage system interactions
  • Job monitoring, scheduling, and distribution
  • Basic I/O functions
  1. What do you understand by worker node?

Worker nodes are the nodes that run the Spark application in a cluster. The Spark driver program listens for incoming connections, accepts them from the executors, and addresses the work to the worker nodes for execution. A worker node is like a slave node: it gets the work from its master node and actually executes it. The worker nodes process the data and report the resources used to the master. The master decides what amount of resources needs to be allocated, and then, based on their availability, the tasks are scheduled for the worker nodes by the master.

  1. What are some of the demerits of using Spark in applications?

Despite Spark being a powerful data processing engine, there are certain demerits to using Apache Spark in applications. Some of them are:

  • Spark makes use of more storage space when compared to MapReduce or Hadoop which may lead to certain memory-based problems.
  • Care must be taken by the developers while running the applications. The work should be distributed across multiple clusters instead of running everything on a single node.
  • Since Spark makes use of “in-memory” computations, they can be a bottleneck to cost-efficient big data processing.
  • While using files present on the path of the local filesystem, the files must be accessible at the same location on all the worker nodes when working on cluster mode as the task execution shuffles between various worker nodes based on the resource availabilities. The files need to be copied on all worker nodes or a separate network-mounted file-sharing system needs to be in place.
  • One of the biggest problems while using Spark is when using a large number of small files. When Spark is used with Hadoop, we know that HDFS gives a limited number of large files instead of a large number of small files. When there is a large number of small gzipped files, Spark needs to uncompress these files by keeping them on its memory and network. So large amount of time is spent in burning core capacities for unzipping the files in sequence and performing partitions of the resulting RDDs to get data in a manageable format which would require extensive shuffling overall. This impacts the performance of Spark as much time is spent preparing the data instead of processing them.
  • Spark doesn’t work well in multi-user environments as it is not capable of handling many users concurrently.
  1. How can the data transfers be minimized while working with Spark?

Data transfers correspond to the process of shuffling. Minimizing these transfers results in faster and more reliable Spark applications. There are various ways in which these can be minimized, as shown in the sketch after this list. They are:

  • Usage of Broadcast Variables: Broadcast variables increase the efficiency of joins between large and small RDDs.
  • Usage of Accumulators: These help to update the variable values parallelly during execution.
  • Another common way is to avoid the operations which trigger these reshuffles.
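One common concrete pattern is a broadcast join: hinting Spark to broadcast the small DataFrame avoids shuffling the large one. A sketch with hypothetical DataFrames largeDf and smallDf joined on a hypothetical customer_id column:

import org.apache.spark.sql.functions.broadcast

val joined = largeDf.join(broadcast(smallDf), Seq("customer_id"))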
  1. What is SchemaRDD in Spark RDD?

SchemaRDD is an RDD consisting of row objects (wrappers around integer arrays or strings) that carry schema information regarding the data type of each column. They were designed to ease the lives of developers while debugging the code and while running unit test cases on the SparkSQL modules. They represent the description of the RDD, which is similar to the schema of relational databases. SchemaRDD also provides the basic functionalities of the common RDDs along with some relational query interfaces of SparkSQL.

Consider an example. If you have an RDD named Person that represents a person’s data. Then SchemaRDD represents what data each row of Person RDD represents. If the Person has attributes like name and age, then they are represented in SchemaRDD.

  1. What module is used for implementing SQL in Apache Spark?

Spark provides a powerful module called SparkSQL which performs relational data processing combined with the power of Spark's functional programming features. This module supports querying data either by means of SQL or the Hive Query Language. It also provides support for different data sources and helps developers write powerful SQL queries using code transformations.
The four major libraries of SparkSQL are:

  • Data Source API
  • DataFrame API
  • Interpreter & Catalyst Optimizer
  • SQL Services

Spark SQL supports the usage of structured and semi-structured data in the following ways:

  • Spark supports DataFrame abstraction in various languages like Python, Scala, and Java along with providing good optimization techniques.
  • SparkSQL supports data read and writes operations in various structured formats like JSON, Hive, Parquet, etc.
  • SparkSQL allows data querying inside the Spark program and via external tools that do the JDBC/ODBC connections.
  • It is recommended to use SparkSQL inside the Spark applications as it empowers the developers to load the data, query the data from databases and write the results to the destination.
  1. What are the different persistence levels in Apache Spark?

Spark persists intermediary data from different shuffle operations automatically. But it is recommended to call the persist() method on the RDD. There are different persistence levels for storing the RDDs on memory or disk or both with different levels of replication. The persistence levels available in Spark are:

  • MEMORY_ONLY: This is the default persistence level and is used for storing the RDDs as the deserialized version of Java objects on the JVM. In case the RDDs are huge and do not fit in the memory, then the partitions are not cached and they will be recomputed as and when needed.
  • MEMORY_AND_DISK: The RDDs are stored again as deserialized Java objects on JVM. In case the memory is insufficient, then partitions not fitting on the memory will be stored on disk and the data will be read from the disk as and when needed.
  • MEMORY_ONLY_SER: The RDD is stored as serialized Java objects with a one-byte array per partition.
  • MEMORY_AND_DISK_SER: This level is similar to MEMORY_ONLY_SER, but the difference is that the partitions not fitting in the memory are saved on the disk to avoid recomputations on the fly.
  • DISK_ONLY: The RDD partitions are stored only on the disk.
  • OFF_HEAP: This level is the same as MEMORY_ONLY_SER, but here the data is stored in off-heap memory.

The syntax for using persistence levels in the persist() method is: 

df.persist(StorageLevel.<level_value>)

The following table summarizes the details of persistence levels:

| Persistence Level | Space Consumed | CPU Time | In-memory? | On-disk? |
| --- | --- | --- | --- | --- |
| MEMORY_ONLY | High | Low | Yes | No |
| MEMORY_ONLY_SER | Low | High | Yes | No |
| MEMORY_AND_DISK | High | Medium | Some | Some |
| MEMORY_AND_DISK_SER | Low | High | Some | Some |
| DISK_ONLY | Low | High | No | Yes |
| OFF_HEAP | Low | High | Yes (but off-heap) | No |

  1. What are the steps to calculate the executor memory?

Consider you have the below details regarding the cluster:

Number of nodes = 10

Number of cores in each node = 15 cores

RAM of each node = 61GB

To identify the number of cores, we follow the approach:

Number of cores = number of concurrent tasks that can be run in parallel by an executor. The optimal value, as a general rule of thumb, is 5.

Hence, to calculate the number of executors, we follow the below approach:

Number of executors per node = Number of cores per node / Concurrent tasks per executor
                             = 15 / 5
                             = 3

Total number of executors = Number of nodes * Number of executors per node
                          = 10 * 3
                          = 30 executors for the Spark job

Finally, the memory per executor can be estimated as the RAM per node divided by the executors per node: 61 GB / 3 ≈ 20 GB (in practice, a small fraction is left aside for OS and YARN overhead).

  1. Why do we need broadcast variables in Spark?

Broadcast variables let the developers maintain read-only variables cached on each machine instead of shipping a copy of it with tasks. They are used to give every node copy of a large input dataset efficiently. These variables are broadcasted to the nodes using different algorithms to reduce the cost of communication.

  1. Differentiate between Spark Datasets, Dataframes and RDDs.

| Criteria | Spark Datasets | Spark DataFrames | Spark RDDs |
| --- | --- | --- | --- |
| Representation of data | Spark Datasets are a combination of DataFrames and RDDs, with features like static type safety and object-oriented interfaces. | A Spark DataFrame is a distributed collection of data organized into named columns. | Spark RDDs are a distributed collection of data without a schema. |
| Optimization | Datasets make use of the Catalyst optimizer for optimization. | DataFrames also make use of the Catalyst optimizer for optimization. | There is no built-in optimization engine. |
| Schema projection | Datasets find out the schema automatically using the SQL engine. | DataFrames also find the schema automatically. | The schema needs to be defined manually for RDDs. |
| Aggregation speed | Dataset aggregation is faster than RDDs but slower than DataFrames. | Aggregations are faster in DataFrames due to the provision of easy and powerful APIs. | RDDs are slower than both DataFrames and Datasets, even for simple operations like data grouping. |

  1. Can Apache Spark be used along with Hadoop? If yes, then how?

Yes! The main feature of Spark is its compatibility with Hadoop. This makes it a powerful framework as using the combination of these two helps to leverage the processing capacity of Spark by making use of the best of Hadoop’s YARN and HDFS features.

Hadoop can be integrated with Spark in the following ways:

  • HDFS: Spark can be configured to run atop HDFS to leverage the feature of distributed replicated storage.
  • MapReduce: Spark can also be configured to run alongside the MapReduce in the same or different processing framework or Hadoop cluster. Spark and MapReduce can be used together to perform real-time and batch processing respectively.
  • YARN: Spark applications can be configured to run on YARN which acts as the cluster management framework.
  1. What are Sparse Vectors? How are they different from dense vectors?

Sparse vectors consist of two parallel arrays where one array is for storing indices and the other for storing values. These vectors are used to store non-zero values for saving space.

import org.apache.spark.ml.linalg.{Vector, Vectors}   // or org.apache.spark.mllib.linalg for the RDD-based API

val sparseVec: Vector = Vectors.sparse(5, Array(0, 4), Array(1.0, 2.0))

  • In the above example, we have the vector of size 5, but the non-zero values are there only at indices 0 and 4.
  • Sparse vectors are particularly useful when there are very few non-zero values. If there are cases that have only a few zero values, then it is recommended to use dense vectors as usage of sparse vectors would introduce the overhead of indices which could impact the performance.
  • Dense vectors can be defined as follows:

val denseVec = Vectors.dense(4405d,260100d,400d,5.0,4.0,198.0,9070d,1.0,1.0,2.0,0.0)

  • Usage of sparse or dense vectors does not impact the results of calculations but when used inappropriately, they impact the memory consumed and the speed of calculation.
  1. How are automatic clean-ups triggered in Spark for handling the accumulated metadata?

The clean-up tasks can be triggered automatically either by setting spark.cleaner.ttl parameter or by doing the batch-wise division of the long-running jobs and then writing the intermediary results on the disk.

  1. How is Caching relevant in Spark Streaming?

Spark Streaming involves the division of data stream’s data into batches of X seconds called DStreams. These DStreams let the developers cache the data into the memory which can be very useful in case the data of DStream is used for multiple computations. The caching of data can be done using the cache() method or using persist() method by using appropriate persistence levels. The default persistence level value for input streams receiving data over the networks such as Kafka, Flume, etc is set to achieve data replication on 2 nodes to accomplish fault tolerance.

  • Caching using cache method:

val cacheDf = dframe.cache()

  • Caching using persist method:

val persistDf = dframe.persist(StorageLevel.MEMORY_ONLY)

The main advantages of caching are:

  • Cost efficiency: Since Spark computations are expensive, caching helps to achieve reuse of data, and this reuse of computations can save the cost of operations.
  • Time efficiency: Reusing computations saves a lot of time.
  • More Jobs Achieved: By saving time of computation execution, the worker nodes can perform/execute more jobs.
  1. Define Piping in Spark.

Apache Spark provides the pipe() method on RDDs, which gives the opportunity to compose different parts of jobs that can use any language, following the UNIX Standard Streams. Using the pipe() method, an RDD transformation can be written which reads each element of the RDD as a String, manipulates it as required, and returns the result as a String.
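A minimal pipe() sketch, assuming an existing SparkContext sc and that the external command is available on the worker nodes:

val piped = sc.parallelize(Seq("spark core", "hadoop", "spark streaming"))
  .pipe("grep spark")            // each element is written to the command's stdin as a line
piped.collect().foreach(println) // keeps only the lines the external command echoed back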

  1. What API is used for Graph Implementation in Spark?

Spark provides a powerful API called GraphX that extends Spark RDD for supporting graphs and graph-based computations. The extended property of Spark RDD is called as Resilient Distributed Property Graph which is a directed multi-graph that has multiple parallel edges. Each edge and the vertex has associated user-defined properties. The presence of parallel edges indicates multiple relationships between the same set of vertices. GraphX has a set of operators such as subgraph, mapReduceTriplets, joinVertices, etc that can support graph computation. It also includes a large collection of graph builders and algorithms for simplifying tasks related to graph analytics.

  1. How can you achieve machine learning in Spark?

Spark provides a very robust, scalable machine learning-based library called MLlib. This library aims at implementing easy and scalable common ML-based algorithms and has the features like classification, clustering, dimensional reduction, regression filtering, etc.

Part 3:

  1. How is Apache Spark different from MapReduce?

| Apache Spark | MapReduce |
| --- | --- |
| Spark processes data in batches as well as in real-time. | MapReduce processes data in batches only. |
| Spark runs almost 100 times faster than Hadoop MapReduce. | Hadoop MapReduce is slower when it comes to large-scale data processing. |
| Spark stores data in RAM, i.e. in-memory, so it is easier to retrieve. | Hadoop MapReduce data is stored in HDFS and hence takes a long time to retrieve. |
| Spark provides caching and in-memory data storage. | Hadoop is highly disk-dependent. |

 

  1. What are the important components of the Spark ecosystem?

Apache Spark has 3 main categories that comprise its ecosystem. Those are:

  • Language support: Spark can integrate with different languages for application development and analytics. These languages are Java, Python, Scala, and R.
  • Core Components: Spark supports 5 main core components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.
  • Cluster Management: Spark can be run in 3 environments. Those are the Standalone cluster, Apache Mesos, and YARN.
  1. Explain how Spark runs applications with the help of its architecture.

This is one of the most frequently asked spark interview questions, and the interviewer will expect you to give a thorough answer to it.

Spark applications run as independent processes that are coordinated by the Spark Session object in the driver program. The resource manager or cluster manager assigns tasks to the worker nodes with one task per partition. Iterative algorithms apply operations repeatedly to the data so they can benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Finally, the results are sent back to the driver application or can be saved to the disk.

  1. What are the different cluster managers available in Apache Spark?
  • Standalone Mode: By default, applications submitted to the standalone mode cluster will run in FIFO order, and each application will try to use all available nodes. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.
  • Apache Mesos: Apache Mesos is an open-source project to manage computer clusters, and can also run Hadoop applications. The advantages of deploying Spark with Mesos include dynamic partitioning between Spark and other frameworks as well as scalable partitioning between multiple instances of Spark.
  • Hadoop YARN: Apache YARN is the cluster resource manager of Hadoop 2. Spark can be run on YARN as well.
  • Kubernetes: Kubernetesis an open-source system for automating deployment, scaling, and management of containerized applications.
  1. What is the significance of Resilient Distributed Datasets in Spark?

Resilient Distributed Datasets are the fundamental data structure of Apache Spark. They are embedded in Spark Core. RDDs are immutable, fault-tolerant, distributed collections of objects that can be operated on in parallel. RDDs are split into partitions and can be executed on different nodes of a cluster.

RDDs are created by either transformation of existing RDDs or by loading an external dataset from stable storage like HDFS or HBase.
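Two common ways of creating an RDD, assuming an existing SparkContext named sc and a hypothetical input path:

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))        // parallelized collection
val fromStorage    = sc.textFile("hdfs:///data/events.log")    // loaded from external storage
val doubled        = fromCollection.map(_ * 2)                 // transformation producing a new RDD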


  1. What is a lazy evaluation in Spark?

When Spark operates on any dataset, it remembers the instructions. When a transformation such as a map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action, which aids in optimizing the overall data processing workflow, known as lazy evaluation.

  1. What makes Spark good at low latency workloads like graph processing and Machine Learning?

Apache Spark stores data in-memory for faster processing and building machine learning models. Machine learning algorithms require multiple iterations and different conceptual steps to create an optimal model. Graph algorithms traverse all the nodes and edges to generate a graph. Because the data stays in memory across iterations, these low-latency, iterative workloads see increased performance.

  1. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

To trigger the clean-ups, you need to set the parameter spark.cleaner.ttl.

  1. How can you connect Spark to Apache Mesos?

There are a total of 4 steps that can help you connect Spark to Apache Mesos.

  • Configure the Spark Driver program to connect with Apache Mesos
  • Put the Spark binary package in a location accessible by Mesos
  • Install Spark in the same location as that of the Apache Mesos
  • Configure the spark.mesos.executor.home property for pointing to the location where Spark is installed
  1. What is a Parquet file and what are its advantages?

Parquet is a columnar format that is supported by several data processing systems. With the Parquet file, Spark can perform both read and write operations. 

Some of the advantages of having a Parquet file are:

  • It enables you to fetch specific columns for access.
  • It consumes less space
  • It follows the type-specific encoding
  • It limits I/O operations
  1. What is shuffling in Spark? When does it occur?

Shuffling is the process of redistributing data across partitions that may lead to data movement across the executors. The shuffle operation is implemented differently in Spark compared to Hadoop. 

Shuffling has 2 important compression parameters:

  • spark.shuffle.compress – checks whether the engine would compress shuffle outputs or not
  • spark.shuffle.spill.compress – decides whether to compress intermediate shuffle spill files or not

It occurs while joining two tables or while performing byKey operations such as GroupByKey or ReduceByKey

  1. What is the use of coalesce in Spark?

Spark uses a coalesce method to reduce the number of partitions in a DataFrame.

Suppose you want to read data from a CSV file into an RDD having four partitions.

A filter operation is then performed to remove all the multiples of 10 from the data.

The resulting RDD has some partly empty partitions. It makes sense to reduce the number of partitions, which can be achieved by using coalesce; a minimal sketch of this sequence follows below.
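A minimal sketch of that sequence, assuming an existing SparkContext sc and a hypothetical numbers.csv file containing one integer per line:

val numbers  = sc.textFile("numbers.csv", 4)        // RDD read with four partitions
val filtered = numbers.filter(_.toInt % 10 != 0)    // drop multiples of 10; some partitions shrink
val compact  = filtered.coalesce(2)                 // merge into fewer, fuller partitions
println(compact.getNumPartitions)                   // 2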

  1. How can you calculate the executor memory?

Consider the same kind of cluster information as in the earlier executor-memory question: the number of nodes, the cores per node, and the RAM per node.

First identify the number of cores, i.e. the number of concurrent tasks an executor can run (a common rule of thumb is 5).

Then calculate the number of executors per node as cores per node divided by cores per executor, multiply by the number of nodes for the total executor count, and divide the RAM per node by the executors per node to estimate the memory per executor, as worked out in the Part 2 example above.

  1. What are the various functionalities supported by Spark Core?

Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:

  • Scheduling and monitoring jobs
  • Memory management
  • Fault recovery
  • Task dispatching
  1. How do you convert a Spark RDD into a DataFrame?

There are 2 ways to convert a Spark RDD into a DataFrame:

  • Using the helper function – toDF

import com.mapr.db.spark.sql._

val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name").toDF()

  • Using SparkSession.createDataFrame

You can convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object:

def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame

  1. Explain the types of operations supported by RDDs.

RDDs support 2 types of operation:

Transformations: Transformations are operations that are performed on an RDD to create a new RDD containing the results (Example: map, filter, join, union)

Actions: Actions are operations that return a value after running a computation on an RDD (Example: reduce, first, count) 
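A short sketch of both kinds of operation, assuming an existing SparkContext sc:

val nums = sc.parallelize(1 to 10)

// Transformations (lazy - each returns a new RDD):
val evens   = nums.filter(_ % 2 == 0)
val squared = evens.map(n => n * n)

// Actions (eager - each returns a value to the driver):
val total = squared.reduce(_ + _)
val count = squared.count()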

  1. What is a Lineage Graph?

This is another frequently asked spark interview question. A Lineage Graph is a dependencies graph between the existing RDD and the new RDD. It means that all the dependencies between the RDD will be recorded in a graph,  rather than the original data.

The need for an RDD lineage graph happens when we want to compute a new RDD or if we want to recover the lost data from the lost persisted RDD. Spark does not support data replication in memory. So, if any data is lost, it can be rebuilt using RDD lineage. It is also called an RDD operator graph or RDD dependency graph.

  1. What do you understand about DStreams in Spark?

Discretized Streams is the basic abstraction provided by Spark Streaming. 

It represents a continuous stream of data that is either in the form of an input source or processed data stream generated by transforming the input stream.

  1. Explain Caching in Spark Streaming.

Caching, also known as persistence, is an optimization technique for Spark computations. Similar to RDDs, DStreams also allow developers to persist the stream's data in memory. That is, using the persist() method on a DStream will automatically persist every RDD of that DStream in memory. It helps to save interim partial results so they can be reused in subsequent stages.

For input streams that receive data over the network, the default persistence level is set to replicate the data to two nodes for fault tolerance.

  1. What is the need for broadcast variables in Spark?

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication costs.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value

res0: Array[Int] = Array(1, 2, 3)

  1. How to programmatically specify a schema for DataFrame?

A DataFrame can be created programmatically with three steps (a minimal sketch follows this list):

  • Create an RDD of Rows from the original RDD;
  • Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
  • Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.
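A minimal sketch of the three steps, assuming an existing SparkSession spark and a hypothetical RDD of (String, Int) pairs named peopleRdd:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rowRdd = peopleRdd.map { case (name, age) => Row(name, age) }   // Step 1: RDD of Rows
val schema = StructType(Seq(                                        // Step 2: matching StructType
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)))
val peopleDf = spark.createDataFrame(rowRdd, schema)                // Step 3: apply the schema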
  1. Which transformation returns a new DStream by selecting only those records of the source DStream for which the function returns true?

  a. map(func)
  b. transform(func)
  c. filter(func)
  d. count()

The correct answer is c) filter(func).

  1. Does Apache Spark provide checkpoints?

This is one of the most frequently asked spark interview questions where the interviewer expects a detailed answer (and not just a yes or no!). Give as detailed an answer as possible here.

Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save the data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and start from where it stopped.

There are 2 types of data for which we can use checkpointing in Spark.

Metadata Checkpointing: Metadata means the data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.

Data Checkpointing: Here, we save the RDD to reliable storage because its need arises in some of the stateful transformations. In this case, the upcoming RDD depends on the RDDs of previous batches. 
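A minimal sketch of enabling checkpointing, assuming an existing StreamingContext ssc and a hypothetical fault-tolerant directory:

ssc.checkpoint("hdfs:///checkpoints/my-streaming-app")   // stores both metadata and data checkpoints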

  1. What do you mean by sliding window operation?

A sliding window controls the transmission of data packets between multiple computer networks. The Spark Streaming library provides windowed computations, where transformations on RDDs are applied over a sliding window of data; a small sketch follows below.
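A small windowed-computation sketch, assuming a hypothetical DStream of (word, 1) pairs named pairs:

import org.apache.spark.streaming.Seconds

// recompute counts over the last 30 seconds of data, every 10 seconds
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))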

  1. What are the different levels of persistence in Spark?

DISK_ONLY – Stores the RDD partitions only on the disk

MEMORY_ONLY_SER – Stores the RDD as serialized Java objects with a one-byte array per partition

MEMORY_ONLY – Stores the RDD as deserialized Java objects in the JVM. If the RDD is not able to fit in the memory available, some partitions won’t be cached

OFF_HEAP – Works like MEMORY_ONLY_SER but stores the data in off-heap memory

MEMORY_AND_DISK – Stores RDD as deserialized Java objects in the JVM. In case the RDD is not able to fit in the memory, additional partitions are stored on the disk

MEMORY_AND_DISK_SER – Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk

  1. What is the difference between map and flatMap transformation in Spark Streaming?

| map() | flatMap() |
| --- | --- |
| map returns a new DStream (or RDD) by passing each element of the source through a function func. | flatMap is similar to map; it applies a function to each element of the DStream (or RDD) and returns a new one with the results flattened. |
| The map function takes one element as input, processes it according to custom code (specified by the developer), and returns exactly one element at a time. | flatMap allows returning 0, 1, or more elements from the mapping function for every input element. |

  1. How would you compute the total count of unique words in Spark?

Step 1. Load the text file as an RDD:

lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")

Step 2. Define a function that breaks each line into words:

def toWords(line):
    return line.split()

Step 3. Run the toWords function on each element of the RDD as a flatMap transformation:

words = lines.flatMap(toWords)

Step 4. Convert each word into a (key, value) pair:

def toTuple(word):
    return (word, 1)

wordsTuple = words.map(toTuple)

Step 5. Perform the reduceByKey() operation:

def add(x, y):
    return x + y

counts = wordsTuple.reduceByKey(add)

Step 6. Print the per-word counts; counts.count() then gives the total number of unique words:

counts.collect()

  1. Suppose you have a huge text file. How will you check if a particular keyword exists using Spark?

lines = sc.textFile("hdfs://Hadoop/user/test_file.txt")

def isFound(line):
    # return 1 if the keyword appears anywhere in the line, else 0
    if line.find("my_keyword") > -1:
        return 1
    return 0

foundBits = lines.map(isFound)
total = foundBits.reduce(lambda x, y: x + y)

if total > 0:
    print("Found")
else:
    print("Not Found")

  1. What is the role of accumulators in Spark?

Accumulators are variables used for aggregating information across the executors. This information can be about the data or API diagnosis like how many records are corrupted or how many times a library API was called.

  1. What are the different MLlib tools available in Spark?
  • ML Algorithms: Classification, Regression, Clustering, and Collaborative filtering
  • Featurization: Feature extraction, Transformation, Dimensionality reduction, and Selection

  • Pipelines: Tools for constructing, evaluating, and tuning ML pipelines
  • Persistence: Saving and loading algorithms, models, and pipelines
  • Utilities: Linear algebra, statistics, data handling
  1. What are the different data types supported by Spark MLlib?

Spark MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices.

Local Vector: MLlib supports two types of local vectors – dense and sparse

Example: vector(1.0, 0.0, 3.0)

dense format: [1.0, 0.0, 3.0]

sparse format: (3, [0, 2], [1.0, 3.0]) – size 3, non-zero indices [0, 2], values [1.0, 3.0]

Labeled point: A labeled point is a local vector, either dense or sparse that is associated with a label/response.

Example: In binary classification, a label should be either 0 (negative) or 1 (positive)

Local Matrix: A local matrix has integer type row and column indices, and double type values that are stored in a single machine.

Distributed Matrix: A distributed matrix has long-type row and column indices and double-type values, and is stored in a distributed manner in one or more RDDs. 

Types of the distributed matrix:

  • RowMatrix
  • IndexedRowMatrix
  • CoordinateMatrix
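A minimal sketch of constructing these types with the RDD-based MLlib API (assuming a Spark installation; the values mirror the example above):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val dense  = Vectors.dense(1.0, 0.0, 3.0)                       // dense local vector
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))    // the same vector in sparse form

val positive = LabeledPoint(1.0, dense)                         // a labeled point with label 1.0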
  1. What is a Sparse Vector?

A Sparse vector is a type of local vector which is represented by an index array and a value array.

public class SparseVector extends Object implements Vector

Example: sparse1 = SparseVector(4, [1, 3], [3.0, 4.0])

where:

4 is the size of the vector

[1,3] are the ordered indices of the vector

[3.0, 4.0] are the values at those indices


  1. Describe how model creation works with MLlib and how the model is applied.

MLlib has 2 components:

Transformer: A transformer reads a DataFrame and returns a new DataFrame with a specific transformation applied.

Estimator: An estimator is a machine learning algorithm that takes a DataFrame to train a model and returns the model as a transformer.

Spark MLlib lets you combine multiple transformations into a pipeline to apply complex data transformations.

A typical pipeline chains featurization transformers with an estimator to train a model; the fitted model (itself a transformer) can then be applied to live data.
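A minimal Pipeline sketch (assuming a SparkSession and hypothetical DataFrames trainingDF, with "text" and "label" columns, and liveDF):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")      // transformer
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")  // transformer
val lr = new LogisticRegression().setMaxIter(10)                               // estimator

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model = pipeline.fit(trainingDF)    // training produces a model, which is itself a transformer
val scored = model.transform(liveDF)    // apply the fitted model to live data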

  1. What are the functions of Spark SQL?

Spark SQL is Apache Spark’s module for working with structured data.

Spark SQL loads the data from a variety of structured data sources.

It queries data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).

It provides a rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables and expose custom functions in SQL.

  1. How can you connect Hive to Spark SQL?

To connect Hive to Spark SQL, place the hive-site.xml file in the conf directory of Spark.

Using the Spark Session object, you can construct a DataFrame.

result = spark.sql("select * from <hive_table>")
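A minimal sketch of building a Hive-enabled session (hive-site.xml is assumed to be in Spark's conf directory; <hive_table> remains a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveExample")
  .enableHiveSupport()      // picks up the Hive metastore configured in hive-site.xml
  .getOrCreate()

val result = spark.sql("select * from <hive_table>")
result.show()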

  1. What is the role of Catalyst Optimizer in Spark SQL?

The Catalyst optimizer leverages advanced programming language features (such as Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.
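A quick way to see Catalyst at work is to ask Spark for its query plans (a sketch, assuming the DataFrame df from the example in the next answer):

import org.apache.spark.sql.functions.col

df.filter(col("age") > 21).explain(true)   // prints the parsed, analyzed, and optimized logical plans plus the physical plan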

  1. How can you manipulate structured data using domain-specific language in Spark SQL?

Structured data can be manipulated using domain-Specific language as follows:

Suppose there is a DataFrame with the following information:

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+

  1. What are the different types of operators provided by the Apache GraphX library?

Property Operator: Property operators modify the vertex or edge properties using a user-defined map function and produce a new graph.

Structural Operator: Structure operators operate on the structure of an input graph and produce a new graph.

Join Operator: Join operators add data to graphs and generate new graphs.

  1. What are the analytic algorithms provided in Apache Spark GraphX?

GraphX is Apache Spark’s API for graphs and graph-parallel computation. GraphX includes a set of graph algorithms to simplify analytics tasks. The algorithms are contained in the org.apache.spark.graphx.lib package and can be accessed directly as methods on Graph via GraphOps. 

PageRank: PageRank is a graph parallel computation that measures the importance of each vertex in a graph. Example: You can run PageRank to evaluate what the most important pages in Wikipedia are.

Connected Components: The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters.

Triangle Counting: A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the TriangleCount object that determines the number of triangles passing through each vertex, providing a measure of clustering.

  1. What is the PageRank algorithm in Apache Spark GraphX?

It is a plus point if you are able to explain this spark interview question thoroughly, along with an example! PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u.

If a Twitter user is followed by many other users, that handle will be ranked high.

PageRank algorithm was originally developed by Larry Page and Sergey Brin to rank websites for Google. It can be applied to measure the influence of vertices in any network graph. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The assumption is that more important websites are likely to receive more links from other websites.

A typical example of using Scala’s functional programming with Apache Spark RDDs to iteratively compute Page Ranks is shown below:
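A minimal sketch of the classic iterative PageRank over RDDs, adapted from Spark's own examples (assuming a SparkContext sc and a hypothetical links file of "sourceURL neighbourURL" pairs):

val lines = sc.textFile("hdfs://Hadoop/user/links.txt")

val links = lines.map { line =>
  val parts = line.split("\\s+")
  (parts(0), parts(1))                                 // (source, neighbour)
}.distinct().groupByKey().cache()

var ranks = links.mapValues(_ => 1.0)                  // start every page with rank 1.0

for (_ <- 1 to 10) {                                   // 10 iterations
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))           // spread a page's rank over its outgoing links
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.collect().foreach { case (url, rank) => println(s"$url has rank $rank") }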

Part 4:

  1. Compare Hadoop and Spark.

We will compare Hadoop MapReduce and Spark based on the following aspects:

Apache Spark vs. Hadoop

  • Speed: Apache Spark runs up to 100 times faster than Hadoop; Hadoop offers decent speed.
  • Processing: Spark supports both real-time and batch processing; Hadoop supports batch processing only.
  • Difficulty: Spark is easy to use because of its high-level modules; Hadoop MapReduce is tough to learn.
  • Recovery: Spark allows recovery of partitions; Hadoop is fault-tolerant.
  • Interactivity: Spark has interactive modes; Hadoop has no interactive mode except Pig and Hive.

Table: Apache Spark versus Hadoop

Let us understand the same using an interesting analogy.

“Single cook cooking an entree is regular computing. Hadoop is multiple cooks cooking an entree into pieces and letting each cook cook her piece.
Each cook has a separate stove and a food shelf. The first cook cooks the meat, the second cook cooks the sauce. This phase is called “Map”. At the end the main cook assembles the complete entree. This is called “Reduce”. For Hadoop, the cooks are not allowed to keep things on the stove between operations. Each time you make a particular operation, the cook puts results on the shelf. This slows things down.
For Spark, the cooks are allowed to keep things on the stove between operations. This speeds things up. Finally, for Hadoop the recipes are written in a language which is illogical and hard to understand. For Spark, the recipes are nicely written.” – Stan Kladko, Galactic Exchange.io

  1. What is Apache Spark?
  • Apache Spark is an open-source cluster computing framework for real-time processing.
  • It has a thriving open-source community and is the most active Apache project at the moment.
  • Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

Spark is one of the most successful projects in the Apache Software Foundation. Spark has clearly evolved as the market leader for Big Data processing. Many organizations run Spark on clusters with thousands of nodes. Today, Spark is being adopted by major players like Amazon, eBay, and Yahoo!

  1. Explain the key features of Apache Spark.

The following are the key features of Apache Spark:

  1. Polyglot
  2. Speed
  3. Multiple Format Support
  4. Lazy Evaluation
  5. Real Time Computation
  6. Hadoop Integration
  7. Machine Learning

Let us look at these features in detail:

  1. Polyglot: Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any of these four languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and Python shell through ./bin/pyspark from the installed directory.
  2. Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark is able to achieve this speed through controlled partitioning. It manages data using partitions that help parallelize distributed data processing with minimal network traffic.
  3. Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra. The Data Sources API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
  4. Lazy Evaluation: Apache Spark delays its evaluation till it is absolutely necessary. This is one of the key factors contributing to its speed. For transformations, Spark adds them to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.
  5. Real Time Computation: Spark’s computation is real-time and has less latency because of its in-memory computation. Spark is designed for massive scalability and the Spark team has documented users of the system running production clusters with thousands of nodes and supports several computational models.
  6. Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a great boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, while Spark has the ability to run on top of an existing Hadoop cluster using YARN for resource scheduling. 
  7. Machine Learning: Spark’s MLlib is the machine learning component which is handy when it comes to big data processing. It eradicates the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.
  1. What are the languages supported by Apache Spark and which is the most popular one?

Apache Spark supports the following four languages: Scala, Java, Python and R. Among these languages, Scala and Python have interactive shells for Spark. The Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark. Scala is the most used among them because Spark itself is written in Scala, which makes it the most popular language for Spark development.

  1. What are benefits of Spark over MapReduce?

Spark has the following benefits over MapReduce:

  1. Due to the availability of in-memory processing, Spark implements the processing around 10 to 100 times faster than Hadoop MapReduce, whereas MapReduce makes use of persistent storage for any of the data processing tasks.
  2. Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks from the same core, like batch processing, streaming, machine learning, and interactive SQL queries. However, Hadoop only supports batch processing.
  3. Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
  4. Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation while there is no iterative computing implemented by Hadoop.
  1. What is YARN?

Similar to Hadoop, YARN is one of the key features in Spark, providing a central and resource management platform to deliver scalable operations across the cluster. YARN is a distributed container manager, like Mesos for example, whereas Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. Running Spark on YARN necessitates a binary distribution of Spark that is built with YARN support.

  1. Do you need to install Spark on all nodes of YARN cluster?

No, because Spark runs on top of YARN and is independent of its installation on each node. Spark has some options to use YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are some configurations to run on YARN. They include master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue.

  1. Is there any benefit of learning MapReduce if Spark is better than MapReduce?

Yes, MapReduce is a paradigm used by many big data tools including Spark as well. It is extremely relevant to use MapReduce when the data grows bigger and bigger. Most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.

  1. Explain the concept of Resilient Distributed Dataset (RDD).

RDD stands for Resilient Distribution Datasets. An RDD is a fault-tolerant collection of operational elements that run in parallel. The partitioned data in RDD is immutable and distributed in nature. There are primarily two types of RDD:

  1. Parallelized Collections: Here, existing collections are parallelized so that they run in parallel with one another.
  2. Hadoop Datasets: They perform functions on each file record in HDFS or other storage systems.

RDDs are basically parts of data that are stored in the memory distributed across many nodes. RDDs are lazily evaluated in Spark. This lazy evaluation is what contributes to Spark’s speed.

  1. How do we create RDDs in Spark?

Spark provides two methods to create RDD:

  1. By parallelizing a collection in your Driver program. This makes use of SparkContext’s ‘parallelize’ method:

val DataArray = Array(2,4,6,8,10)

val DataRDD = sc.parallelize(DataArray)

  2. By loading an external dataset from external storage like HDFS, HBase, or a shared file system.
  2. What is Executor Memory in a Spark application?

Every Spark application has the same fixed heap size and fixed number of cores for a Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.

  1. Define Partitions in Apache Spark.

As the name suggests, a partition is a smaller and logical division of data, similar to a ‘split’ in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data to speed up data processing. Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks. Everything in Spark is a partitioned RDD.

  1. What operations does RDD support?

RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. An RDD is a distributed collection of objects. Distributed means each RDD is divided into multiple partitions. Each of these partitions can reside in memory or be stored on the disk of different machines in a cluster. RDDs are immutable (read-only) data structures. You can’t change the original RDD, but you can always transform it into a different RDD with all the changes you want.

RDDs support two types of operations: transformations and actions. 

Transformations: Transformations create new RDD from existing RDD like map, reduceByKey and filter we just saw. Transformations are executed on demand. That means they are computed lazily.

Actions: Actions return the final results of RDD computations. Actions trigger execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the Driver program or write them out to the file system.

  1. What do you understand by Transformations in Spark?

Transformations are functions applied on an RDD, resulting in another RDD. A transformation does not execute until an action occurs. map() and filter() are examples of transformations: the former applies the function passed to it on each element of the RDD and results in another RDD, while filter() creates a new RDD by selecting elements from the current RDD that pass the function argument.

val rawData = sc.textFile("path to/movies.txt")

val moviesData = rawData.map(x => x.split("  "))

As we can see here, rawData RDD is transformed into moviesData RDD. Transformations are lazily evaluated.

  1. Define Actions in Spark.

An action helps in bringing back the data from RDD to the local machine. An action’s execution is the result of all previously created transformations. Actions triggers execution using lineage graph to load the data into original RDD, carry out all intermediate transformations and return final results to Driver program or write it out to file system.

reduce() is an action that applies the function passed to it repeatedly until only one value is left. take() is an action that brings values from the RDD to the local node.

moviesData.saveAsTextFile("MoviesData.txt")

As we can see here, moviesData RDD is saved into a text file called MoviesData.txt

  1. Define functions of SparkCore.

Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. SparkCore performs various important functions like memory management, monitoring jobs, fault-tolerance, job scheduling and interaction with storage systems. Further, additional libraries, built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:

  1. Memory management and fault recovery
  2. Scheduling, distributing and monitoring jobs on a cluster
  3. Interacting with storage systems
  1. What do you understand by Pair RDD?

Apache defines PairRDD functions class as

class PairRDDFunctions[K, V] extends Logging with HadoopMapReduceUtil with Serializable

Special operations can be performed on RDDs in Spark using key/value pairs and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key and a join() method that combines different RDDs together, based on the elements having the same key.
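A minimal sketch of reduceByKey() and join() on pair RDDs (assuming a SparkContext sc; the data is illustrative):

val sales  = sc.parallelize(Seq(("apple", 2), ("banana", 3), ("apple", 4)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("banana", 0.25)))

val totals = sales.reduceByKey(_ + _)      // (apple,6), (banana,3)
val joined = totals.join(prices)           // (apple,(6,0.5)), (banana,(3,0.25))
joined.collect().foreach(println)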

  1. Name the components of Spark Ecosystem.
  1. Spark Core: Base engine for large-scale parallel and distributed data processing
  2. Spark Streaming: Used for processing real-time streaming data
  3. Spark SQL: Integrates relational processing with Spark’s functional programming API
  4. GraphX: Graphs and graph-parallel computation
  5. MLlib: Performs machine learning in Apache Spark
  1. How is Streaming implemented in Spark? Explain with examples.

Spark Streaming is used for processing real-time streaming data. Thus it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is DStream which is basically a series of RDDs (Resilient Distributed Datasets) to process the real-time data. The data from different sources like Flume, HDFS is streamed and finally processed to file systems, live dashboards and databases. It is similar to batch processing as the input data is divided into streams like batches.
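A minimal word-count sketch over a socket stream (assuming a SparkContext sc and a text source on localhost:9999, e.g. started with nc -lk 9999):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))            // 10-second batches
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()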

  1. Is there an API for implementing graphs in Spark?

GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.

The property graph is a directed multi-graph which can have multiple edges in parallel. Every edge and vertex have user defined properties associated with it. Here, the parallel edges allow multiple relationships between the same vertices. At a high-level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.

To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

  1. What is PageRank in GraphX?

PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly.

GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank Object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphOps allows calling these algorithms directly as methods on Graph.
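A minimal sketch of both variants (assuming a SparkContext sc and a hypothetical edge-list file of follower pairs):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs://Hadoop/user/followers.txt")

val staticRanks  = graph.staticPageRank(10).vertices     // fixed number of iterations
val dynamicRanks = graph.pageRank(0.0001).vertices       // iterate until ranks converge within the tolerance
dynamicRanks.take(5).foreach(println)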

  1. How is machine learning implemented in Spark?

MLlib is the scalable machine learning library provided by Spark. It aims at making machine learning easy and scalable with common learning algorithms and use cases like clustering, regression, collaborative filtering, dimensionality reduction, and the like.

  1. Is there a module to implement SQL in Spark? How does it work?

Spark SQL is a new module in Spark which integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those of you familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools where you can extend the boundaries of traditional relational data processing. 

Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.

The following are the four libraries of Spark SQL.

  1. Data Source API
  2. DataFrame API
  3. Interpreter & Optimizer
  4. SQL Service
  1. What is a Parquet file?

Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.

Parquet is a columnar format, supported by many data processing systems. The advantages of having a columnar storage are as follows:

  1. Columnar storage limits IO operations.
  2. It can fetch specific columns that you need to access.
  3. Columnar storage consumes less space.
  4. It gives better-summarized data and follows type-specific encoding.
  1. How can Apache Spark be used alongside Hadoop?

The best part of Apache Spark is its compatibility with Hadoop. As a result, this makes for a very powerful combination of technologies. Here, we will be looking at how Spark can benefit from the best of Hadoop. Using Spark and Hadoop together helps us to leverage Spark’s processing to utilize the best of Hadoop’s HDFS and YARN. 

Hadoop components can be used alongside Spark in the following ways:

  1. HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
  2. MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
  3. YARN: Spark applications can also be run on YARN (Hadoop NextGen).
  4. Batch & Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing.
  1. What is RDD Lineage?

Spark does not support data replication in memory; thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is a process that reconstructs lost data partitions. The best part is that an RDD always remembers how to build itself from other datasets.

  1. What is Spark Driver?

Spark Driver is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. In simple terms, a driver in Spark creates SparkContext, connected to a given Spark Master.
The driver also delivers the RDD graphs to Master, where the standalone cluster manager runs.

  1. What file systems does Spark support?

The following three file systems are supported by Spark:

  1. Hadoop Distributed File System (HDFS).
  2. Local File system.
  3. Amazon S3
  1. List the functions of Spark SQL.

Spark SQL is capable of:

  1. Loading data from a variety of structured sources.
  2. Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau. 
  3. Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
  1. What is Spark Executor?

When SparkContext connects to a cluster manager, it acquires an Executor on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.

  1. Name types of Cluster Managers in Spark.

The Spark framework supports three major types of Cluster Managers:

  1. Standalone: A basic manager to set up a cluster.
  2. Apache Mesos: Generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other applications.
  3. YARN: Responsible for resource management in Hadoop.
  1. What do you understand by worker node?

Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes. 

The worker node is basically the slave node. The master node assigns work and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report the resources to the master. Based on resource availability, the master schedules tasks.

  1. Illustrate some demerits of using Spark.

The following are some of the demerits of using Apache Spark:

  1. Since Spark utilizes more storage space compared to Hadoop and MapReduce, there may arise certain problems.
  2. Developers need to be careful while running their applications in Spark.
  3. Instead of running everything on a single node, the work must be distributed over multiple clusters.
  4. Spark’s “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
  5. Spark consumes a huge amount of memory when compared to Hadoop.
  1. List some use cases where Spark outperforms Hadoop in processing.
  1. Sensor Data Processing: Apache Spark’s “In-memory” computing works best here, as data is retrieved and combined from different sources.
  2. Real Time Processing: Spark is preferred over Hadoop for real-time querying of data, e.g., Stock Market Analysis, Banking, Healthcare, Telecommunications, etc.
  3. Stream Processing: For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.
  4. Big Data Processing: Spark runs up to 100 times faster than Hadoop when it comes to processing medium and large-sized datasets.
  1. What is a Sparse Vector?

A sparse vector has two parallel arrays; one for indices and the other for values. These vectors are used for storing non-zero entries to save space.

  1. Can you use Spark to access and analyze data stored in Cassandra databases?

Yes, it is possible if you use the Spark Cassandra Connector. To connect Spark to a Cassandra cluster, a Cassandra Connector will need to be added to the Spark project. In this setup, a Spark executor will talk to a local Cassandra node and will only query for local data. It makes queries faster by reducing the usage of the network to send data between Spark executors (to process data) and Cassandra nodes (where data lives).

  1. Is it possible to run Apache Spark on Apache Mesos?

Yes, Apache Spark can be run on the hardware clusters managed by Mesos. In a standalone cluster deployment, the cluster manager is a Spark master instance. When using Mesos, the Mesos master replaces the Spark master as the cluster manager. Mesos determines what machines handle what tasks. Because it takes into account other frameworks when scheduling these many short-lived tasks, multiple frameworks can coexist on the same cluster without resorting to a static partitioning of resources.

  1. How can Spark be connected to Apache Mesos?

To connect Spark with Mesos:

  1. Configure the spark driver program to connect to Mesos.
  2. Spark binary package should be in a location accessible by Mesos.
  3. Install Apache Spark in the same location as that of Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.
  1. How can you minimize data transfers when working with Spark?

Minimizing data transfers and avoiding shuffling helps write spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:

  1. Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs.
  2. Using Accumulators – Accumulators help update the values of variables in parallel while executing.

The most common way is to avoid operations like ByKey and repartition, or any other operations that trigger shuffles.

  1. What are broadcast variables?

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
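A minimal sketch of broadcasting a small lookup table (assuming a SparkContext sc; the data is illustrative):

val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

val codes = sc.parallelize(Seq("IN", "US", "IN"))
val names = codes.map(code => countryNames.value.getOrElse(code, "Unknown"))   // executors read the cached copy
names.collect()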

  1. Explain accumulators in Apache Spark.

Accumulators are variables that are only added through an associative and commutative operation. They are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators. We can create named or unnamed accumulators.

  1. Why is there a need for broadcast variables when working with Apache Spark?

Broadcast variables are read only variables, present in-memory cache on every machine. When working with Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when compared to an RDD lookup().

  1. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.

  1. What is the significance of Sliding Window operation?

Sliding Window controls transmission of data packets between various computer networks. Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
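A minimal sketch of a windowed computation (assuming a hypothetical pair DStream pairs of (word, 1) tuples, such as one built with lines.flatMap(_.split(" ")).map((_, 1))):

import org.apache.spark.streaming.Seconds

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce function
  Seconds(30),                 // window length
  Seconds(10)                  // sliding interval
)
windowedCounts.print()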

  1. What is a DStream in Apache Spark?

Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming. It is a continuous stream of data. It is received from a data source or from a processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs and each RDD contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs.

DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two operations:

  1. Transformations that produce a new DStream.
  2. Output operations that write data to an external system.

There are many DStream transformations possible in Spark Streaming. Let us look at filter(func). filter(func) returns a new DStream by selecting only the records of the source DStream on which func returns true.

  1. Explain Caching in Spark Streaming.

DStreams allow developers to cache/ persist the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times. This can be done using the persist() method on a DStream. For input streams that receive data over the network (such as Kafka, Flume, Sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault-tolerance.

  1. When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?

Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

  1. What are the various data sources available in Spark SQL?

Parquet file, JSON datasets and Hive tables are the data sources available in Spark SQL.

  1. What are the various levels of persistence in Apache Spark?

Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on the RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels.

The various storage/persistence levels in Spark are: 

  1. MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.
  2. MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on disk, and read them from there when they’re needed.
  3. MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition).
  4. MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing them on the fly each time they’re needed.
  5. DISK_ONLY: Store the RDD partitions only on disk.
  6. OFF_HEAP: Similar to MEMORY_ONLY_SER, but store the data in off-heap memory.
  1. Does Apache Spark provide checkpoints?

Checkpoints are similar to checkpoints in gaming. They allow an application to run 24/7 and make it resilient to failures unrelated to the application logic.

Lineage graphs are always useful to recover RDDs from a failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the decision on which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.

  1. How Spark uses Akka?

Spark uses Akka basically for scheduling. All the workers request a task from the master after registering, and the master simply assigns the task. Here, Spark uses Akka for messaging between the workers and masters.

  1. What do you understand by Lazy Evaluation?

Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget – but it does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated till you perform an action. This helps optimize the overall data processing workflow.

  1. What do you understand by SchemaRDD in Apache Spark RDD?

SchemaRDD is an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column. 

SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of code debugging and unit testing on SparkSQL core module. The idea can boil down to describing the data structures inside RDD using a formal description similar to the relational database schema. On top of all basic functions provided by common RDD APIs, SchemaRDD also provides some straightforward relational query interface functions that are realized through SparkSQL. 

Now, it is officially renamed to DataFrame API on Spark’s latest trunk.

  1. How is Spark SQL different from HQL and SQL?

Spark SQL is a special component on the Spark Core engine that supports SQL and the Hive Query Language without changing any syntax. It is possible to join SQL tables and HQL tables in Spark SQL.

  1. Explain a scenario where you will be using Spark Streaming.

When it comes to Spark Streaming, the data is streamed in real-time onto our Spark program.

Twitter Sentiment Analysis is a real-life use case of Spark Streaming. Trending Topics can be used to create campaigns and attract a larger audience. It helps in crisis management, service adjusting and target marketing.

Part 5:

  1. EXPLAIN SHARK.

Shark is a tool for people from a database background; it helps them access Scala MLlib capabilities through a Hive-like SQL interface.

  1. CAN YOU EXPLAIN THE MAIN FEATURES OF SPARK APACHE?
  • Supports several programming languages – Spark can be coded in four programming languages, i.e. Java, Python, R, and Scala. It also offers high-level APIs for them. Additionally, Apache Spark supplies Python and Scala shells.
  • Lazy Evaluation – Apache Spark uses the principle of lazy evaluation to postpone evaluation until it becomes completely mandatory.
  • Machine Learning – The MLlib machine learning component of Apache Spark is useful for extensive data processing. It removes the need for different engines for processing and machine learning.
  • Modern Format Assistance – Apache Spark supports multiple data sources, like Cassandra, Hive, JSON, and Parquet. The Data Sources API provides a pluggable framework for accessing structured data through Spark SQL.
  • Real-Time Computation – Spark is specifically developed to satisfy massive scalability criteria. Thanks to in-memory computing, Spark’s computing is real-time and has less delay.
  • Speed – Spark is up to 100x faster than Hadoop MapReduce for large-scale data processing. Apache Spark is capable of achieving this incredible speed through optimized partitioning. The general-purpose cluster-computing architecture handles data across partitions in a way that parallelizes distributed data processing with limited network traffic.
  • Hadoop Integration – Spark provides seamless access to Hadoop and is a possible substitute for the Hadoop MapReduce functions. Spark is capable of operating on top of an existing Hadoop cluster using YARN for scheduling resources.
  1. WHAT IS APACHE SPARK?

Apache Spark is a data processing framework that can perform processing tasks on extensive data sets quickly. This is one of the most frequently asked Apache Spark interview questions.

  1. EXPLAIN THE CONCEPT OF SPARSE VECTOR.

A vector is a one-dimensional array of elements. However, in many applications the vector elements are mostly zero; such vectors are said to be sparse.

  1. WHAT IS THE METHOD FOR CREATING A DATA FRAME?

A data frame can be generated using Hive tables and structured data files.

  1. EXPLAIN WHAT SCHEMARDD IS.

A SchemaRDD is similar to a table in a traditional relational database. A SchemaRDD can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

  1. EXPLAIN WHAT ACCUMULATORS ARE.

Accumulators are variables used to aggregate information across the executors.

  1. EXPLAIN WHAT THE CORE OF SPARK IS.

Spark Core is a basic execution engine on the Spark platform.

  1. EXPLAIN HOW DATA IS INTERPRETED IN SPARK?

Data can be interpreted in Apache Spark in three ways: RDD, DataFrame, and DataSet.

NOTE: These are some of the most frequently asked spark interview questions.

  1. HOW MANY FORMS OF TRANSFORMATIONS ARE THERE?

There are two forms of transformation: narrow transformations and wide transformations.

  1. WHAT’S PAIRED RDD?

A paired RDD is an RDD of key-value pairs.

  1. WHAT IS IMPLIED BY THE TREATMENT OF MEMORY IN SPARK?

In in-memory computing, we retain data in random access memory (RAM) instead of slower disk drives.

NOTE: It is important to know more about this concept as it is commonly asked in Spark Interview Questions.

  1. EXPLAIN THE DIRECTED ACYCLIC GRAPH.

A Directed Acyclic Graph (DAG) is a finite directed graph with no directed cycles.

  1. EXPLAIN THE LINEAGE CHART.

The lineage graph records how an RDD is derived from its parent RDDs as a whole.

  1. EXPLAIN LAZY EVALUATION IN SPARK.

Lazy evaluation, also known as call-by-need, is a strategy that defers computation until the result is actually required.

  1. EXPLAIN THE ADVANTAGE OF A LAZY EVALUATION.

It expands the program’s manageability and gives Spark the opportunity to optimize the overall data processing workflow.

  1. EXPLAIN THE CONCEPT OF “PERSISTENCE”.

RDD persistence is an optimization technique that saves the results of RDD evaluation so they can be reused.

  1. WHAT IS THE MAP-REDUCE FUNCTION?

MapReduce is a programming model used for processing vast amounts of data.

  1. WHEN PROCESSING INFORMATION FROM HDFS, IS THE CODE PERFORMED NEAR THE DATA?

Yes, in most situations it is. Spark creates executors close to the nodes that contain the data.

  1. DOES SPARK ALSO CONTAIN THE STORAGE LAYER?

No, it doesn’t have its own storage layer, but it lets you use many data sources.

These 20 Spark coding interview questions are some of the most important ones! Make sure you revise them before your interview!

  1. WHERE DOES THE SPARK DRIVER OPERATE ON YARN?

In YARN client mode, the Spark driver operates on the client machine; in cluster mode, it runs inside the YARN Application Master.

  1. HOW IS MACHINE LEARNING CARRIED OUT IN SPARK?

Machine learning is carried out in Spark with the help of MLlib. It’s a scalable machine learning library provided by Spark.

  1. EXPLAIN WHAT A PARQUET FILE IS.

Parquet is a columnar format file that is supported by many other data processing systems.

  1. EXPLAIN THE LINEAGE OF THE RDD.

RDD lineage means Spark does not duplicate records in memory; instead, it records how an RDD was built from other datasets so that lost partitions can be recomputed.

  1. EXPLAIN THE SPARK EXECUTOR.

Executors are worker nodes’ processes in charge of running individual tasks in a given Spark job.

  1. EXPLAIN THE MEANING OF A WORKER NODE.

A worker node corresponds to any node that can run the application code in a cluster.

  1. EXPLAIN THE SPARSE VECTOR.

A sparse vector has two parallel arrays, one for indices and the other for values.

  1. IS IT POSSIBLE TO RUN APACHE SPARK ON APACHE MESOS?

Yes, Apache Spark can run on hardware clusters managed by Mesos.

  1. EXPLAIN THE APACHE SPARK ACCUMULATORS.

Accumulators are variables that are only added to through an associative and commutative operation, typically to implement counters or sums.

  1. WHY IS THERE A NEED FOR BROADCAST VARIABLES WHILE USING APACHE SPARK?

Because they are read-only variables cached in memory on every machine, so a copy of the variable does not have to be shipped with every task.

  1. EXPLAIN THE IMPORTANCE OF SLIDING WINDOW OPERATIONS.

A sliding window controls the transmission of data packets between different computer networks; in Spark Streaming, transformations are applied over a sliding window of data.

  1. EXPLAIN THE DISCRETIZED STREAM OF APACHE SPARK.

A Discretized Stream is the fundamental abstraction provided by Spark Streaming.

Make sure you revise these Spark streaming interview questions before moving onto the next set of questions.

  1. STATE THE DISTINCTION BETWEEN SQL AND HQL.

Spark SQL is a component on top of the Spark Core engine that supports SQL and the Hive Query Language, whereas HQL (Hive Query Language) combines object-oriented programming concepts with relational database concepts.

NOTE: This is one of the most widely asked Spark SQL interview questions.

  1. EXPLAIN THE USE OF BLINK DB.

BlinkDB is a query engine that lets you run interactive, approximate SQL queries on large volumes of data.

  1. EXPLAIN THE NODE OF THE APACHE SPARK WORKER.

The node of a worker is any path that can run the application code in a cluster.

NOTE: This is one of the most crucial Spark interview questions for experienced candidates.

  1. EXPLAIN THE FRAMEWORK OF THE CATALYST.

The Catalyst Concept is a modern optimization framework in Spark SQL.

  1. DOES SPARK USE HADOOP?

Spark has its own cluster management and uses Hadoop only for storage.

  1. WHY DOES SPARK USE AKKA?

Spark simply uses Akka for scheduling.

  1. EXPLAIN THE WORKER NODE OR PATHWAY.

A node that can run the Spark program code in a cluster can be called a worker node.

  1. EXPLAIN WHAT YOU UNDERSTAND ABOUT THE RDD SCHEMA?

A SchemaRDD consists of row objects along with schema information about the type of data in each column.

  1. WHAT IS THE FUNCTION OF SPARK ENGINE?

The Spark engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.

  1. WHICH IS THE APACHE SPARK DEFAULT LEVEL?

The cache() method uses the default storage level, which is StorageLevel.MEMORY_ONLY.

  1. CAN YOU USE SPARK TO PERFORM THE ETL PROCESS?

Yes, Spark may be used for the ETL operation as Spark supports Java, Scala, R, and Python.

  1. WHICH IS THE NECESSARY DATA STRUCTURE OF SPARK?

The DataFrame is the essential data structure for fundamental Spark development.

  1. CAN YOU RUN APACHE SPARK ON APACHE MESOS?

Yes, Apache Spark can run on the hardware clusters that Mesos manages.

  1. EXPLAIN THE SPARK MLLIB.

MLlib is the name of Spark’s scalable machine learning library.

  1. EXPLAIN DSTREAM.

A DStream (Discretized Stream) is the high-level abstraction provided by Spark Streaming.

  1. WHAT IS ONE ADVANTAGE OF PARQUET FILES?

Parquet files are adequate for large-scale queries.

  1. EXPLAIN THE FRAMEWORK OF THE CATALYST.

Catalyst is a framework that represents and manipulates a DataFrame’s query plan graph.

  1. EXPLAIN THE SET OF DATA.

Spark Datasets is an extension of the Data Frame API.

  1. WHAT ARE DATAFRAMES?

DataFrames are a distributed collection of data organized into named columns.

  1. EXPLAIN THE CONCEPT OF THE RDD (RESILIENT DISTRIBUTED DATASET). ALSO, HOW CAN YOU BUILD RDDS IN APACHE SPARK?

The RDD, or Resilient Distributed Dataset, is a fault-tolerant collection of operational elements capable of running in parallel. The partitioned data in an RDD is distributed. There are two kinds of RDDs:

  1. Hadoop Datasets – Perform functions for each file record in HDFS (Hadoop Distributed File System) or other forms of storage structures.
  2. Parallelized Collections – Existing collections that are parallelized to run in parallel with each other.

There are two ways to build an RDD in Apache Spark:

  • By parallelizing a collection in the Driver program. This uses the parallelize() function of SparkContext.
  • Through accessing an arbitrary dataset from any external storage, including HBase, HDFS, and a shared file system.
  1. DEFINE SPARK.

Spark is a parallel system for data analysis. It allows a quick, streamlined big data framework to integrate batch, streaming, and immersive analytics.

  1. WHY USE SPARK?

Spark is a 3rd gen distributed data processing platform. It’s a centralized big data approach for big data processing challenges such as batch, interactive or streaming processing. It can ease a lot of big data issues.

  1. WHAT IS RDD?

The primary central abstraction of Spark is called a Resilient Distributed Dataset. Resilient Distributed Datasets are a set of partitioned data that fulfill certain characteristics: the popular RDD properties are immutable, distributed, lazily evaluated, and cacheable.

  1. THROW SOME LIGHT ON WHAT IS IMMUTABLE.

If a value has been generated and assigned, it cannot be changed. This attribute is called immutability. Spark is immutable by nature. It does not accept upgrades or alterations. Please notice that data storage is not immutable, but the data content is immutable.

  1. HOW CAN RDD SPREAD DATA?

RDD can dynamically spread data through various parallel computing nodes.

  1. WHAT ARE THE DIFFERENT ECOSYSTEMS OF SPARK?

Some typical Spark ecosystems are:

  • Spark SQL for developers of SQL
  • Spark Streaming for data streaming
  • MLLib for algorithms of machine learning
  • GraphX for computing of graph
  • SparkR to work on the Spark engine
  • BlinkDB, which enables dynamic queries of large data

GraphX, SparkR, and BlinkDB are in their incubation phase.

  1. WHAT ARE PARTITIONS?

A partition is a logical division of records, an idea taken from MapReduce (split), in which logical data is directly obtained to process the data. Small chunks of data also help in scalability and speed up the operation. Input data, output data, and intermediate data are all partitioned RDDs.

  1. HOW DOES SPARK PARTITION DATA?

Spark uses the MapReduce API for the data partition. One may construct several partitions in the input format. The HDFS block size is the default partition size (for optimum performance), but it is possible to adjust partition sizes, like splits.

  1. HOW DOES SPARK STORE DATA?

Spark is a computing machine without a storage engine in place. It can recover data from any storage engine, such as HDFS, S3, and other data services.

  1. IS IT OBLIGATORY TO LAUNCH THE HADOOP PROGRAM TO RUN A SPARK?

It is not obligatory. Spark has no storage of its own, so you can use the local file system to store files. You may load and process data from a local device. Hadoop or HDFS is not needed to run a Spark program.

  1. WHAT’S SPARKCONTEXT?

SparkContext is the entry point to a Spark cluster: when the programmer creates RDDs, the SparkContext object connects the application to the cluster. SparkContext tells Spark how to navigate the cluster, while SparkConf is the central element for creating the programmer’s application, since it holds the configuration used to build the SparkContext.

  1. HOW IS SPARKSQL DIFFERENT FROM HQL AND SQL?

SparkSQL is a special part of the SparkCore engine that supports SQL and HiveQueryLanguage without modifying syntax. You can join SQL tables and HQL tables.

  1. WHEN IS SPARK STREAMING USED?

It is an API used for streaming data and processing it in real-time. Spark streaming collects streaming data from various services, such as web server log files, data from social media, stock exchange data, or Hadoop ecosystems such as Kafka or Flume.

  1. HOW DOES THE SPARK STREAMING API WORK?

The programmer needs to set a specific time in the setup, during which the data that goes into the Spark is separated into batches. The input stream (DStream) goes into the Spark stream. 

The framework splits into little pieces called batches, then feeds into the Spark engine for processing. The Spark Streaming API sends the batches to the central engine. Core engines can produce final results in the form of streaming batches. Production is in the form of batches, too. It allows the streaming of data and batch data for processing.

  1. WHAT IS GRAPHX?

GraphX is the Spark API for manipulating graphs and collections. It unifies ETL, analysis, and iterative graph computation. Being one of the fastest graph systems, it offers fault tolerance and ease of use without the need for special expertise.

  1. WHAT IS FILE SYSTEM API?

The File System API can read data from various storage devices, such as HDFS, S3, or Local FileSystem. Spark utilizes the FS API to read data from multiple storage engines.

  1. WHY ARE PARTITIONS IMMUTABLE?

Each transformation creates a new partition. Partitions use the HDFS API so that each partition is immutable, distributed, and fault-tolerant. Partitions are, therefore, aware of the data’s location.

  1. DISCUSS WHAT IS FLATMAP AND MAP IN SPARK.

A map processes a single element (a line or row) of the data and returns exactly one output element. In flatMap, each input object can be mapped to zero or more output items (so the function should return a Seq rather than a unitary item). So most often, flatMap is used to return the components of an array, for example when splitting lines into words.

  1. DEFINE BROADCAST VARIABLES.

Broadcast variables allow the programmer to have a read-only variable cached on each computer instead of sending a copy of it with tasks. Spark embraces two kinds of mutual variables: broadcast variables and accumulators. Broadcast variables are stored as Array Buffers, which deliver read-only values to the working nodes.

  1. WHAT ARE SPARK ACCUMULATORS IN CONTEXT TO HADOOP?

Off-line Spark debuggers are called accumulators. Spark accumulators are equivalent to Hadoop counters and can count the number of activities. Only the driver program can read the value of the accumulator, not the tasks.

  1. WHEN CAN APACHE SPARK BE USED? WHAT ARE THE ADVANTAGES OF SPARK OVER MAPREDUCE?

Spark is quite fast. Programs run up to 100x faster than Hadoop MapReduce in memory. It appropriately uses RAM to achieve quicker performance. 

In the MapReduce paradigm, you write many MapReduce tasks and then link these tasks together using the Oozie/shell script. This process is time-intensive, and MapReduce jobs have high latency.

Frequently, converting production from one MR job to another MR job can entail writing another code since Oozie might not be enough.

In Spark, you can do anything using a single application/console and get the output instantly. Switching between ‘Running something on a cluster’ and ‘doing something locally’ is pretty simple and straightforward. All this leads to a lower background transition for the creator and increased efficiency.

Spark sort of equals MapReduce and Oozie when put in conjunction.

The above-mentioned Spark Scala interview questions are pretty popular and are a compulsory read before you go for an interview.

  1. IS THERE A POINT OF MAPREDUCE LEARNING?

Yes. It serves the following purposes:

  • MapReduce is a paradigm put to use by several big data tools, including Spark. So learning the MapReduce model and transforming a problem into a sequence of MR tasks is critical.
  • When data expands beyond what can fit into the cluster memory, the Hadoop Map-Reduce model becomes very important.
  • Almost every other tool, such as Hive or Pig, transforms the query to MapReduce phases. If you grasp the Mapreduce, you would be better able to refine your queries.
  1. WHAT ARE THE DRAWBACKS OF SPARK?

Spark uses memory. The developer needs to be cautious about this. Casual developers can make the following mistakes:

  • It might end up running everything on the local node instead of spreading work to the cluster.
  • It could reach some web services too many times by using multiple clusters.
  • The first dilemma is well addressed by the Hadoop Map reduce model.
  • A second error is also possible in Map-Reduce. When writing Map-Reduce, the user can touch the service from the inside of the map() or reduce() too often. This server overload is also likely when using Spark.

 

By bpci