Introduction to Apache Spark, Spark SQL, and Spark MLlib

In this blog, I focus on Apache Spark: its features and limitations, its architecture, and the architecture of Spark SQL and Spark MLlib. Let’s start by understanding what Apache Spark is.

What is Apache Spark?

Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.

Why Apache Spark?

  1. Fast processing: Spark’s Resilient Distributed Datasets (RDDs) save time on read and write operations, so Spark can run roughly ten to a hundred times faster than Hadoop MapReduce on some workloads.
  2. In-memory computing: In Spark, data can be kept in RAM, so it can be accessed quickly, which speeds up analytics.
  3. Flexible: Spark supports multiple languages and allows developers to write applications in Java, Scala, R, or Python.
  4. Fault tolerance: Spark’s Resilient Distributed Datasets (RDDs) are designed to handle the failure of any worker node in the cluster. This keeps data loss to a minimum, ideally zero.
  5. Better analytics: Spark has rich support for SQL queries, machine learning algorithms, and complex analytics. With all these functionalities, analytics can be performed better.

Shortcomings of MapReduce:

  1. Forces your data processing into Map and Reduce
    • Other workflows missing include join, filter, flatMap, groupByKey, union, intersection, …
  2. Based on “Acyclic Data Flow” from Disk to Disk (HDFS)
  3. Read and write to Disk before and after Map and Reduce (stateless machine)
    • Not efficient for iterative tasks, e.g. machine learning
  4. Only Java natively supported
  5. Only for batch processing
    • No support for interactivity or streaming data

How does Apache Spark solve these shortcomings?

  1. Capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, S3, …
  2. Has many other workflows, e.g. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first…
  3. In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.)
  4. Native Scala, Java, Python, and R support
  5. Supports interactive shells for exploratory data analysis
  6. Spark API is extremely simple to use
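
To give a feel for the workflows named above, here is a minimal sketch in plain Python (no Spark involved) that imitates flatMap, filter, map, and reduceByKey on an ordinary list. The helpers flat_map and reduce_by_key and the sample lines are hypothetical names made up for this illustration; real Spark runs the same kind of steps in parallel across a cluster.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Merge the values of each key with func (a local stand-in for reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical sample input for the sketch.
lines = ["spark is fast", "hadoop is batch", "spark is flexible"]

words = flat_map(lambda line: line.split(), lines)   # flatMap: lines -> words
long_words = [w for w in words if len(w) > 2]        # filter: drop short words
pairs = [(w, 1) for w in long_words]                 # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)    # reduceByKey: sum per word

print(sorted(counts.items()))
```

Each step takes a collection and returns a new one, which is why such operations chain together more naturally than a fixed Map step followed by a fixed Reduce step.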


Architecture of Spark

Spark is an accessible, powerful, and efficient big data tool for handling a wide range of large-scale data challenges. Apache Spark follows a master/slave architecture with two main daemons and a cluster manager:

  1. Master Daemon – (Master/Driver Process)
  2. Worker Daemon – (Slave Process)


A Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as their own Java processes, and users can run them on the same machine or on separate machines. Below are the three ways of deploying Spark with Hadoop components:

  1. Standalone – In a standalone deployment, Spark sits on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
  2. Hadoop YARN – In a YARN deployment, Spark simply runs on YARN with no pre-installation or root access required. This integrates Spark into the Hadoop ecosystem (the Hadoop stack) and allows other components to run on top of the stack, with an explicit allocation for HDFS.
  3. Spark in MapReduce (SIMR) – SIMR is used to launch Spark jobs in addition to the standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.


Abstractions in Spark Architecture

In this architecture, all the components and layers are loosely coupled. These components are integrated with several extensions as well as libraries. The Spark architecture is based on two main abstractions:

  1. Resilient Distributed Datasets (RDD): An RDD is a collection of objects that is logically partitioned.
    1. It supports in-memory computation over the Spark cluster. Spark RDDs are immutable in nature.
    2. It supports Hadoop datasets and parallelized collections.
    3. Hadoop datasets are created from files stored on HDFS. Parallelized collections are based on existing Scala collections.
    4. Because RDDs are immutable, they offer two kinds of operations: transformations and actions.
  2. Directed Acyclic Graph (DAG):
    1. Directed – Each edge points from one node to another, which creates a sequence.
    2. Acyclic – There is no cycle or loop.
    3. Graph – A combination of vertices and edges, with all the connections in a sequence.
    4. We can call it a sequence of computations performed on data. In this graph, each edge refers to a transformation on the data, while each vertex refers to an RDD partition. This eliminates the multistage execution model of Hadoop MapReduce and performs more efficiently than Hadoop.


Spark Execution Flow:

  1. Application Jar: The user program and its dependencies, excluding the Hadoop and Spark jars, bundled into a single jar file.
  2. Driver Program: The process that starts the execution (runs the main() function)
    1. The main process, coordinated by the SparkContext object
    2. Allows you to configure any Spark process with specific parameters
    3. Spark actions are triggered from the driver, which receives their results
  3. Cluster Manager: An external service that manages resources on the cluster (e.g. YARN)
    1. An external service for acquiring resources on the cluster
    2. A variety of cluster managers are available, such as Local, Standalone, and YARN
  4. Deploy Mode
    1. Cluster: The framework launches the driver inside the cluster.
    2. Client: The driver runs outside the cluster.
  5. Worker Node: Any node that runs the application program in the cluster.
  6. Executor: A process launched for an application on a worker node; it runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
  7. Task: A unit of work that is sent to an executor.
  8. Job: A parallel computation consisting of multiple tasks that is spawned in response to a Spark action.
  9. Stage: Each job is divided into smaller sets of tasks called stages, which run in sequence and depend on each other.
  10. SparkContext:
    1. Main entry point for Spark functionality
    2. Represents the connection to a Spark cluster
    3. Tells Spark how and where to access a cluster
    4. Can be used to create RDDs, accumulators, and broadcast variables on that cluster
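
The relationship between a job, its stages, and its tasks can be sketched in plain Python. This is a simplified toy model, not how Spark's scheduler is implemented: the WIDE_OPS set, the split_into_stages and plan_tasks helpers, and the fixed partition count are all assumptions made for the illustration. It cuts a linear chain of operations into stages wherever a shuffle would be needed, then creates one task per partition per stage.

```python
# Hypothetical classification of shuffle-producing operations, for this sketch only.
WIDE_OPS = {"groupByKey", "reduceByKey", "sortByKey"}


def split_into_stages(operations):
    """Cut a linear chain of operations into stages at each shuffle boundary."""
    stages, current = [], []
    for op in operations:
        if op in WIDE_OPS and current:
            stages.append(current)  # a shuffle is needed: close the running stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


def plan_tasks(stages, num_partitions):
    """Create one (stage_id, partition) task per partition per stage."""
    return [
        (stage_id, partition)
        for stage_id in range(len(stages))
        for partition in range(num_partitions)
    ]


# A job is triggered by an action; here it is just the chain of operations behind it.
job = ["textFile", "flatMap", "map", "reduceByKey", "filter", "sortByKey"]
stages = split_into_stages(job)
tasks = plan_tasks(stages, num_partitions=4)

print(stages)
print(len(tasks))
```

In this model the job has three stages and twelve tasks. Real Spark builds stages from the full dependency graph and may use a different number of partitions in each stage.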

Resilient Distributed Dataset (RDD)

The Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel.

Main characteristics of RDD

  1. A list of partitions
  2. A function for computing each split
  3. A list of dependencies on other RDDs
  4. Optionally, a partitioner for key-value RDDs
  5. A list of preferred locations to compute each split on

Custom RDDs can also be implemented (by overriding these functions).


RDDs support two types of operations:

  1. Transformations: These are functions that accept existing RDDs as input and output one or more new RDDs. The data in the existing RDD does not change, because RDDs are immutable.
    1. Transformations are lazy: they are not executed when they are declared, only when an action is invoked. Every time a transformation is applied, a new RDD is created.
  2. Actions: These are functions that return the end result of RDD computations. An action uses the lineage graph to load the data into the RDD and apply the transformations in order. After all of the transformations are done, the action returns the final result to the Spark driver. Actions are operations that produce non-RDD values.
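
The difference between transformations and actions can be imitated with a small single-machine class. The ToyRDD class below is a hypothetical stand-in written for this post, not the Spark API: its transformations only record what should be done and return a new object, while its actions replay the recorded steps and return a plain Python value.

```python
class ToyRDD:
    """A tiny, single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # immutable copy of the input data
        self._lineage = tuple(lineage)  # recorded transformations, not yet run

    # Transformations: return a NEW ToyRDD and do no work yet.
    def map(self, func):
        return ToyRDD(self._source, self._lineage + (("map", func),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    # Actions: replay the lineage and return a plain (non-RDD) value.
    def collect(self):
        data = list(self._source)
        for kind, func in self._lineage:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        return data

    def count(self):
        return len(self.collect())


calls = []


def square(x):
    calls.append(x)  # record when the function is actually executed
    return x * x


numbers = ToyRDD([1, 2, 3, 4, 5])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

print(len(calls))               # 0: nothing has run yet
print(evens_squared.collect())  # the action triggers the computation
print(numbers.collect())        # the parent data is unchanged
```

Unlike this toy, Spark distributes the partitions of an RDD across executors and can cache intermediate results in memory.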

Features of RDD

  • An RDD in Apache Spark is an immutable collection of objects that is computed on different nodes of the cluster.
  • Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and so able to recompute partitions that are missing or damaged due to node failures.
  • Distributed, since the data resides on multiple nodes.
  • Dataset represents the records of the data you work with. The user can load the dataset externally from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
  • Every dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. they recover on their own in the case of failure.
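
The idea of recovering a lost partition from its lineage can be sketched in a few lines of plain Python. Everything here is a simplified assumption for illustration: the source list stands in for durable input such as a file on HDFS, the lineage is a list of two made-up functions, and a dictionary plays the role of the partitions held in memory by worker nodes.

```python
def build_partitions(data, num_partitions):
    """Split data into num_partitions slices (round-robin)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


source = build_partitions(list(range(10)), 3)  # stands in for durable input
lineage = [lambda x: x + 1, lambda x: x * 10]   # transformations, in order


def compute_partition(index):
    """Recompute one partition by replaying the lineage on its source slice."""
    part = source[index]
    for func in lineage:
        part = [func(x) for x in part]
    return part


# Partitions held "in memory" by the workers.
cached = {i: compute_partition(i) for i in range(len(source))}
expected = cached[1]

del cached[1]                     # simulate losing the node that held partition 1
cached[1] = compute_partition(1)  # only the lost partition is recomputed

print(cached[1] == expected)
```

Because the lineage describes how each partition was derived, only the missing piece has to be rebuilt; the other partitions are left untouched.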

Dependencies of RDD

  1. Narrow dependencies: Each partition of the parent RDD is used by at most one partition of the child RDD. Tasks can be executed locally and we don’t have to shuffle.
    1. Pipelined execution
    2. Efficient recovery
    3. No shuffling
  2. Wide dependencies: Multiple child partitions may depend on one partition of the parent RDD. This means we have to shuffle data unless the parents are hash-partitioned.
    1. Requires shuffling unless parents are hash-partitioned
    2. Expensive to recover
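
A toy example in plain Python may make the two kinds of dependency more concrete. The functions map_partitions and group_by_key below are hypothetical helpers for this sketch, and partitions are modelled as nested lists; the modulo-based partitioner only works for the integer keys used here.

```python
def map_partitions(partitions, func):
    """Narrow: output partition i depends only on input partition i."""
    return [[func(x) for x in part] for part in partitions]


def group_by_key(partitions, num_out):
    """Wide: every output partition may need records from every input partition."""
    out = [dict() for _ in range(num_out)]
    for part in partitions:                  # read all parent partitions
        for key, value in part:
            target = out[key % num_out]      # toy partitioner for integer keys
            target.setdefault(key, []).append(value)
    return out


# Two parent partitions of (key, value) records.
parent = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

upper = map_partitions(parent, lambda kv: (kv[0], kv[1].upper()))
grouped = group_by_key(upper, num_out=2)

print(upper)
print(grouped)
```

The map step can run on each partition independently, while grouping key 1 has to pull values from both parent partitions, which is the data movement that a shuffle performs.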


How to create an RDD?

  1. By loading an external dataset: You can load an external file into an RDD. The file types you can load include CSV, TXT, JSON, etc.
  2. By parallelizing a collection of objects: When Spark’s parallelize method is applied to a group of elements, a new distributed dataset is created. This dataset is an RDD.
  3. By performing transformations on existing RDDs: One or more RDDs can be created by performing transformations on existing RDDs, as mentioned earlier.

Spark SQL

Spark SQL integrates relational processing with Spark’s functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.

Spark SQL originated as an effort to run Apache Hive on top of Spark and is now integrated with the Spark stack. Apache Hive had certain limitations, and Spark SQL was built to overcome these drawbacks and replace Apache Hive.

Spark SQL is generally faster than Hive when it comes to processing speed. Spark SQL is an Apache Spark module used for structured data processing, which:

  1. Acts as a distributed SQL query engine
  2. Provides DataFrames as a programming abstraction
  3. Allows you to query structured data in Spark programs
  4. Can be used with languages such as Scala, Java, R, and Python.

Features of Spark SQL:

  1. Integrated − Seamlessly mix SQL queries with Spark programs. Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Python, Scala and Java. This tight integration makes it easy to run SQL queries alongside complex analytic algorithms.
  2. Unified Data Access − Load and query data from a variety of sources. SchemaRDDs (called DataFrames in later Spark versions) provide a single interface for efficiently working with structured data, including Apache Hive tables, Parquet files, and JSON files.
  3. Hive Compatibility − Run unmodified Hive queries on existing warehouses. Spark SQL reuses the Hive frontend and MetaStore, giving you full compatibility with existing Hive data, queries, and UDFs. Simply install it alongside Hive.
  4. Standard Connectivity − Connect through JDBC or ODBC. Spark SQL includes a server mode with industry standard JDBC and ODBC connectivity.
  5. Scalability − Use the same engine for both interactive and long queries. Spark SQL takes advantage of the RDD model to support mid-query fault tolerance, letting it scale to large jobs too. Do not worry about using a different engine for historical data.

Limitations of Apache Hive:

  1. Hive uses MapReduce, which lags in performance on small and medium-sized datasets.
  2. No resume capability: if processing fails partway through a workflow, it cannot be resumed from where it stopped.
  3. Hive cannot drop an encrypted database.

Spark SQL was built to overcome these limitations of Apache Hive by running on top of Spark. Spark SQL uses the metastore services of Hive to query the data stored and managed by Hive.


Spark SQL Architecture

  1. Language API: Spark is compatible with several languages, including Python, Scala, and Java.
  2. Schema RDD: As Spark SQL works on schemas, tables, and records, you can use a Schema RDD or DataFrame as a temporary table.
  3. Data Sources: Spark SQL supports multiple data sources, such as JSON, Cassandra databases, and Hive tables.
  4. Data Source API: used to read and store structured and semi-structured data in Spark SQL:
    1. Structured/semi-structured data
    2. Multiple formats
    3. Third-party integrations
  5. DataFrame API: converts the data read through the Data Source API into tabular columns so that SQL operations can be performed on it.
    1. Distributed collection of data organized into named columns
    2. Equivalent to a relational table in SQL
    3. Lazily evaluated
  6. SQL Interpreter and Optimizer: handles the functional programming part of Spark SQL. It transforms the DataFrames and RDDs to get the required results in the required formats.
    1. Functional programming
    2. Transforming trees
    3. Faster than RDDs
    4. Processes data of all sizes


The SQL service is the entry point for working with structured data in Spark, and is used to fetch the result from the interpreted and optimized data.
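
The optimizer's "transforming trees" idea can be illustrated with a very small rule written in plain Python. This is only a sketch of the general technique (rewriting an expression tree with a rule), not Spark SQL's actual optimizer code; the Literal, Column, and Add node classes and the fold_constants rule are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(node):
    """Rewrite the tree bottom-up, replacing Add(Literal, Literal) with one Literal."""
    if isinstance(node, Add):
        left = fold_constants(node.left)
        right = fold_constants(node.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return node  # literals and columns are left as they are


# Expression tree for: price + (1 + 2)
expr = Add(Column("price"), Add(Literal(1), Literal(2)))
optimized = fold_constants(expr)

print(optimized)
```

The rule returns a new tree instead of modifying the old one, which is the functional style the list above refers to; an optimizer applies many such rules until the plan stops changing.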

Spark MLlib

MLlib stands for Machine Learning library in Spark. The goal of this library is to make practical machine learning scalable and easy to implement.

It contains fast and scalable implementations of standard machine learning algorithms. Through Spark MLlib, data engineers and data scientists have access to different types of statistical analysis, linear algebra, and various optimization primitives. Spark’s machine learning library MLlib covers the following applications:

  1. Collaborative Filtering for Recommendations – Alternating Least Squares
  2. Regression for Predictions – Linear Regression, Lasso Regression, and Ridge Regression.
  3. Clustering – Latent Dirichlet Allocation (LDA), K-Means, and Gaussian Mixture.
  4. Classification Algorithms – Logistic Regression, Support Vector Machines (SVM), Naïve Bayes, Ensemble Methods, and Decision Trees.
  5. Dimensionality Reduction – PCA (Principal Component Analysis) and Singular Value Decomposition (SVD).

Benefits of Spark MLlib:

  • Spark MLlib is tightly integrated on top of Spark, which eases the development of efficient large-scale machine learning algorithms, as these are usually iterative in nature.
  • MLlib is easy to deploy and does not require any pre-installation if a Hadoop 2 cluster is already installed and running.
  • Spark MLlib’s scalability, simplicity, and language compatibility help data scientists solve iterative data problems faster.
  • MLlib is reported to provide large performance gains (about 10 to 100 times faster than Hadoop and Apache Mahout).


Features of Spark MLlib Library:

  • MLlib provides algorithmic optimisations for accurate predictions and efficient distributed learning.
    • For instance, the alternating least squares algorithm for making recommendations uses blocking to reduce JVM garbage collection overhead.
  • MLlib benefits from its tight integration with various Spark components.
  • MLlib provides a package to simplify the development and performance tuning of multi-stage machine learning pipelines.
    • MLlib provides high-level APIs that let data scientists swap a standard learning approach in place of their own specialized machine learning algorithms.
  • MLlib provides fast and distributed implementations of common machine learning algorithms along with a number of low-level primitives and various utilities for statistical analysis, feature extraction, convex optimizations, and distributed linear algebra.
  • Spark MLlib has extensive documentation that describes all the supported utilities and methods, with several Spark machine learning code examples and API docs for all the supported languages.

Spark MLlib Tools:

  1. ML Algorithms: classification, regression, clustering, and collaborative filtering
  2. Featurization: feature extraction, transformation, dimensionality reduction, and selection
  3. Pipelines: tools for constructing, evaluating, and tuning ML pipelines
  4. Persistence: saving and loading algorithms, models and pipelines
  5. Utilities: linear algebra, statistics, data handling


Machine learning pipeline components:

  1. DataFrame: The ML API uses the DataFrame from Spark SQL as a dataset, which can hold a variety of data types.
  2. Transformer: This is used to transform one DataFrame into another DataFrame. Examples are:
    1. Hashing Term Frequency: This calculates how often words occur.
    2. Logistic Regression Model: The model that results from training logistic regression on a dataset.
    3. Binarizer: This converts values to 1 or 0 based on a given threshold.
  3. Estimator: This is an algorithm that can be fit on a DataFrame to produce a Transformer. Examples are:
    1. Logistic Regression: It is used to determine the weights for the resulting Logistic Regression Model by processing the DataFrame.
    2. Standard Scaler: It is used to calculate the standard deviation (and mean) used to scale the data.
    3. Pipeline: Calling fit on a pipeline produces a pipeline model, and the pipeline model contains only transformers and no estimators.
  4. Pipeline: A pipeline chains multiple Transformers and Estimators together to specify the ML workflow.
  5. Parameters: Transformers and Estimators share a common API for specifying parameters.
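
How Transformers, Estimators, and Pipelines fit together can be sketched with a few small classes in plain Python. This is a toy model, not the MLlib API: the classes borrow MLlib-style names but are written from scratch here, they work on plain lists of numbers instead of DataFrames, and the scaler uses the population standard deviation as a simplifying assumption.

```python
from statistics import mean, pstdev


class Binarizer:
    """Transformer: maps each value to 1.0 if above the threshold, else 0.0."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [1.0 if x > self.threshold else 0.0 for x in rows]


class StandardScalerModel:
    """Transformer produced by fitting StandardScaler."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def transform(self, rows):
        return [(x - self.mu) / self.sigma for x in rows]


class StandardScaler:
    """Estimator: learns the mean and standard deviation from the data."""

    def fit(self, rows):
        return StandardScalerModel(mean(rows), pstdev(rows))


class PipelineModel:
    """Fitted pipeline: holds Transformers only."""

    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, rows):
        for t in self.transformers:
            rows = t.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fit() turns every Estimator into its fitted Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # output of this stage feeds the next
            fitted.append(model)
        return PipelineModel(fitted)


data = [2.0, 4.0, 6.0, 8.0]
model = Pipeline([StandardScaler(), Binarizer(threshold=0.0)]).fit(data)

print(model.transform(data))
```

Calling fit on the pipeline replaces the StandardScaler estimator with its fitted StandardScalerModel, so the resulting pipeline model contains transformers only.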


