Overview of Big Data & Spark

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

Big data has become viable as cost-effective approaches have emerged to tame the volume, velocity, and variability of massive data.

We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 4.4 zettabytes in 2013 and forecast tenfold growth to 44 zettabytes by 2020. A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s more than one disk drive for every person in the world.
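
The unit conversions above can be checked with a few lines of arithmetic. This is only a small sketch that assumes decimal (SI) prefixes; the constant and variable names are illustrative.

```python
# Decimal (SI) storage units, expressed in bytes.
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21

# One zettabyte expressed in smaller units.
zb_in_exabytes = ZETTABYTE // EXABYTE    # one thousand
zb_in_petabytes = ZETTABYTE // PETABYTE  # one million
zb_in_terabytes = ZETTABYTE // TERABYTE  # one billion

# The 2013 estimate of 4.4 zettabytes, expressed in terabytes.
digital_universe_2013_tb = 4.4 * zb_in_terabytes

print(zb_in_exabytes, zb_in_petabytes, zb_in_terabytes)
```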

This flood of data is coming from many sources. Consider the following:

  • The New York Stock Exchange generates about 4-5 terabytes of data per day.
  • Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
  • Ancestry.com, the genealogy site, stores around 10 petabytes of data.
  • The Internet Archive stores around 18.5 petabytes of data.

Apache Hadoop

Apache Hadoop has been the driving force behind the growth of the big data industry. Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure, on commodity-class hardware in a highly fault-tolerant, scalable, and flexible way.

The Hadoop ecosystem offers tools that make Hadoop programming easier. One of them is Hive, which enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax.

Hadoop also provides MapReduce, a programming model and framework for parallel, fault-tolerant processing.
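
To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and everything runs sequentially in one process, whereas a real cluster would run map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Sequential stand-in for what a cluster would do in parallel.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "spark is fast"])
print(counts)
```

The split into three small functions mirrors the three phases of the model: only the map and reduce functions are application-specific, while the grouping step in between is generic.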

Introduction to Spark

For organizations that used MapReduce, brand new applications could be built using existing data. However, the MapReduce engine made it both challenging and inefficient to build large applications. For example, a typical machine learning algorithm might need to make 10 or 20 passes over the data, and in MapReduce each pass had to be written as a separate MapReduce job, which had to be launched separately on the cluster and load the data from scratch.

To address this problem, Spark was designed with an API based on functional programming that can succinctly express multistep applications, running on an engine that performs efficient, in-memory data sharing across computation steps.

Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open source engine for this task, making it a standard tool for any developer or data scientist interested in big data.

Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and to scale up to big data processing at incredibly large scale.

Spark’s key driving goal is to offer a unified platform for writing big data applications. Spark is designed to support a wide range of data analytics tasks, ranging from simple data loading and SQL queries to machine learning and streaming computation, over the same computing engine and with a consistent set of APIs. The main insight behind this goal is that real-world data analytics tasks tend to combine many different processing types and libraries.

Iterative Operations on MapReduce vs Spark RDD

Multi-stage applications reuse intermediate results across multiple computations. The following illustration shows how MapReduce handles iterative operations: intermediate results go through stable storage between steps. This incurs substantial overhead due to data replication, disk I/O, and serialisation, which makes the system slow.

[Ref :  https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm]

The illustration given below shows iterative operations on a Spark RDD. Spark stores intermediate results in distributed memory instead of stable storage (disk), which makes the system faster.

[Ref :  https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm]
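
The difference between the two approaches can be sketched by counting storage accesses. The code below is a plain-Python simulation, not Spark or Hadoop code: SimulatedDisk, step, iterate_via_disk, and iterate_in_memory are hypothetical names, and the "computation" simply halves every value.

```python
class SimulatedDisk:
    """Hypothetical stand-in for stable storage that counts reads and writes."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self.data)

    def write(self, data):
        self.writes += 1
        self.data = list(data)


def step(values):
    """One iteration of an illustrative computation: halve every value."""
    return [v / 2 for v in values]


def iterate_via_disk(disk, iterations):
    # MapReduce-style: every pass reads its input from disk and writes its output back.
    for _ in range(iterations):
        disk.write(step(disk.read()))
    return disk.read()


def iterate_in_memory(disk, iterations):
    # Spark-style: read once, then keep intermediate results in memory between passes.
    values = disk.read()
    for _ in range(iterations):
        values = step(values)
    return values


disk_a = SimulatedDisk([8.0, 16.0])
disk_b = SimulatedDisk([8.0, 16.0])
result_a = iterate_via_disk(disk_a, 3)
result_b = iterate_in_memory(disk_b, 3)
print(result_a, disk_a.reads, disk_a.writes)
print(result_b, disk_b.reads, disk_b.writes)
```

Both strategies produce the same result, but in this simulation the disk-based version touches storage on every pass while the in-memory version reads the input only once.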

Spark & its Features

Apache Spark is an open source cluster computing framework for real-time data processing. The main feature of Apache Spark is its in-memory cluster computing that increases the processing speed of an application. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.

  • Speed
    Spark can run up to 100 times faster than Hadoop MapReduce for large-scale data processing. It achieves this speed in part through controlled partitioning.
  • Powerful Caching
    A simple programming layer provides powerful caching and disk persistence capabilities.
  • Deployment
    It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.
  • Real-Time
    It offers real-time computation and low latency because of in-memory computation.
  • Polyglot
    Spark provides high-level APIs in Java, Scala, Python, and R. Spark code can be written in any of these four languages. It also provides a shell in Scala and Python.

Spark Components

Spark components are what make Apache Spark fast and reliable. A lot of these Spark components were built to resolve the issues that cropped up while using Hadoop MapReduce. 

Apache Spark has the following components:

  • Spark Core
  • Spark Streaming
  • Spark SQL
  • GraphX
  • MLlib (Machine Learning)
[Ref : https://www.oreilly.com/library/view/learning-spark/9781449359034/ch01.html]

Spark Core

Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Additional libraries built atop the core allow diverse workloads for streaming, SQL, and machine learning. Spark Core is responsible for:

  • Memory management and fault recovery.
  • Scheduling, distributing and monitoring jobs on a cluster.
  • Interacting with storage systems.

Spark Streaming

Spark Streaming is the component of Spark used to process real-time streaming data, which makes it a useful addition to the core Spark API. It enables high-throughput, fault-tolerant processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) that together represent the real-time data.
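
The idea of a stream as a series of small batches can be sketched in plain Python. This is not the Spark Streaming API: micro_batches is a hypothetical helper, each yielded list merely plays the role of one RDD in the series, and the events are assumed to arrive sorted by timestamp.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length time windows.

    Assumes events are sorted by timestamp. Windows with no events yield
    an empty batch, so the series has one batch per interval.
    """
    batch, window_end = [], None
    for timestamp, value in events:
        if window_end is None:
            window_end = timestamp + batch_interval
        # Close every window that ended before this event arrived.
        while timestamp >= window_end:
            yield batch
            batch, window_end = [], window_end + batch_interval
        batch.append(value)
    if batch:
        yield batch


# Illustrative events: (second, event type).
events = [(0, "click"), (0, "view"), (1, "click"), (3, "click")]
batches = list(micro_batches(events, batch_interval=1))

# The same batch-level computation is applied to every batch in the series.
counts_per_batch = [len(batch) for batch in batches]
print(batches)
print(counts_per_batch)
```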

Spark SQL

Spark SQL is a new module in Spark which integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those familiar with RDBMSs, Spark SQL will be an easy transition from your earlier tools, and it lets you extend the boundaries of traditional relational data processing.

Spark SQL also supports various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool.
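
The sketch below illustrates the general pattern of weaving a SQL query with a code transformation. It deliberately uses Python's built-in sqlite3 module as a stand-in, not Spark SQL, so it shows the idea rather than Spark's API. The trips table, its rows, and the add_surcharge function are made up for illustration.

```python
import sqlite3

# In-memory database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Pune", 120.0), ("Pune", 80.0), ("Delhi", 300.0)],
)

# Relational step: aggregate with SQL.
rows = conn.execute(
    "SELECT city, SUM(fare) FROM trips GROUP BY city ORDER BY city"
).fetchall()
conn.close()


# Functional step: transform the query result with ordinary code.
def add_surcharge(total, rate=0.25):
    return total * (1 + rate)


fares_with_surcharge = {city: add_surcharge(total) for city, total in rows}
print(fares_with_surcharge)
```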

Spark Architecture Overview

Apache Spark has a well-defined layered architecture in which all the Spark components and layers are loosely coupled. This architecture is further integrated with various extensions and libraries. The Apache Spark architecture is based on two main abstractions:

  • Resilient Distributed Dataset (RDD)
  • Directed Acyclic Graph (DAG)
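
The two abstractions can be illustrated with a toy class in plain Python. LazyDataset is a hypothetical stand-in, not Spark's RDD class: each transformation only records its parent and its function, and nothing is computed until an action (collect) walks that recorded chain. The chain here is linear, which is the simplest form of a DAG.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records how to compute data instead of computing it eagerly."""

    def __init__(self, source=None, parent=None, func=None, name="source"):
        self.source = source
        self.parent = parent
        self.func = func
        self.name = name

    def map(self, f):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(parent=self, func=lambda data: [f(x) for x in data], name="map")

    def filter(self, pred):
        # Transformation: also lazy.
        return LazyDataset(
            parent=self, func=lambda data: [x for x in data if pred(x)], name="filter"
        )

    def lineage(self):
        """Return the transformation names from the source to this dataset."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.name]

    def collect(self):
        """Action: walk the recorded chain and compute the result."""
        if self.parent is None:
            return list(self.source)
        return self.func(self.parent.collect())


numbers = LazyDataset(source=range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.lineage())
print(even_squares.collect())
```

Because every dataset remembers how it was derived, a result can be recomputed from its source at any time, which is the intuition behind the "resilient" in RDD.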

High-Level Architecture

[ Ref: https://www.analyticsvidhya.com/blog/]

How a Spark Application Works

  • When a Spark application is run, Spark connects to a cluster manager and acquires executors on the worker nodes.
  • Spark splits a job into a directed acyclic graph (DAG) of stages. It then schedules the execution of these stages on the executors using a low-level scheduler provided by a cluster manager.
  • The executors run the tasks submitted by Spark in parallel.
  • Every Spark application gets its own set of executors on the worker nodes. This design provides a few benefits.
  • First, tasks from different applications are isolated from each other since they run in different JVM processes.
  • Second, scheduling of tasks becomes easier. Spark has to schedule the tasks belonging to only one application at a time. It does not have to handle the complexities of scheduling tasks from multiple concurrently running applications.
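
The flow described above can be sketched with Python's standard concurrent.futures module. This is a simplified simulation rather than Spark code: run_stage and run_application are hypothetical names, a thread pool stands in for the executors (real executors are separate JVM processes on worker nodes), and the job is a two-stage sum.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(executors, task, partitions):
    """Run one task per partition in parallel and wait for the whole stage to finish."""
    return list(executors.map(task, partitions))


def run_application(partitions, num_executors=2):
    # Each application creates its own pool, mirroring per-application executors.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # Stage 1: one task per partition computes a partial sum.
        partial_sums = run_stage(executors, sum, partitions)
        # Stage 2: depends on stage 1, so it starts only after stage 1 completes.
        return run_stage(executors, sum, [partial_sums])[0]


total = run_application([[1, 2, 3], [4, 5], [6]])
print(total)
```

Since every call to run_application builds its own pool, tasks from two applications never share workers, which is the isolation property described in the list above.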

Spark in the Real World

Online advertisers and companies such as Netflix are leveraging Spark for insights and competitive advantage. Other notable businesses also benefiting from Spark are:

  • Uber – Every day this multinational online taxi dispatch company gathers terabytes of event data from its mobile users. By using Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline, Uber can convert raw unstructured event data into structured data as it is collected, and then use it for further and more complex analytics.
  • Pinterest – Through a similar ETL pipeline, Pinterest can leverage Spark Streaming to gain immediate insight into how users all over the world are engaging with Pins—in real-time. As a result, Pinterest can make more relevant recommendations as people navigate the site and see related Pins to help them select recipes, determine which products to buy or plan trips to various destinations.
  • Conviva – Averaging about 4 million video feeds per month, this streaming video company is second only to YouTube. Conviva uses Spark to reduce customer churn by optimizing video streams and managing live video traffic—thus maintaining a consistently smooth, high-quality viewing experience.

References :

Spark: The Definitive Guide and qubole.com/blog/apache-spark-use-cases


Written by Radhakrishna Uppugunduri