Apache Spark
- Implemented in Scala
- Parallel execution framework
- supports processing of big data sets
- Partition
- Just a chunk of the data
- Task
- Java code executing on Partition data
- RDD
- Resilient Distributed Dataset
- DAG
- Directed Acyclic Graph
- POM dependencies
- spark-core
- spark-sql
- hadoop-hdfs
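As a sketch, the matching pom.xml entries might look like the following; the _2.12 Scala suffix and the version numbers are assumptions - align them with your cluster:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.3.0</version> <!-- assumed version; use the one matching your cluster -->
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.3.4</version> <!-- assumed; align with your Hadoop distribution -->
    </dependency>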
Initial dataset
JavaRDD<Integer> myRdd = sc.parallelize(inputData);
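A minimal, self-contained sketch of the setup that line assumes; the app name, local master, and inputData values are illustrative:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkSetup {
        public static void main(String[] args) {
            // local[*] runs Spark on all local cores - illustrative, not a cluster config
            SparkConf conf = new SparkConf().setAppName("startingSpark").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            List<Integer> inputData = Arrays.asList(35, 12, 90, 20);
            // parallelize splits the in-memory list into partitions across the nodes
            JavaRDD<Integer> myRdd = sc.parallelize(inputData);

            sc.close();
        }
    }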
Reduce
- Takes two values of the same type as input and returns a result of that same type
- eg: Integer result = myRdd.reduce((value1, value2) -> value1 + value2);
Mapping
- return type can be different from the input type
- Transforms the RDD structure from one form to another
- eg: JavaRDD<Double> sqrtRdd = myRdd.map(value -> Math.sqrt(value));
Foreach
- runs a void function (a procedure) against each element; note this executes on the worker nodes
- eg: sqrtRdd.foreach(value -> System.out.println(value));
Collect
- Gathers the data from all the worker nodes back to the driver (the current working node)
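A sketch chaining reduce, map, foreach and collect together; it assumes the sc from the setup sketch above, and the input values are made up:

    JavaRDD<Integer> myRdd = sc.parallelize(Arrays.asList(4, 9, 16));

    Integer sum = myRdd.reduce((value1, value2) -> value1 + value2);   // 29
    JavaRDD<Double> sqrtRdd = myRdd.map(value -> Math.sqrt(value));    // 2.0, 3.0, 4.0

    // foreach runs on the executors, so the printing happens on the workers
    sqrtRdd.foreach(value -> System.out.println(value));

    // collect pulls every partition back to the driver - only safe for small results
    List<Double> results = sqrtRdd.collect();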
Tuples
- store related objects together instead of defining a new class
- eg: Tuple3<String, String, String> items = new Tuple3<>("one", "two", "three"); //scala.Tuple2, Tuple3, ... work from Java
PairRDD
- similar to a table with 2 columns: the first holds the Key, the second the Value
- PairRDD allows rich operations against keys
- Group by key
- produces another RDD of type JavaPairRDD<String, Iterable<String>>
- not performant - avoid using it; prefer reduceByKey
- Reduce by key
- produces another RDD of type JavaPairRDD<String, Integer> //eg: group by key to get a count per key
- Combine by key
- Needs 3 function inputs (see the sketch after this list)
- createCombiner
- turns the first value seen for a key into the initial combined value
- mergeValue
- merges the next value into the combined value, within a partition
- mergeCombiners
- merges the combined values from two partitions
- Aggregate by key
- Fold by key
- Sort by key
- All types of joins
- cogroup
- distinct
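Since combineByKey is the least obvious of these, here is a sketch computing an average per key; the subject/score data is made up, and it assumes the sc from the setup sketch:

    JavaPairRDD<String, Integer> scores = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("math", 90), new Tuple2<>("math", 80), new Tuple2<>("art", 70)));

    // combineByKey builds a (sum, count) pair per key
    JavaPairRDD<String, Tuple2<Integer, Integer>> sumCounts = scores.combineByKey(
            value -> new Tuple2<>(value, 1),                               // createCombiner: first value for a key
            (acc, value) -> new Tuple2<>(acc._1() + value, acc._2() + 1),  // mergeValue, within a partition
            (a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()));     // mergeCombiners, across partitions

    JavaPairRDD<String, Double> averages =
            sumCounts.mapValues(t -> (double) t._1() / t._2());            // math -> 85.0, art -> 70.0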
FlatMaps
- split sentences into words - one input element maps to zero or more output elements (see the word-count sketch below)
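The classic word count illustrates this; the sample sentences are made up, and the sc is again the one from the setup sketch:

    JavaRDD<String> sentences = sc.parallelize(
            Arrays.asList("spark is fast", "spark is fun"));

    // flatMap: one sentence in, many words out
    JavaRDD<String> words = sentences.flatMap(
            sentence -> Arrays.asList(sentence.split(" ")).iterator());

    // mapToPair + reduceByKey to count each word
    JavaPairRDD<String, Long> counts = words
            .mapToPair(word -> new Tuple2<>(word, 1L))
            .reduceByKey((a, b) -> a + b);   // spark=2, is=2, fast=1, fun=1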
Other methods
- take(n) //take first n elements
Streaming
- Live data processing in real time, while the data is still being generated
- Two types of streams
- DStreams - based on the RDD API
- Structured Streaming - based on the Spark SQL engine (see the sketch below)
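A minimal Structured Streaming sketch doing a running word count over a socket source; the host/port are placeholders:

    import java.util.Arrays;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class StructuredStreamingSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("structuredStreaming").master("local[*]").getOrCreate();

            // treat the lines arriving on the socket as an unbounded table
            Dataset<Row> lines = spark.readStream()
                    .format("socket")
                    .option("host", "localhost")   // placeholder host/port
                    .option("port", 9999)
                    .load();

            Dataset<String> words = lines.as(Encoders.STRING())
                    .flatMap((FlatMapFunction<String, String>) line ->
                            Arrays.asList(line.split(" ")).iterator(), Encoders.STRING());

            // running word count, re-printed to the console on every trigger
            StreamingQuery query = words.groupBy("value").count()
                    .writeStream().outputMode("complete").format("console").start();
            query.awaitTermination();
        }
    }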