Apache Spark
- Implemented in Scala
- Parallel execution framework
- supports processing of big data sets
- Partition
- Just a chunk of the data
- Task
- Java code executing on Partition data
- RDD
- Resilient Distributed Dataset
- DAG
- Directed Acyclic Graph
- POM dependencies
- spark-core
- spark-sql
- hadoop-hdfs
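As a sketch, the matching pom.xml entries might look like the following; the _2.12 Scala suffix and the version numbers are assumptions - align them with your cluster:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.3.0</version> <!-- assumed version; use the one matching your cluster -->
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>3.3.4</version> <!-- assumed; align with your Hadoop distribution -->
    </dependency>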
Initial dataset
JavaRDD<Integer> myRdd = sc.parallelize(inputData);
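A minimal, self-contained sketch of the setup that line assumes; the app name, local master, and inputData values are illustrative:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkSetup {
        public static void main(String[] args) {
            // local[*] runs Spark on all local cores - illustrative, not a cluster config
            SparkConf conf = new SparkConf().setAppName("startingSpark").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            List<Integer> inputData = Arrays.asList(35, 12, 90, 20);
            // parallelize splits the in-memory list into partitions across the nodes
            JavaRDD<Integer> myRdd = sc.parallelize(inputData);

            sc.close();
        }
    }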
Reduce
- Takes two values of the same type as input and returns a result of that same type
- eg: Integer result = myRdd.reduce((value1, value2) -> value1 + value2);
Mapping
- return type can be different from the input type
- Transforms the RDD structure from one form to another
- eg: JavaRDD<Double> sqrtRdd = myRdd.map(value -> Math.sqrt(value));
Foreach
- runs a void function (a procedure) against each element; note this executes on the worker nodes
- eg: sqrtRdd.foreach(value -> System.out.println(value));
Collect
- Gathers the data from all the worker nodes back to the driver (the current working node)
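A sketch chaining reduce, map, foreach and collect together; it assumes the sc from the setup sketch above, and the input values are made up:

    JavaRDD<Integer> myRdd = sc.parallelize(Arrays.asList(4, 9, 16));

    Integer sum = myRdd.reduce((value1, value2) -> value1 + value2);   // 29
    JavaRDD<Double> sqrtRdd = myRdd.map(value -> Math.sqrt(value));    // 2.0, 3.0, 4.0

    // foreach runs on the executors, so the printing happens on the workers
    sqrtRdd.foreach(value -> System.out.println(value));

    // collect pulls every partition back to the driver - only safe for small results
    List<Double> results = sqrtRdd.collect();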
Tuples
- store related objects together instead of defining a new class
- eg: Tuple3<String, String, String> items = new Tuple3<>("one", "two", "three"); //scala.Tuple2, Tuple3, ... work from Java
PairRDD
- similar to a table with 2 columns: the first holds the Key, the second the Value
- PairRDD allows rich operations against keys
- Group by key
- produces another RDD of type JavaPairRDD<String, Iterable<String>>
- not performant - avoid using it; prefer reduceByKey
- Reduce by key
- produces another RDD of type JavaPairRDD<String, Integer> //eg: group by key to get a count per key
- Combine by key
- Needs 3 function inputs (see the sketch after this list)
- createCombiner
- turns the first value seen for a key into the initial combined value
- mergeValue
- merges the next value into the combined value, within a partition
- mergeCombiners
- merges the combined values from two partitions
- Aggregate by key
- Fold by key
- Sort by key
- All types of joins
- cogroup
- distinct
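Since combineByKey is the least obvious of these, here is a sketch computing an average per key; the subject/score data is made up, and it assumes the sc from the setup sketch:

    JavaPairRDD<String, Integer> scores = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("math", 90), new Tuple2<>("math", 80), new Tuple2<>("art", 70)));

    // combineByKey builds a (sum, count) pair per key
    JavaPairRDD<String, Tuple2<Integer, Integer>> sumCounts = scores.combineByKey(
            value -> new Tuple2<>(value, 1),                               // createCombiner: first value for a key
            (acc, value) -> new Tuple2<>(acc._1() + value, acc._2() + 1),  // mergeValue, within a partition
            (a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()));     // mergeCombiners, across partitions

    JavaPairRDD<String, Double> averages =
            sumCounts.mapValues(t -> (double) t._1() / t._2());            // math -> 85.0, art -> 70.0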
FlatMaps
- split sentences into words - one input element maps to zero or more output elements (see the word-count sketch below)
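The classic word count illustrates this; the sample sentences are made up, and the sc is again the one from the setup sketch:

    JavaRDD<String> sentences = sc.parallelize(
            Arrays.asList("spark is fast", "spark is fun"));

    // flatMap: one sentence in, many words out
    JavaRDD<String> words = sentences.flatMap(
            sentence -> Arrays.asList(sentence.split(" ")).iterator());

    // mapToPair + reduceByKey to count each word
    JavaPairRDD<String, Long> counts = words
            .mapToPair(word -> new Tuple2<>(word, 1L))
            .reduceByKey((a, b) -> a + b);   // spark=2, is=2, fast=1, fun=1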
Other methods
- take(n) //take first n elements
Streaming
- Live data processing in real time, while the data is still being generated
- Two types of streams
- DStreams - based on the RDD API
- Structured Streaming - based on the Spark SQL engine (see the sketch below)
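A minimal Structured Streaming sketch doing a running word count over a socket source; the host/port are placeholders:

    import java.util.Arrays;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class StructuredStreamingSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("structuredStreaming").master("local[*]").getOrCreate();

            // treat the lines arriving on the socket as an unbounded table
            Dataset<Row> lines = spark.readStream()
                    .format("socket")
                    .option("host", "localhost")   // placeholder host/port
                    .option("port", 9999)
                    .load();

            Dataset<String> words = lines.as(Encoders.STRING())
                    .flatMap((FlatMapFunction<String, String>) line ->
                            Arrays.asList(line.split(" ")).iterator(), Encoders.STRING());

            // running word count, re-printed to the console on every trigger
            StreamingQuery query = words.groupBy("value").count()
                    .writeStream().outputMode("complete").format("console").start();
            query.awaitTermination();
        }
    }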