Apache Spark

  • Implement in Scala
  • Parallel execution framework 
  • supports to perform big data sets
  • Partition
    • Just a data
  • Task
    • Java code executing on Partition data
  • RDD
    • Resilient Distributed Dataset
  • DAG
    • Directed Acyclic Graph
  • POM dependencies
    • Spark-core
    • Spark-sql
    • Hadoop-hdfs

Initial dataset

JavaRDD<Integer>  myRdd=sc.parallelize(inputdata);

Reduce

  • Takes input as 2 variables and returns output of same type 
  • eg: result = myRdd.reduce((value1,value2)-> value1+value2);

Mapping

  • return type can be different from input data
  • Transform rdd structure from one form to another
  •  eg:  JavaRDD<double> sqrtRdd= myRdd.map(value-> Math.sqrt(value));

 Procedure

  • sqrtRdd. forEach(value -> print(value));

Collect

  • Get all data from different nodes to current working node

Tuples

  • storing related objects together instead of having a new class
  • eg  var itmes = ("one","two","three")

 PairRDD

  • PairRDD allows rich operations against keys
  • Group by key
    • produces another RDD of type PairRDD<String, Iterable<String>>
  • Reduce by key
    • produces another RDD of type PairRDD<String, Integer>  //group by to get count by key
  • Combine by key
    • Need 3 inputs
      • Create Combiner
        • converts any value into key, value pair
      • Combiner by function
        • combines values within a partition
      • Merge function
        • combines two partitions
  • Aggregate by key
  • Fold by key
  • sort by key
  • All type of joins
  • cogroup 
  • distinct
  • similar to table with 2 columns
  • first with Key and 2nd with Value
  • methods 
    • Group BY (this is not performant - avoid using it)
    • Reduce By Key
      • similar to group by KEY an

FlatMaps

  • split sentences in to words

Other methods

  • take(n)  //take first n elements
  •  

Streaming

  • Live data processing in real time- at same time the data is being generated
  • Two types of streams
    • DStreams based on RDD api
    • Structured Streaming- based on SparkSQL engine

Comments