Apache Spark
Implemented in Scala; a parallel execution framework for processing big data sets.

Partition: just a chunk of the data.
Task: code executing on one partition's data.
RDD: Resilient Distributed Dataset.
DAG: Directed Acyclic Graph.

POM dependencies: spark-core, spark-sql, hadoop-hdfs.

Initial dataset:
    JavaRDD<Integer> myRdd = sc.parallelize(inputData);

Reduce: takes two values as input and returns a result of the same type, e.g.
    Integer result = myRdd.reduce((value1, value2) -> value1 + value2);

Map: transforms the RDD from one form to another; the return type can differ from the input type, e.g.
    JavaRDD<Double> sqrtRdd = myRdd.map(value -> Math.sqrt(value));
(Generic type parameters must be boxed types such as Double, not double.)

foreach: runs an action on each element, e.g.
    sqrtRdd.foreach(value -> System.out.println(value));

Collect: gathers all the data from the different nodes back to the current working (driver) node.

Tuples: store related objects together instead of defining a new class, e.g. (Scala)
    var items = ("one", "two", "three")

PairRDD: allows rich operations against keys; groupByKey produces another RDD of (key, collection-of-values) pairs.
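The reduce and map operations above follow the same functional contract as Java streams: reduce combines two values of one type into one, while map may change the element type. A minimal local sketch of that contract (plain Java streams rather than Spark, so it runs without a cluster; the class and method names here are illustrative, not part of Spark):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RddSemantics {
    // reduce: (value1, value2) -> value1 + value2, same type in and out
    static int sumReduce(List<Integer> data) {
        return data.stream().reduce(0, (value1, value2) -> value1 + value2);
    }

    // map: Integer in, Double out -- the element type changes
    static List<Double> sqrtMap(List<Integer> data) {
        return data.stream()
                .map(value -> Math.sqrt(value))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> inputData = Arrays.asList(1, 4, 9, 16);
        System.out.println(sumReduce(inputData)); // 30
        System.out.println(sqrtMap(inputData));   // [1.0, 2.0, 3.0, 4.0]
    }
}
```

In Spark the same lambdas are shipped to the executors and applied per partition; the stream version only shows the per-element semantics.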
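The groupByKey behaviour on a PairRDD can likewise be sketched locally: given (key, value) pairs, grouping yields one (key, collection-of-values) entry per key. This sketch uses Map.Entry as a stand-in for the scala.Tuple2 pairs a JavaPairRDD holds; the helper name is hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PairRddSketch {
    // local stand-in for JavaPairRDD.groupByKey(): key -> all values for that key
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("warn", 1), Map.entry("error", 1), Map.entry("warn", 1));
        // one entry per key: warn -> [1, 1], error -> [1]
        System.out.println(groupByKey(pairs));
    }
}
```

On a real cluster groupByKey shuffles all values for a key to one node, which is why reduceByKey (combining values before the shuffle) is usually preferred for aggregations.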