Tuesday, December 22, 2015

The Role of Spark in Big Data: Introduction

Big Data:
Batch and real time

For the setup, download the latest Linux flavor (Ubuntu 15.10), then install VMware Player.

VMware Player download:
 
http://www.vmware.com/my_maintenance.html#desktop_end_user_computing/vmware_player/7_0

Allocate around 6 GB of RAM in VMware Player, then load the Ubuntu 64-bit .iso file.
Where data comes from:
Traditional systems (RDBMS): purely business/profit-oriented processing, such as transactional data.
Web + IoT (Internet of Things): the web generates huge volumes of data, and devices connected to the internet also generate data; this is called IoT data.
Roughly 80% of data comes from the web and IoT.


Batch processing: processing defined as a series of jobs.

For example, bank transactions are handled in batches: at some interval, all the NEFT/RTGS transactions are triggered.
Input data is collected over a period of time; it is a non-continuous process.
Batch window example: collect logs, run log analysis, send back the response, then make recommendations; this is a delayed process. The same approach works for system performance analysis.
In log analysis, KPIs (Key Performance Indicators) are computed.
This is purely batch.
Another example: billing applications.
Backups: taken during non-critical hours, also as batch jobs.
Batch window: the time window within which the batch jobs must complete. When we require a fast response, batch is not suitable.


Mainframes are the classic batch processing systems.
Offering recommendations is another example: take past generated data and produce recommendations from it.
Batch processing becomes complex to use when faster results are needed; if the data is large, it won't be processed quickly.
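As a minimal sketch of batch-style log analysis in plain Scala (the log lines and the "error rate" KPI are illustrative, not from any real system): the whole batch is collected first, then processed in one go at the end of the window.

```scala
// Hypothetical batch of log lines collected over a period of time.
val logs = List(
  "INFO  user login",
  "ERROR payment failed",
  "INFO  page view",
  "ERROR timeout",
  "INFO  user logout"
)

// KPI: error rate over the whole batch, computed once at the end of the window.
val errorCount = logs.count(_.startsWith("ERROR"))
val errorRate  = errorCount.toDouble / logs.size

println(f"error rate = $errorRate%.2f")  // 2 errors out of 5 lines -> 0.40
```

The point is the shape of the computation: nothing is produced until all the input for the window has arrived, which is why batch results are inherently delayed.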
Real-time data: the data changes continually.
The differences are small, on the order of milliseconds.
Latency: the time interval between request and response.
Low latency means a fast response.
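A rough way to see latency as "the time between request and response" (a sketch with a simulated request handler; the 5 ms sleep stands in for real work and is arbitrary):

```scala
// Simulated request handler: sleeping stands in for real processing work.
def handleRequest(): String = { Thread.sleep(5); "response" }

val start     = System.nanoTime()
val response  = handleRequest()
val latencyMs = (System.nanoTime() - start) / 1e6  // request-to-response interval

println(s"latency = $latencyMs ms")  // low latency => fast response
</imports>
```
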

Air Traffic
Hospital
Bank ATM

These systems run in real time.
Example:
AppNexus
This is a third-party app vendor that gives recommendations based on users' credit. How does it do this?
AppNexus has customer credit data; based on how good a customer's credit score is, it pops up recommendations on websites.


In-Memory Processing:

Big data → batch data → Hadoop
Big data → real time → Storm (can do real time, but not batch jobs)

Hadoop does disk-based computations.
Spark → can do both batch and real-time processing.

Data pipeline:
If you want to build a data pipeline, you use the following stages.

Any big data process
1>      Ingestion – Flume, Kafka, Sqoop
2>      Storage – HDFS, HBase, Cassandra
3>      Processing – pre-processing and post-processing: MR, Hive, Pig, Spark
4>      Scheduling – Oozie
5>      Model (post-processing tools) – BI tools: QlikView, Tableau, and so on
Doing all this requires a lot of integration.
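The five stages above can be sketched as composed functions (a toy model in plain Scala; the stage bodies are illustrative stand-ins, not real Flume/HDFS/Spark APIs):

```scala
// Toy pipeline: each stage is a function, and the pipeline composes them.
val ingest:  String => List[String]       = raw => raw.split(",").toList     // e.g. Flume/Kafka/Sqoop
val store:   List[String] => List[String] = recs => recs.map(_.trim)         // e.g. HDFS/HBase/Cassandra
val process: List[String] => Map[String, Int] =
  recs => recs.groupBy(identity).map { case (k, v) => k -> v.size }          // e.g. MR/Hive/Pig/Spark

val pipeline = ingest andThen store andThen process

val result = pipeline("a, b, a, c")
println(result)  // counts: a -> 2, b -> 1, c -> 1
```

The integration pain the notes mention comes from the fact that in a real pipeline each stage is a separate system, not a function in one program; Spark's appeal is collapsing several of these stages into one engine.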

MLlib – Spark's machine learning library.
Mahout – Hadoop's machine learning library.

A typical Spark-based project:

→  Ingest the data with Kafka
→  Spark Streaming
→  Store data into Cassandra
→  Spark jobs:
applying regular expressions,
iterative algorithms
Apache Spark – core
Spark SQL
Spark Streaming
MLlib

Hadoop is written in Java.
Spark is written in Scala.

Scala has a lot of advantages.

It is both an object-oriented and a functional programming language.
2,000 lines of Java can come down to about 100 lines of Scala.
One function can be passed to another function, and many functions can be written inline:
fun1(fun2) ... like this.
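The fun1(fun2) idea — passing one function into another — looks like this in Scala (the names fun1/fun2 are illustrative):

```scala
// fun2: a plain function.
def fun2(x: Int): Int = x * 2

// fun1: a higher-order function that takes another function as an argument.
def fun1(f: Int => Int): Int = f(10)

val r = fun1(fun2)         // passes fun2 into fun1
println(r)                 // 20

// Inline (anonymous) functions work too:
val r2 = fun1(x => x + 1)
println(r2)                // 11
```

This higher-order style is what lets Spark's API stay so short: operations like map and filter all take functions as arguments.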
First we need to learn Scala, then Spark.

Spark – in-memory computations.
Hadoop – disk-based data processing engine.
Spark 1.5 can be used with Python or Scala.
2015 – DataFrames introduced.
Spark is often cited as up to 100x faster than Hadoop MapReduce for in-memory workloads.

Word count takes many lines in Java, but only about 2 lines in Scala.
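As a sketch of that conciseness claim, word count on plain Scala collections fits in roughly two lines (the real Spark version is nearly identical in shape, but starts from an RDD instead of a local string):

```scala
val text = "to be or not to be"

// Word count in ~2 lines of Scala.
val counts = text.split(" ").groupBy(identity).map { case (w, ws) => w -> ws.length }

println(counts)  // counts: to -> 2, be -> 2, or -> 1, not -> 1
```

The Java MapReduce equivalent needs a Mapper class, a Reducer class, and a driver, which is where the "2,000 lines vs. 100 lines" contrast comes from.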

If data is lost in Spark, it can be recomputed using the lineage graph.
Hadoop can run
Standalone
Pseudo-distributed
Fully distributed


Spark can run on the cluster managers below:
Standalone     -----   YARN    -----   Mesos
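Choosing the cluster manager comes down to the --master setting when submitting a job (a command sketch; host names, ports, and app.jar are placeholders):

```shell
# Local run (uses all local cores)
spark-submit --master "local[*]" app.jar

# Standalone cluster manager
spark-submit --master spark://master-host:7077 app.jar

# YARN
spark-submit --master yarn app.jar

# Mesos
spark-submit --master mesos://master-host:5050 app.jar
```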


RDD (Resilient Distributed Dataset) – data held in the memory of the cluster machines.
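A toy model of the lineage idea in plain Scala (ToyRDD is illustrative, not the real RDD API): each dataset remembers how it was derived, so if the in-memory copy is lost, it can be recomputed from its parent.

```scala
// Toy "RDD": holds a recipe (lineage) as well as an optional cached result.
class ToyRDD[A](compute: () => List[A]) {
  private var cache: Option[List[A]] = None

  def collect(): List[A] = cache match {
    case Some(data) => data                        // served from memory
    case None       => { val d = compute(); cache = Some(d); d }
  }

  def dropCache(): Unit = cache = None             // simulate losing the in-memory copy

  def map[B](f: A => B): ToyRDD[B] =
    new ToyRDD(() => collect().map(f))             // child's lineage points at the parent
}

val source  = new ToyRDD(() => List(1, 2, 3))
val doubled = source.map(_ * 2)

val first = doubled.collect()   // List(2, 4, 6)
doubled.dropCache()             // "lose" the partition
val again = doubled.collect()   // recomputed via lineage: List(2, 4, 6)
```

Real RDDs work per-partition and distribute across machines, but the recovery principle is the same: store the recipe, not just the result.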