Big Data: Batch and Real Time
Setup: download the latest Linux flavour, Ubuntu 15.10, and then VMware Player.
VMware Player download: http://www.vmware.com/my_maintenance.html#desktop_end_user_computing/vmware_player/7_0
In VMware Player, allocate around 6 GB of RAM to the VM.
Then load the Ubuntu 64-bit .iso file.
Where data is used in traditional systems:
RDBMS – purely for business/profit-oriented processing, i.e. transactional data.
Web + IoT (Internet of Things) – the web generates huge amounts of data, and devices connected to the internet also generate data; this is IoT data.
Around 80% of data comes from IoT.
Batch processing: processing defined as a series of jobs.
Example: bank transactions are grouped into batches; at some interval all the NEFT/RTGS transactions are triggered together.
Input data is collected over a period of time; it is a non-continuous process.
Batch window: for example log analysis, where the logs are processed and the response/recommendations are sent back later, so it is a delayed process. The same can be done for system-performance analysis.
From the log analysis they can compute KPIs (Key Performance Indicators). This is purely batch.
Another example: billing applications.
Backups: they are taken outside critical hours; these are also batch jobs.
Sometimes we require a fast response, but batch jobs only run inside their batch window.
Mainframes are the classic batch processing systems.
Offering recommendations: recommendations are generated from the data accumulated in the past.
Batch processing becomes complex when it is expected to produce fast results: if the data is large, it won't be processed quickly.
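A minimal sketch of such a batch KPI job in plain Scala, assuming a hypothetical access.log file collected over the batch window, with the HTTP status code as the ninth space-separated field (file name and format are illustrative assumptions):

import scala.io.Source

// Batch-style KPI sketch: read a log file that was collected over a period
// of time, then compute one KPI in a single run.
// "access.log" and its field layout are assumptions for illustration only.
object LogKpiBatch {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("access.log").getLines().toList

    // KPI: error rate = share of requests with a 5xx status code
    val total  = lines.size
    val errors = lines.count { line =>
      val fields = line.split(" ")
      fields.length > 8 && fields(8).startsWith("5")
    }

    val errorRate = if (total == 0) 0.0 else errors.toDouble / total
    println(f"Processed $total requests, error rate = ${errorRate * 100}%.2f%%")
  }
}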
Real-time:
The data changes continually, at small intervals, on the order of milliseconds.
Latency: the time interval between request and response.
Low latency means a fast response.
Examples of systems that currently do real-time processing: air traffic control, hospitals, banks, ATMs.
Example: AppNexus.
AppNexus is a third-party app vendor that gives recommendations based on users' credit. How does it do this?
AppNexus holds customer credit data; based on how good a customer's credit score is, it pops recommendations on the websites the customer visits.
In-Memory Processing:
Big data -> batch data -> Hadoop
Big data -> real time -> Storm (can do real-time, but not batch jobs)
Hadoop is for disk-based computations.
Spark -> can do both batch and real-time processing.
Data pipeline:
If you want to build any data pipeline (any big data process), you use the following stages:
1> Ingestion – Flume, Kafka, Sqoop
2> Storage – HDFS, HBase, Cassandra
3> Processing (pre-processing and post-processing) – MapReduce, Hive, Pig, Spark
4> Scheduling – Oozie
5> Model (post-processing / BI tools) – QlikView, Tableau, and so on
Doing this requires a lot of integration.
MLlib – Spark's machine learning library.
Mahout – machine learning library (for Hadoop).
A basic Spark project looks like:
-> Ingesting the data
-> Kafka
-> Spark Streaming
-> Storing the data into Cassandra
-> Spark jobs: applying regular expressions, iterative algorithms
(a sketch of this pipeline is given below)
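A hedged sketch of that pipeline using the Spark 1.x streaming API together with the spark-streaming-kafka and spark-cassandra-connector packages; the "events" topic, "demo" keyspace, "words" table and its columns are made-up names, and the exact APIs vary by version:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
import com.datastax.spark.connector._   // adds saveToCassandra on RDDs

// Kafka -> Spark Streaming -> Cassandra, with a small Spark job per micro-batch.
object KafkaToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KafkaToCassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // Per-batch Spark job: apply a regular expression, count the matches,
    // and store the result into Cassandra.
    val wordPattern = "[a-zA-Z]+".r
    stream.map(_._2)
      .flatMap(msg => wordPattern.findAllIn(msg).toList)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .foreachRDD(rdd => rdd.saveToCassandra("demo", "words", SomeColumns("word", "count")))

    ssc.start()
    ssc.awaitTermination()
  }
}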
Apache Spark components: Spark Core, Spark SQL, Spark Streaming, MLlib.
Hadoop is written in Java; Spark is written in Scala.
Scala has a lot of advantages:
It is both an object-oriented and a functional programming language.
2,000 lines of Java can come down to around 100 lines of Scala.
One function can take and call other functions, and there are many inline (anonymous) functions, e.g. fun1(fun2).
First we need to learn Scala, then Spark. (A small sketch of this follows.)
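A tiny Scala sketch of the fun1(fun2) idea; the names fun1 and fun2 are hypothetical:

// fun1 is a higher-order function: it takes another function as its argument.
object HigherOrderDemo {
  // fun2: an ordinary function
  val fun2: Int => Int = x => x * x

  // fun1: takes a function and applies it to every element of a list
  def fun1(f: Int => Int): List[Int] = List(1, 2, 3).map(f)

  def main(args: Array[String]): Unit = {
    println(fun1(fun2))          // List(1, 4, 9)
    println(fun1(x => x + 10))   // inline (anonymous) function: List(11, 12, 13)
  }
}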
Spark – in-memory computations.
Hadoop – disk-based data processing / computation engine.
Spark 1.5 can be used with Python or Scala.
2015 – DataFrames were introduced.
Claimed to be up to 100x faster than Hadoop MapReduce for in-memory workloads.
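A hedged DataFrames sketch in the Spark 1.x style (SQLContext); the people.json file and its fields are assumptions, and newer Spark versions would use SparkSession instead:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal DataFrames example: load JSON, inspect the schema, run a filter.
object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DataFrameDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val people = sqlContext.read.json("people.json")   // schema inferred from the JSON
    people.printSchema()
    people.filter(people("age") > 21).show()           // SQL-like operation on a DataFrame
    sc.stop()
  }
}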
A word count that takes many lines in Java is about 2 lines in Scala with Spark.
If data held in Spark is lost, it can be rebuilt using the lineage graph instead of keeping backups.
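The Spark word count in Scala (the core logic is the two transformation lines); the input path is an assumption, and the cache() call marks the in-memory data that would be recomputed from the lineage graph if lost:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.cache()                        // keep in memory; a lost partition is
                                          // recomputed from the lineage graph
    counts.collect().foreach(println)
    sc.stop()
  }
}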
Hadoop can run in three modes:
Standalone
Pseudo-distributed
Fully distributed
Spark can run on:
Standalone, YARN, Mesos
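A hedged sketch of pointing the same Spark application at the different cluster managers just by changing the master URL (host names and ports below are placeholders):

import org.apache.spark.SparkConf

// Same application, different cluster managers: only the master URL changes.
val localConf      = new SparkConf().setAppName("demo").setMaster("local[*]")            // single machine
val standaloneConf = new SparkConf().setAppName("demo").setMaster("spark://master:7077") // Spark standalone
val mesosConf      = new SparkConf().setAppName("demo").setMaster("mesos://master:5050") // Mesos
// On YARN the master is usually supplied at submit time, e.g. spark-submit --master yarn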
RDD – data kept in the memory of the cluster machines.
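A small spark-shell style sketch of an RDD kept in the memory of the cluster machines (sc is the shell's predefined SparkContext; the numbers are just sample data):

import org.apache.spark.storage.StorageLevel

// An RDD is split into partitions spread across the workers; once persisted,
// those partitions live in the workers' RAM.
val rdd = sc.parallelize(1 to 1000000, numSlices = 8)   // 8 partitions across the cluster
rdd.persist(StorageLevel.MEMORY_ONLY)                   // keep the partitions in memory
println(rdd.sum())                                      // first action materialises and caches it
println(rdd.toDebugString)                              // prints the lineage used for recovery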