Tuesday, December 22, 2015

The Role of Spark in Big Data: Introduction

Big Data:
Batch and real time

For the setup, download the latest Linux flavor (Ubuntu 15.10), then install VMware Player.

VMware Player download:
 
http://www.vmware.com/my_maintenance.html#desktop_end_user_computing/vmware_player/7_0

Allocate around 6 GB of RAM in VMware Player, then load the Ubuntu 64-bit .iso file.
Where data comes from:
Traditional systems (RDBMS): purely business/profit-oriented processing, such as transactional data.
Web + IoT (Internet of Things): the web generates huge volumes of data, and devices connected to the internet also generate data; this is called IoT data.
Roughly 80% of data comes from the web and IoT.


Batch processing: processing defined as a series of jobs.

For example, bank transactions are handled in batches: at some interval, all the NEFT/RTGS transactions are triggered.
Input data is collected over a period of time; it is a non-continuous process.
Batch window example: collect logs, run log analysis, send back the response, then make recommendations; this is a delayed process. The same approach works for system performance analysis.
In log analysis, KPIs (Key Performance Indicators) are computed.
This is purely batch.
Another example: billing applications.
Backups: taken during non-critical hours, also as batch jobs.
Batch window: the time window within which the batch jobs must complete. When we require a fast response, batch is not suitable.


Mainframes are the classic batch processing systems.
Offering recommendations is another example: take past generated data and produce recommendations from it.
Batch processing becomes complex to use when faster results are needed; if the data is large, it won't be processed quickly.
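As a minimal sketch of batch-style log analysis in plain Scala (the log lines and the "error rate" KPI are illustrative, not from any real system): the whole batch is collected first, then processed in one go at the end of the window.

```scala
// Hypothetical batch of log lines collected over a period of time.
val logs = List(
  "INFO  user login",
  "ERROR payment failed",
  "INFO  page view",
  "ERROR timeout",
  "INFO  user logout"
)

// KPI: error rate over the whole batch, computed once at the end of the window.
val errorCount = logs.count(_.startsWith("ERROR"))
val errorRate  = errorCount.toDouble / logs.size

println(f"error rate = $errorRate%.2f")  // 2 errors out of 5 lines -> 0.40
```

The point is the shape of the computation: nothing is produced until all the input for the window has arrived, which is why batch results are inherently delayed.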
Real-time data: the data changes continually.
The differences are small, on the order of milliseconds.
Latency: the time interval between request and response.
Low latency means a fast response.
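A rough way to see latency as "the time between request and response" (a sketch with a simulated request handler; the 5 ms sleep stands in for real work and is arbitrary):

```scala
// Simulated request handler: sleeping stands in for real processing work.
def handleRequest(): String = { Thread.sleep(5); "response" }

val start     = System.nanoTime()
val response  = handleRequest()
val latencyMs = (System.nanoTime() - start) / 1e6  // request-to-response interval

println(s"latency = $latencyMs ms")  // low latency => fast response
</imports>
```
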

Air Traffic
Hospital
Bank ATM

These systems run in real time.
Example:
AppNexus
This is a third-party app vendor that gives recommendations based on users' credit. How does it do this?
AppNexus has customer credit data; based on how good a customer's credit score is, it pops up recommendations on websites.


In-Memory Processing:

Big data → batch data → Hadoop
Big data → real time → Storm (can do real time, but not batch jobs)

Hadoop does disk-based computations.
Spark → can do both batch and real-time processing.

Data pipeline:
If you want to build a data pipeline, you use the following stages.

Any big data process
1>      Ingestion – Flume, Kafka, Sqoop
2>      Storage – HDFS, HBase, Cassandra
3>      Processing – pre-processing and post-processing: MR, Hive, Pig, Spark
4>      Scheduling – Oozie
5>      Model (post-processing tools) – BI tools: QlikView, Tableau, and so on
Doing all this requires a lot of integration.
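The five stages above can be sketched as composed functions (a toy model in plain Scala; the stage bodies are illustrative stand-ins, not real Flume/HDFS/Spark APIs):

```scala
// Toy pipeline: each stage is a function, and the pipeline composes them.
val ingest:  String => List[String]       = raw => raw.split(",").toList     // e.g. Flume/Kafka/Sqoop
val store:   List[String] => List[String] = recs => recs.map(_.trim)         // e.g. HDFS/HBase/Cassandra
val process: List[String] => Map[String, Int] =
  recs => recs.groupBy(identity).map { case (k, v) => k -> v.size }          // e.g. MR/Hive/Pig/Spark

val pipeline = ingest andThen store andThen process

val result = pipeline("a, b, a, c")
println(result)  // counts: a -> 2, b -> 1, c -> 1
```

The integration pain the notes mention comes from the fact that in a real pipeline each stage is a separate system, not a function in one program; Spark's appeal is collapsing several of these stages into one engine.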

MLlib – Spark's machine learning library.
Mahout – Hadoop's machine learning library.

A typical Spark-based project:

→  Ingest the data with Kafka
→  Spark Streaming
→  Store data into Cassandra
→  Spark jobs:
applying regular expressions,
iterative algorithms
Apache Spark – core
Spark SQL
Spark Streaming
MLlib

Hadoop is written in Java.
Spark is written in Scala.

Scala has a lot of advantages.

It is both an object-oriented and a functional programming language.
2,000 lines of Java can come down to about 100 lines of Scala.
One function can be passed to another function, and many functions can be written inline:
fun1(fun2) ... like this.
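The fun1(fun2) idea — passing one function into another — looks like this in Scala (the names fun1/fun2 are illustrative):

```scala
// fun2: a plain function.
def fun2(x: Int): Int = x * 2

// fun1: a higher-order function that takes another function as an argument.
def fun1(f: Int => Int): Int = f(10)

val r = fun1(fun2)         // passes fun2 into fun1
println(r)                 // 20

// Inline (anonymous) functions work too:
val r2 = fun1(x => x + 1)
println(r2)                // 11
```

This higher-order style is what lets Spark's API stay so short: operations like map and filter all take functions as arguments.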
First we need to learn Scala, then Spark.

Spark – in-memory computations.
Hadoop – disk-based data processing engine.
Spark 1.5 can be used with Python or Scala.
2015 – DataFrames introduced.
Spark is often cited as up to 100x faster than Hadoop MapReduce for in-memory workloads.

Word count takes many lines in Java, but only about 2 lines in Scala.
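As a sketch of that conciseness claim, word count on plain Scala collections fits in roughly two lines (the real Spark version is nearly identical in shape, but starts from an RDD instead of a local string):

```scala
val text = "to be or not to be"

// Word count in ~2 lines of Scala.
val counts = text.split(" ").groupBy(identity).map { case (w, ws) => w -> ws.length }

println(counts)  // counts: to -> 2, be -> 2, or -> 1, not -> 1
```

The Java MapReduce equivalent needs a Mapper class, a Reducer class, and a driver, which is where the "2,000 lines vs. 100 lines" contrast comes from.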

If data is lost in Spark, it can be recomputed using the lineage graph.
Hadoop can run
Standalone
Pseudo-distributed
Fully distributed


Spark can run on the cluster managers below:
Standalone     -----   YARN    -----   Mesos
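Choosing the cluster manager comes down to the --master setting when submitting a job (a command sketch; host names, ports, and app.jar are placeholders):

```shell
# Local run (uses all local cores)
spark-submit --master "local[*]" app.jar

# Standalone cluster manager
spark-submit --master spark://master-host:7077 app.jar

# YARN
spark-submit --master yarn app.jar

# Mesos
spark-submit --master mesos://master-host:5050 app.jar
```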


RDD (Resilient Distributed Dataset) – data held in the memory of the cluster machines.
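A toy model of the lineage idea in plain Scala (ToyRDD is illustrative, not the real RDD API): each dataset remembers how it was derived, so if the in-memory copy is lost, it can be recomputed from its parent.

```scala
// Toy "RDD": holds a recipe (lineage) as well as an optional cached result.
class ToyRDD[A](compute: () => List[A]) {
  private var cache: Option[List[A]] = None

  def collect(): List[A] = cache match {
    case Some(data) => data                        // served from memory
    case None       => { val d = compute(); cache = Some(d); d }
  }

  def dropCache(): Unit = cache = None             // simulate losing the in-memory copy

  def map[B](f: A => B): ToyRDD[B] =
    new ToyRDD(() => collect().map(f))             // child's lineage points at the parent
}

val source  = new ToyRDD(() => List(1, 2, 3))
val doubled = source.map(_ * 2)

val first = doubled.collect()   // List(2, 4, 6)
doubled.dropCache()             // "lose" the partition
val again = doubled.collect()   // recomputed via lineage: List(2, 4, 6)
```

Real RDDs work per-partition and distribute across machines, but the recovery principle is the same: store the recipe, not just the result.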