Tuesday, December 22, 2015

The Role of Spark in Big Data: an introduction

Big Data:
Batch    and    real time

For the hands-on setup, download the latest Linux flavor, Ubuntu 15.10, and then VMware Player.

VMware Player download:

http://www.vmware.com/my_maintenance.html#desktop_end_user_computing/vmware_player/7_0

Allocate around 6 GB of RAM in VMware Player, then load the Ubuntu 64-bit .iso file.
Where is data used in traditional systems?
RDBMS
This is purely for business/profit-oriented processing, like transactional data.
Web + IoT (Internet of Things)
The web generates huge amounts of data, and devices connected to the internet also generate data; the latter is called IoT data.
Around 80% of data comes from IoT.


Batch processing: processing defined as a series of jobs.

For example, bank transactions are batched: at some interval, all the NEFT/RTGS transactions are triggered.
Input data is collected over a period of time; it is non-continuous process data.
Batch window: logs are analyzed, the response is sent back, and then recommendations are made; this is a delayed process. The same can be done for system performance analysis.
In the log analysis they can compute KPIs (Key Performance Indicators).
This is purely batch.
Another example – billing applications.
Backups – backups are taken during non-critical hours; these are also batch jobs.
Where we require a fast response, the batch window model does not fit.


Mainframes are classic batch processing systems.
Offer recommendations – recommendations are generated from past data.
Batch processing becomes complex when it is pushed to produce faster results; if the data is large, it won't be processed quickly.

Real-time data: the data changes continually, with differences as small as milliseconds.
Latency: the time interval between request and response is called latency.
  Low latency means a fast response.

Air Traffic
Hospital
Bank ATM

These currently run in real time.
Example:
AppNexus
This is a third-party app vendor that gives recommendations based on users' credit. How does it do this?

AppNexus has customer credit data and knows which customers have a good credit score; based on that, recommendation pop-ups are shown on websites.


In-Memory Processing:

Big data -> batch data -> Hadoop
Big data -> real time -> Storm (can do real time, not batch jobs)

Hadoop is for disk-based computations.
Spark -> can do both batch and real-time processing.

Data pipeline:
If you want to build a data pipeline, you can use the following pieces.

Any big data process:
1>      Ingestion – Flume, Kafka, Sqoop
2>      Storage – HDFS, HBase, Cassandra
3>      Processing – pre-processing and post-processing with MR, Hive, Pig, Spark
4>      Scheduling – Oozie
5>      Model (post-processing tools) – BI tools such as QlikView, Tableau, and so on
Doing this requires a lot of integration.

MLLIB – Machine learning libraries
Mahout – machine learning libraries

A basic Spark project looks like the following (a sketch of this pipeline follows below):

->  Injecting the data (Kafka)
->  Spark Streaming
->  Store the data into Cassandra
->  Spark jobs: applying regular expressions, iterative algorithms

The Spark stack:
Apache Spark – core
Spark SQL
Spark Streaming
MLlib
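
Below is a minimal Scala sketch of that pipeline: Kafka -> Spark Streaming -> regular-expression filtering -> Cassandra. The topic name "weblogs", the keyspace/table "analytics.errors", and the localhost addresses are assumptions for illustration, and it assumes the spark-streaming-kafka and spark-cassandra-connector artifacts are on the classpath.

// Kafka -> Spark Streaming -> regex filter -> Cassandra (sketch only).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._   // adds saveToCassandra on RDDs

case class ErrorEvent(host: String, message: String)

object LogPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("log-pipeline")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // assumption: local Cassandra
    val ssc = new StreamingContext(conf, Seconds(10))

    // Consume the "weblogs" topic through ZooKeeper (old high-level consumer API).
    val lines = KafkaUtils
      .createStream(ssc, "localhost:2181", "log-pipeline-group", Map("weblogs" -> 1))
      .map(_._2)

    // Apply a regular expression to keep only error lines of the form "<host> ERROR <message>".
    val errorPattern = """^(\S+)\s+ERROR\s+(.*)$""".r
    val errors = lines.flatMap {
      case errorPattern(host, msg) => List(ErrorEvent(host, msg))
      case _                       => Nil
    }

    // Store each micro-batch into Cassandra (assumed table analytics.errors with host/message columns).
    errors.foreachRDD(_.saveToCassandra("analytics", "errors"))

    ssc.start()
    ssc.awaitTermination()
  }
}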

Hadoop is written in Java.
Spark is written in Scala.

Scala has a lot of advantages:

It is an object-oriented and functional programming language.
2000 lines of Java can come down to roughly 100 lines of Scala.
One function can be passed to another function, and inline (anonymous) functions are common.
Fun1(fun2)... like this, as in the sketch below.
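
A minimal Scala sketch of the Fun1(fun2) idea, i.e. passing one function to another; the names applyTwice and square are made up for illustration.

object HigherOrder {
  val square: Int => Int = x => x * x                 // an inline (anonymous) function

  // applyTwice takes another function as its argument, i.e. fun1(fun2)
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  def main(args: Array[String]): Unit = {
    println(applyTwice(square, 3))                    // prints 81
    println(List(1, 2, 3).map(square))                // List(1, 4, 9)
  }
}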
First we need to learn Scala, then Spark.

Spark – in-memory computations.
Hadoop – disk-based data processing engine.
Spark 1.5 can be used from Python or Scala.
In 2015, DataFrames were introduced.
Spark is claimed to be far faster than Hadoop MapReduce for many workloads.

A word count that takes dozens of lines of Java MapReduce is about two lines in Scala (see the sketch below).

If in-memory data is lost in Spark, it can be rebuilt using the lineage graph.
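
A minimal word-count sketch in Scala with Spark; the input and output paths are placeholders. The heart of it is the handful of lines that split, map, and reduceByKey; if a cached partition is lost, Spark recomputes it from the lineage graph.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))
    val counts = sc.textFile("input.txt")          // RDD distributed across the cluster's memory
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("counts")                // lost partitions are rebuilt from the lineage graph
    sc.stop()
  }
}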
Hadoop can run in:
Standalone
Pseudo-distributed
Fully distributed


Spark runs on the following cluster managers (see the sketch below):
Standalone     -----   YARN    -----   MESOS
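
A small sketch showing that the same Spark code can target standalone, YARN, or Mesos just by changing the master URL; the URLs in the comments are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object ClusterManagers {
  def main(args: Array[String]): Unit = {
    val master = args.headOption.getOrElse("local[*]")   // e.g. "spark://host:7077" (standalone),
                                                         // "yarn-client" (YARN), "mesos://host:5050" (Mesos)
    val sc = new SparkContext(new SparkConf().setAppName("cluster-managers").setMaster(master))
    println(sc.parallelize(1 to 100).sum())              // trivial job just to exercise the cluster
    sc.stop()
  }
}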


RDD – a Resilient Distributed Dataset, held in the memory of the machines in the cluster.

Monday, November 23, 2015

Few More Questions in Hadoop Interview

      1)        Explain about the current roles and responsibilities ?
2)        Difference between Hadoop1 and hadoop2?
3)        How much size of the cluster your using currently?
4)        Difference between Writable and WritableComparable in MapReduce programming?
5)        Did you use combiners in MapReduce, and how are they used?
6)        Did you use counters in map reduce programming?
7)        How to write code portions in map reduce programming?
8)        How to serialize and deserialize the data in MapReduce programming?
9)        How to swap two data columns date by 11/11, 11/12 in hive.
10)     Did you use d-link how to connect with the hive?
11)     How to do bucketing in map reduce?
12)     How to merge two tables eg small and big table using pig or hive?
13)     What is purpose of the spark and what is RDD and context?
14)     Did you write UDF’s in Hive what purpose did you write?
15)     Difference between orderby and sort by in hive?
16)     What is the hive metastore and purpose of this?
17)     How to clean the data with special characters in PIG?
18)     What is kafka and how manage this into traditional way?
19)     What are the main components in STORM?
20)     Solr usage and why you used this into your project?

Thanks
Narasimha

Hadoop QA/QE Roles and Migrate notes Basics

Hadoop has 2 noteworthy Distributions

Cloudera and Hortonworks 

Significant core parts: 

HDFS (data storage) and MapReduce (processing; you can use Python/Scala/Java). The major roles companies are looking for are below:

1. Hadoop Admin 

2. Hadoop Development 

3. Hadoop QA/QE 


Hadoop Admin - equivalent to a system/web/network administrator 

Hadoop Development - for Java, Scala, or Python software developers

Hadoop QA - for testers who have at least the basics of Java/Scala 

i. How relevant is it for QA? 

Answer: 

If you learn Hadoop you have a bright future. For example, you can find jobs with requirements like: 

* Plan, code, and execute automated testing (with some manual testing where automation is impractical). This is the main focus of the position. 

* Analyze issues, report defects, and propose regression tests to catch recurrences. 

* Work closely with multiple development teams on end-to-end testing. 

* Check for data consistency across multiple sources. 

* Build simulated full-production environments for complex testing scenarios. 

* Build and deploy custom testing frameworks for REST-based APIs using JSON data formats. 

Background for this: you should know JUnit, manual and automation testing, and REST web services testing; there are plenty of openings where you need to do end-to-end testing. 

You will also come across ETL using some tools like Talend, etc. 

ii. If I learn Hadoop now, where do I begin (which position) in my career? Can I start from where I currently stand in QA? 

Yes, you can start anywhere similar to where you are currently working in QA: Hadoop QA Lead, Hadoop QA Engineer, Hadoop QA Automation, etc. 

Nothing changes in what you are currently doing; this is an additional, new technology. 

I trust you have a clear picture now. 

********* 

A lot of new technology has now come into the picture, such as: 

Spark - a newer Hadoop-ecosystem component implemented in Scala (a functional + object-oriented language) that does in-memory computation instead of disk-based computation. 

IoT + Big Data - try to focus on this; it is an exciting area. You can find YouTube videos to get a better idea of what IoT is. 


Please review and let me know whether you require any clarification on this..


Thursday, October 22, 2015

Why do we use Kafka if Flume can do the same things!!

Let's talk more about this; it seems many people face this question in interviews.

Kafka is used to bridge the architectural gaps between Flume and Storm; we have found three so far:

a)        The Flume sink uses a push mechanism for sending events, while the Storm spout uses a pull mechanism to consume events.
b)        Work submitted to Storm is called a topology; its different stages run on different nodes chosen by Storm at runtime. A Flume sink uses a push mechanism, so it must know in advance which node events need to be pushed to, which is not known here.

c)        Kafka is based on the publish-subscribe programming paradigm: in our architecture the producers (Flume) push and the consumers (Storm) pull, as sketched below.
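
To make the push/pull point concrete, here is a minimal Scala sketch against the standard Kafka client API (0.9+ style): a producer pushes to a topic and an independent consumer pulls from it at its own pace. The topic name "events", the broker address, and the group id are assumptions for illustration.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.collection.JavaConverters._

object KafkaBridge {
  def main(args: Array[String]): Unit = {
    // Producer: pushes messages to the topic without knowing who will read them.
    val prodProps = new Properties()
    prodProps.put("bootstrap.servers", "localhost:9092")
    prodProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    prodProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](prodProps)
    producer.send(new ProducerRecord[String, String]("events", "k1", "page_view"))
    producer.close()

    // Consumer: pulls messages at its own pace; Kafka retains them in the meantime.
    val consProps = new Properties()
    consProps.put("bootstrap.servers", "localhost:9092")
    consProps.put("group.id", "storm-like-consumer")
    consProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](consProps)
    consumer.subscribe(Collections.singletonList("events"))
    consumer.poll(1000).asScala.foreach(r => println(s"pulled ${r.key} -> ${r.value}"))
    consumer.close()
  }
}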


Wednesday, October 21, 2015

Interview Questions in Hadoop Experienced

Howdy All,

If you follow these questions you will certainly crack the interview, I bet on it. Please make sure you clearly understand each question and prepare your own answers. I want to share all these recent interview questions which I am facing day to day.

1)      Explain about your architecture?
2)      What are the challenges did you face at least 3 ?
3)      What are the main storm cluster components?
4)      What happens if nimbus node dead?
5)      How to do versioning in Cassandra?
6)      What is kafka and difference between kafka and flume?
7)      What is the signature of mapper ?
8)      Map reduce phases and importance?
9)      How to give the own input to the mapper directly if main method defined?
10)   What are the daemons in a Hadoop cluster?
11)   Difference between hadoop1 and hadoop2?
12)   If you migrated hadoop2 did you see any differences?
13)   What are the versions are you using in your project?
14)   What is the format of the data?
15)   What is the data size you're handling daily?
16)   How many events you fire in a second?
17)   How would you rate yourself on building the same kind of cluster?
18)   How to make custom input to your mapper?
19)   How many mappers and reducers will run for a particular size of data?
20)   How to run command line mapper program syntax?
21)   Did you work on oozie in your current project?
22)   Did you use splunk and what is your role in splunk?
23)   What is the difference between set and list?
24)   Struts flow?
25)   Difference between hash code and equals what happens if you override ?
26)   Difference between compare and comparable interfaces?
27)   What is a custom UDF, and which UDFs have you written?
28)   How to use the udf in your projects?
29)   Which version of java are you using currently?
30)   Did you involve any data conversion like .csv or json format?
31)   What are the limitations of hive?

32)   If I have 1000 records how to make insert/delete using Sqoop?

Friday, October 2, 2015

Kafka Cluster, more precisely, in the real world

· Apache Kafka is a distributed publish-subscribe messaging system open-sourced by LinkedIn.

· It comes with linear scalability and high performance, and data is replicated and partitioned.

· One might ask why Kafka is in the design when Flume provides similar functionality.

· Kafka is used to bridge the architectural gaps between Flume and Storm.

· Three examples of gaps that explain why Kafka:

o The Flume sink uses a push mechanism for sending events, while the Storm spout uses a pull mechanism for consuming events.

o Work submitted to Storm is known as a topology. Different stages of a Storm topology (MR-style jobs) run on different nodes, and Storm supervisors determine which nodes in the cluster run which stages of a topology. Essentially, the list of nodes where a particular topology will run is not known in advance.

o The Flume sink uses a push mechanism, so it must know in advance where events need to be pushed, which is not the case here.

o In the POC we did not use Kafka, as we built a custom Flume sink to push events directly to Storm since only a single-node cluster was being used. This is not going to work once the Storm cluster has more than one node.

· Kafka, on the other hand, is based on the publish-subscribe programming paradigm. Producers push messages and consumers pull messages. In our design there is a push system (Flume) on one end and a pull system (Storm) on the other. Hence, Kafka is a perfect answer to bridge this gap and resolves both issues mentioned above.

Thursday, October 1, 2015

Apache Solr Architecture In Real World

What is Solr?

Apache Solr is a fast open-source Java search server. Solr enables you to easily create search engines that search websites, databases, and files.

Solr is the popular, blazing-fast open source enterprise search platform from the Apache Lucene project. Solr is powered by Lucene, a powerful open-source full-text search library, under the hood.

• Doug Cutting created Lucene in 1999. It was recognized as a top-level Apache Software Foundation project in 2005.

• Yonik Seeley created Solr in 2004. It was recognized as a top-level Apache Software Foundation project in 2007.

Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.

Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty.

Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.

Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

Solr makes it easy to add the capability to search through an online store through the following steps:

• Define a schema. The schema tells Solr about the contents of the documents it will be indexing. In the online store example, the schema would define fields for the product name, description, price, manufacturer, and so on. Solr's schema is powerful and flexible and allows you to tailor Solr's behavior to your application. See Documents, Fields, and Schema Design for all the details.

• Deploy Solr to your application server.

• Feed Solr the documents for which your users will search.

• Expose search functionality in your application.

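A minimal SolrJ sketch (called from Scala) of the steps above for the online-store example; the core name "products" and the field names are assumptions that must match your schema, and the plain HttpSolrClient constructor is the SolrJ 5.x style (newer versions use a Builder).

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument
import scala.collection.JavaConverters._

object StoreSearch {
  def main(args: Array[String]): Unit = {
    val solr = new HttpSolrClient("http://localhost:8983/solr/products")

    // Feed Solr a document whose fields were defined in the schema.
    val doc = new SolrInputDocument()
    doc.addField("id", "sku-1001")
    doc.addField("name", "USB phone charger")
    doc.addField("price", 12.99)
    doc.addField("manufacturer", "Acme")
    solr.add(doc)
    solr.commit()

    // Expose search functionality: query the index, not the raw text.
    val results = solr.query(new SolrQuery("name:charger")).getResults
    results.asScala.foreach(d => println(d.getFieldValue("name")))
    solr.close()
  }
}
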
Solr is able to achieve fast search responses because, instead of searching the text directly, it searches an index instead.

This is like retrieving pages in a book related to a keyword by scanning the index at the back of a book, as opposed to searching every word of every page of the book.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) into a keyword-centric data structure (word->pages).

Solr stores this index in a directory called index in the data directory.
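
A tiny Scala sketch of the inversion itself, turning a page -> words view of two toy documents into a word -> pages index:

object InvertedIndex {
  def main(args: Array[String]): Unit = {
    val pages = Map(
      1 -> "solr is a search server",
      2 -> "lucene is a search library"
    )
    // Invert page -> text into word -> set of pages containing that word.
    val index: Map[String, Set[Int]] = pages.toSeq
      .flatMap { case (page, text) => text.split("\\s+").map(word => (word.toLowerCase, page)) }
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).toSet }
    println(index("search"))   // Set(1, 2): pages containing the keyword
  }
}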

How Solr represents data

In Solr, a Document is the unit of search. An index consists of one or more Documents, and a Document consists of one or more Fields. In database terminology, a Document corresponds to a table row, and a Field corresponds to a table column.

When data is added to Solr, it goes through a series of transformations before being added to the index. This is called the analysis phase. Examples of transformations include lower-casing, removing word stems, etc. The end result of the analysis is a series of tokens, which are then added to the index. Tokens, not the original text, are what is searched when you perform a search query.

Indexed fields are fields which undergo the analysis phase and are added to the index. If a field is not indexed, it cannot be searched on.

Solr Features

• Keyword Searching – queries of terms and boolean operators
• Ranked Retrieval – sorted by relevancy score (descending order)
• Snippet Highlighting – matching terms emphasized in results
• Faceting – ability to apply filter queries based on matching fields
• Paging Navigation – limits fetch sizes to improve performance
• Result Sorting – sort the documents based on field values
• Spelling Correction – suggest corrected spelling of query terms
• Synonyms – expand queries based on a configurable definition list
• Auto-Suggestions – present a list of possible query terms
• More Like This – identifies other documents that are similar to one in a result set
• Geo-Spatial Search – locate and sort documents by distance
• Scalability – ability to break a large index into multiple shards and distribute indexing and query operations across a cluster of nodes

A complete Architecture 





Indexing Process





WebLogic Server Logging

Understanding WebLogic Logging Services

– WebLogic logging services provide facilities for writing, viewing, filtering, and listening for log messages.

– These log messages are generated by WebLogic Server instances, subsystems, and Java EE applications that run on WebLogic Server or in client JVMs.

• WebLogic Server subsystems use logging services to provide information about events
– e.g. the deployment of new applications or the failure of one or more subsystems

• Each WebLogic Server instance maintains a server log

• Logging services collect messages that are generated on multiple server instances into a single, domain-wide message log

• The domain log provides the overall status of the domain







Log Message Format

Here is an example of a message in the server log file:

####<Sept 22, 2004 10:46:51 AM EST> <Notice> <WebLogicServer> <MyComputer> <examplesServer> <main> <<WLS Kernel>> <> <null> <1080575211904> <BEA-000360> <Server started in RUNNING mode>

• In this example, the message attributes are: Locale-formatted Timestamp, Severity, Subsystem, Machine Name, Server Name, Thread ID, User ID, Transaction ID, Diagnostic Context ID, Raw Time Value, Message ID, and Message Text.




Timestamp – Time and date when the message originated. The Java Virtual Machine (JVM) that runs each WebLogic Server instance refers to the host computer operating system for information about the local time zone and format.

Severity – Indicates the degree of impact or seriousness of the event reported by the message.

Subsystem – Indicates the subsystem of WebLogic Server that was the source of the message; for example, Enterprise JavaBeans (EJB) container or Java Messaging Service (JMS).

Machine Name, Server Name, Thread ID – Identify the origins of the message:
• Server Name is the name of the WebLogic Server
• Machine Name is the DNS name of the computer
• Thread ID is the ID that the JVM assigns to the thread in which the message originated

User ID – The user ID under which the associated event was executed.

Transaction ID – Present only for messages logged within the context of a transaction.

Diagnostic Context ID – Context information to correlate messages coming from a specific request or application.

Raw Time Value – The timestamp in milliseconds.

Message ID – A unique six-digit identifier. All message IDs that WebLogic Server system messages generate start with BEA- and fall within a numerical range of 0-499999.

Message Text – A description of the event or condition.
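
As a small illustration, here is a hedged Scala sketch that splits the example server-log line above into its angle-bracketed attributes, which is the usual first step before feeding such logs into downstream log analysis:

object WebLogicLogParser {
  def main(args: Array[String]): Unit = {
    val line = "####<Sept 22, 2004 10:46:51 AM EST> <Notice> <WebLogicServer> <MyComputer> " +
      "<examplesServer> <main> <<WLS Kernel>> <> <null> <1080575211904> <BEA-000360> " +
      "<Server started in RUNNING mode>"

    // Capture the text between '<' and '>' markers (good enough for the fixed fields used below).
    val fields = """<(.*?)>""".r.findAllMatchIn(line).map(_.group(1)).toVector
    val timestamp = fields(0)
    val severity  = fields(1)
    val messageId = fields(fields.length - 2)
    println(s"$timestamp | $severity | $messageId")
  }
}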










Monday, September 28, 2015

Agile Release process time lines milestones

Project Review

Release Branching 

Hard Lock
Release Freeze

Module Production1

Module production2

Module production3

Support Engineering Roles and Responsibilities, if you're looking for a job in this area

This is XXX. I have around 10 years of experience in Java and J2EE technologies. I started my career at XX as a Java engineer on a B2B application and worked there around 3 years; then I moved to XXX as an application support engineer for one of the main clients, "XXX", where I worked around 6 months; then I moved to XXX as an application designer and worked around 3 years; later I moved to XXX as an application support developer and worked around 4 years; then I joined XXX working on the release and CD process under an agile system, coordinating all the developers and the continuous delivery pipeline to handle any deployment issues; if there is any blocker or critical piece, we have to get the engineer in to fix the issue.

What sort of support were you involved in at XXX, and what is your role as a support engineer?

At XXX we have a business-to-customer application in which each customer can book hotels associated with XXX around the globe. If a customer has any issue with loyalty points and bonuses, they raise a ticket and assign it to us to fix; this falls under L2 (line 2 support), where we have to apply DB fixes, correct a customer's unique id, and so on.

At XXX I was more involved in leading a team which supported the entire offshore effort.

XXX owns a B2B enterprise application with 256 active products along with a few options (1, 2, 3, and 4); around 8000 customers hit this application and place orders. The application has about 20 systems communicating internally; while placing orders, customers can hit issues and get stuck in the flow, so the end customer calls the SSRC team and they create tickets.

L1 support is SSRC (sales support resource center); they take the calls from end customers and raise the tickets.

L2 support: pick up all the tickets raised against the name of our application, i.e. the XXX application; create a rule in the SSRC tool so notifications go to all the teams; assign tickets to the teams and send a summary of which ticket belongs to which owner, assigned based on functionality.

We have product specialists; I am an expert on Product 1 and Product 2 and every major product. If the issue exists, I have to provide the resolution and close it.

If any inputs are required from the customer, we have to route it through the SSRC L1 team to make sure we get the information from the customer to resolve the issue.

L3 support: for anything related to repeated issues or code issues, we have to raise a defect and assign it to the Dev team.

L4 support: any functional changes or business-required changes have to be assigned to the core development teams.

My roles:

Assign all the tickets to the team, based on product and functionality.

Send notification emails to the teams for work items.

Fixed as many issues as possible related to customer SLAs.

Created a customer checklist to run through before calling the SSRC team.

Distributed a checklist to L1 support; for customer-training issues, prepared a separate list and asked them to follow up on those items.

Set up meetings with customers and explained basic functionality issues relevant to the customer.

Frequently applied DB fixes.

Created product-based customer checklists shared across the teams.

Led EOM (end of month) calls directly with end customers to close all the issues on time.

Sent all the issue reports at EOM and presented them to higher management.

Sent the weekly fixes report showing who worked on what and how many tickets they closed; if any support engineer gave a wrong resolution, we had to correct and update the resolution.

Apache Storm and Kafka in real time Scenarios...Hadoop Eco System

We need to come to know all the terminology here:

Topology -- a combination of spouts and bolts 
Spout -- the source of the stream; it reads from an external system (for example Kafka) and emits tuples into the topology
Nimbus -- the daemon which runs as the master, comparable to the name node / job tracker 
Bolt -- the processing unit that consumes tuples, comparable to the map/reduce tasks a task tracker runs 
Zookeeper -- mediator for all the configuration and coordination between Nimbus and the worker (supervisor) nodes
Redis -- key-value store 


For doing the practical programs you need to know all the related jars, and some environment setup is required; a small sketch follows below.
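
For orientation, here is a hedged Scala sketch against the (pre-1.0) backtype.storm API that wires the terms above together: a spout that emits sentences, a bolt that counts words, and a topology submitted to a local cluster. The class names and the canned sentence are made up for illustration.

import java.util.{Map => JMap}
import backtype.storm.{Config, LocalCluster}
import backtype.storm.spout.SpoutOutputCollector
import backtype.storm.task.{OutputCollector, TopologyContext}
import backtype.storm.topology.base.{BaseRichBolt, BaseRichSpout}
import backtype.storm.topology.{OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.tuple.{Fields, Tuple, Values}
import backtype.storm.utils.Utils

// Spout = source of the stream (here a canned sentence; in real life, e.g., a Kafka topic).
class SentenceSpout extends BaseRichSpout {
  private var out: SpoutOutputCollector = _
  override def open(conf: JMap[_, _], ctx: TopologyContext, collector: SpoutOutputCollector): Unit =
    out = collector
  override def nextTuple(): Unit = {
    out.emit(new Values("storm kafka storm"))
    Utils.sleep(1000)
  }
  override def declareOutputFields(d: OutputFieldsDeclarer): Unit = d.declare(new Fields("sentence"))
}

// Bolt = processing step (splits sentences and keeps running word counts).
class WordCountBolt extends BaseRichBolt {
  private val counts = scala.collection.mutable.Map[String, Long]().withDefaultValue(0L)
  override def prepare(conf: JMap[_, _], ctx: TopologyContext, collector: OutputCollector): Unit = ()
  override def execute(t: Tuple): Unit =
    t.getStringByField("sentence").split("\\s+").foreach { w =>
      counts(w) += 1
      println(s"$w -> ${counts(w)}")
    }
  override def declareOutputFields(d: OutputFieldsDeclarer): Unit = ()
}

object WordCountTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("sentences", new SentenceSpout, 1)
    builder.setBolt("counter", new WordCountBolt, 2).shuffleGrouping("sentences")
    // LocalCluster stands in for Nimbus + supervisors on one machine; on a real cluster,
    // Nimbus distributes the topology and ZooKeeper coordinates the daemons.
    new LocalCluster().submitTopology("word-count", new Config, builder.createTopology())
  }
}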

in..progress...

Agile process in Hadoop real world

Hello there,

Agile process in real time:

Two releases per month if you have multiple applications; it depends on the domain and organization.

Example:

module 1

module 2

module 1 has 10 applications

module 2 has 20 applications

module 1 is, say, the legacy one

module 2 is the most recently overhauled one

In one month we need to release both module 1 and module 2.

Requirements

Analysis

Change request specification

Development

Lean labs for faster delivery

Testing / QA

Regression, black-box, manual, and so on

Performance testing, capturing defects, following up with the dev teams, and getting them fixed

Daily triage bug-fixing meetings

Some defects are deferred to a future release

Every piece of functionality has a property (feature flag); if anything goes wrong they can turn it off on the fly

Release day, if smooth, has 2 sub-parts: Cell A and Cell B

Cell A is released in the morning and Cell B in the evening

All of those stages run in parallel, and everything finishes within a defined time frame.

In Hadoop they can build a jar along with the UI components and deploy it to the servers; nothing special is required for any other functionality.

If you have any doubts please add a comment and let me know!!

Friday, September 25, 2015

Real Time Hadoop Architecture

Real Time Hadoop Architecture.

Technology list in hadoop eco system.

Cluster in real time  


Using the components below you can build a single workflow for getting the logs to show up as metrics.


HDFS
MAP Reduce 
Hive 
Hbase 
Solr
Storm
Flume
Kafka
ZooKeeper
Redis 


Solr -- indexing 
Kafka -- distributed messaging 
Storm -- processing, like map reduce programs but for real-time events 
Flume cluster -- getting the JVM and application logs in real time
Zookeeper -- controls all the configuration-related coordination for the jobs 
Hbase -- stores the data in a column-oriented way
Hive -- for querying your metrics and displaying them in a UI

Along with this, people normally club it with the CDH (Cloudera) or HDP (Hortonworks) distributions.

still in progress....

Looking for a job in Hadoop/Big Data? Follow me!!

Greetings All, 

Thanks for taking the time to read this article; let's get to the meaningful part. This blog is useful for anyone who is looking for a job in Hadoop/Big Data technologies. 

Even if you have finished a course or you are a subject matter expert, you still have to answer every question from the interviewer.

To get a job in Hadoop you should know the things below.

Hadoop Architecture -- this comes from reading and learning

HDFS - Storing
Map reduce  -- Processing

Nowadays pure batch processing is less and less in demand, but you should still be strong on the architecture of the core parts, like:

NameNode -- Master node


SNN -- master
JOBTracker -- master
Task Tracker -- slave
Data Node  -- slave

NOTE 1:

In a real-world setup you only talk to the name node machine; you do not need to know the other machines and daemons if you are a Pig/Hive developer.



1) Tell me about yourself?  
Answer: you know much more about this than I do, but make sure to be confident.

2) Explain about your company use case ?
Answer :  

3) Explain about your job work flow process?

4) What is your hadoop cluster size?

5) Data retention policy ?

6) what is the size of you each node hardware configuration?

7)  Cluster capacity of Data each node?

8) Per day how much data your handling?

9) What type of data your using in your cluster ? 

10) How to store the data into your hadoop cluster? what is the format ?

11) What is Hadoop how you can define in your way?

12) Which version of hadoop eco are you using ?

13) What is the difference between MR1 and MR2 ? or hadoop1 or hadoop2?

14)  what are the core parts of hadoop?

15) what is the cluster throughput ?

16) what is the cluster benchmarks?

17) what are the tools are you using in your company?

18) explain the map reduce flow?

19) did you write any map reduce programs ?

20) what is the input format did you use in your use case?

21) what are the monitoring tools did you use?

22) did you use PUPPET? Nagios in your project?

23) how do you know about job failures, apart from logs and the alert mechanism?

24) how many jobs are running daily how much data can handle?

25) explain about your role in your architecture ?

26) do you have any idea about development/admin, depending on your position?

27) did you write a reducers in your programs?

28) how to write custom input format ?

29) how to read the mail contents in your HDFS?

30) difference between PIG and HIVE?

31) what is HDFS federation ?

32) how to tune the low performed jobs ?

33) how do you troubleshoot the cluster if you're not an admin?

34) how to tune the map reduce programs ?

35) should you explain what is PIG?

36) what is GROUP and COGROUP?

37)   did you write UDF's ?

38) explain one UDF how you can make it done and why you used?

39) did you use any external tables in HIVE?

40) did you create any table in Hive /Pig  explain the syntax?

41) what is Flume?

42) in a cluster with 4 nodes, do we install a single Flume agent or one per node?

43) what is the channel in flume did you use?

44) can you set up your own cluster if i give the machines?

45) can you explain your end-to-end flow, and if a failure happens among millions of jobs, how do you trace it?

Still in progress...