Thursday, October 22, 2015

Why do we use Kafka if Flume can do the same things?

Let's talk more about this, since many people seem to face this question in interviews.

Kafka is used to bridge the architectural gaps between Flume and Storm. We have found three so far:

a) The Flume sink uses a push mechanism for sending events, while the Storm spout uses a pull mechanism for consuming events.

b) Because the Flume sink pushes, it must know in advance which particular node events need to be pushed to, yet the nodes that run the different stages of a Storm topology are not known in advance.

c) Kafka is based on the publish-subscribe programming paradigm: in our architecture the producers (Flume) push messages and the consumers (Storm) pull messages, so Kafka sits naturally between them (see the sketch below).
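To make the push/pull contrast concrete, here is a minimal sketch using the modern Kafka Java client (the broker address, topic name, and group id are assumptions, and this client API is newer than the setup described above):

// The "push" half (what the Flume sink does) and the "pull" half
// (what the Storm spout does), decoupled by a Kafka topic.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaBridgeSketch {
    public static void main(String[] args) {
        // Producer side: push an event into the "events" topic.
        Properties prod = new Properties();
        prod.put("bootstrap.servers", "localhost:9092");
        prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
            producer.send(new ProducerRecord<>("events", "key-1", "an event payload"));
        }

        // Consumer side: pull events at the consumer's own pace.
        Properties cons = new Properties();
        cons.put("bootstrap.servers", "localhost:9092");
        cons.put("group.id", "storm-spout-group");
        cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("auto.offset.reset", "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons)) {
            consumer.subscribe(Collections.singletonList("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.value());
            }
        }
    }
}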


Wednesday, October 21, 2015

Interview Questions for Experienced Hadoop Developers

Howdy All,

If you go through these questions, I bet you will crack the interview. Please make sure you clearly understand each question and prepare your own answers for it. I want to share all these recent interview questions that I face day to day.

1) Explain your architecture.
2) What challenges did you face? Give at least 3.
3) What are the main Storm cluster components?
4) What happens if the Nimbus node dies?
5) How do you do versioning in Cassandra?
6) What is Kafka, and what is the difference between Kafka and Flume?
7) What is the signature of the Mapper? (see the sketch after this list)
8) What are the MapReduce phases and their importance?
9) How do you give your own input to the mapper directly if a main method is defined?
10) What are the daemons in a Hadoop cluster?
11) What is the difference between Hadoop 1 and Hadoop 2?
12) If you migrated to Hadoop 2, did you see any differences?
13) Which versions are you using in your project?
14) What is the format of the data?
15) What is the data size you are handling daily?
16) How many events do you fire in a second?
17) How would you rate yourself at building the same kind of cluster?
18) How do you make custom input for your mapper?
19) How many mappers and reducers will run for a particular size of data?
20) What is the syntax to run a mapper program from the command line?
21) Did you work on Oozie in your current project?
22) Did you use Splunk, and what is your role in Splunk?
23) What is the difference between Set and List?
24) What is the Struts flow?
25) What is the difference between hashCode and equals, and what happens if you override them?
26) What is the difference between the Comparator and Comparable interfaces?
27) What is a custom UDF, and what UDFs have you written?
28) How do you use UDFs in your projects?
29) Which version of Java are you currently using?
30) Were you involved in any data conversion, such as to .csv or JSON formats?
31) What are the limitations of Hive?
32) If I have 1000 records, how do I insert/delete them using Sqoop?
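For question 7, here is a minimal word-count sketch showing the classic Hadoop Mapper signature (the newer org.apache.hadoop.mapreduce API; class and field names are just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The four type parameters are input key, input value, output key, output value.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line; value is the line itself.
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // emit (word, 1) for the reducer to sum
        }
    }
}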

Friday, October 2, 2015

Kafka Cluster, More Precisely

· Apache Kafka is a distributed publish-subscribe messaging system open-sourced by LinkedIn.

· It comes with linear scalability and high performance, and data is replicated and partitioned.

· One might ask why Kafka is in the design when Flume provides similar functionality.

· Kafka is used to bridge the architectural gaps between Flume and Storm.

· Three examples of those gaps:

o The Flume sink uses a push mechanism for sending events, while the Storm spout uses a pull mechanism for consuming events.

o Work submitted to Storm is known as a topology. Different stages of a Storm topology (similar to MapReduce jobs) run on different nodes, and the Storm supervisors determine which nodes in the cluster run the different stages of a topology. Essentially, the list of nodes where a particular topology will run is not known in advance.

o The Flume sink uses a push mechanism, so it must know in advance where events need to be pushed to, which is not the case here.

o In the POC we did not use Kafka; we developed a custom Flume sink to push events directly to Storm, since only a single-node cluster was being used. That approach will not work once the Storm cluster has more than one node.

· Kafka, on the other hand, is based on the publish-subscribe programming paradigm. Producers push messages and consumers pull messages. In our design there is a push system (Flume) on one side and a pull system (Storm) on the other, so Kafka is a perfect fit to bridge this gap and resolves both issues mentioned above (see the sketch below).
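On the Storm side, here is a minimal sketch wiring a KafkaSpout into a topology with the storm-kafka module of that era (pre-Apache package names; the ZooKeeper address, topic, and ids are assumptions):

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaStormSketch {
    public static void main(String[] args) {
        // The spout discovers partitions via ZooKeeper and pulls from the brokers.
        BrokerHosts zkHosts = new ZkHosts("localhost:2181");
        SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "events", "/kafka-spout", "event-reader");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        // Bolts that process the pulled events would be attached here.

        new LocalCluster().submitTopology("kafka-bridge", new Config(), builder.createTopology());
    }
}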

Thursday, October 1, 2015

Apache Solr Architecture In Real World

What is Solr?

Apache Solr is a fast open-source Java search server. Solr enables you to easily create search engines that search websites, databases, and files.

Solr is the popular, blazing-fast open source enterprise search platform from the Apache Lucene project. Under the hood, Solr is powered by Lucene, a powerful open-source full-text search library.

• Doug Cutting created Lucene in 1999. It was recognized as a top-level Apache Software Foundation project in 2005.

• Yonik Seeley created Solr in 2004. It was recognized as a top-level Apache Software Foundation project in 2007.

Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.

Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty.

Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.

Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture for when more advanced customization is required.


Solr makes it easy to add the capability to search through the online store through the following steps:

• Define a schema. The schema tells Solr about the contents of the documents it will be indexing. In the online store example, the schema would define fields for the product name, description, price, manufacturer, and so on. Solr's schema is powerful and flexible and allows you to tailor Solr's behavior to your application. See Documents, Fields, and Schema Design for all the details.

• Deploy Solr to your application server.

• Feed Solr the documents for which your users will search (a sketch follows below).

• Expose search functionality in your application.
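As a sketch of the "feed Solr documents" step, here is a minimal SolrJ example (it uses a later SolrJ client API than this post's era, and the URL, core name "store", and fields are assumptions):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexProductSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at a hypothetical "store" core.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/store").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "SKU-1001");          // unique key defined in the schema
            doc.addField("name", "Wireless Mouse");  // product name field
            doc.addField("price", 19.99);            // price field
            doc.addField("manufacturer", "Acme");    // facet-friendly field
            solr.add(doc);
            solr.commit(); // make the document searchable
        }
    }
}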

Solr is able to achieve fast search responses because, instead of searching the text directly, it searches an index.

This is like retrieving the pages in a book related to a keyword by scanning the index at the back of the book, as opposed to searching every word of every page of the book.

This type of index is called an inverted index, because it inverts a page-centric data structure (page -> words) into a keyword-centric data structure (word -> pages).

Solr stores this index in a directory called index in the data directory.
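Here is a toy sketch of that inversion, purely illustrative:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // Page-centric structure: page number -> page text.
        Map<Integer, String> pages = new HashMap<>();
        pages.put(1, "solr is powered by lucene");
        pages.put(2, "lucene is a search library");

        // Invert it: word -> set of pages containing the word.
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> page : pages.entrySet())
            for (String word : page.getValue().split("\\s+"))
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(page.getKey());

        // Lookup is now a direct map access instead of a scan of every page.
        System.out.println(index.get("lucene")); // prints [1, 2]
    }
}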

How Solr represents data

In Solr, a Document is the unit of search. An index consists of one or more Documents, and a Document consists of one or more Fields.

In database terminology, a Document corresponds to a table row, and a Field corresponds to a table column.

When data is added to Solr, it goes through a series of transformations before being added to the index. This is called the analysis phase. Examples of transformations include lower-casing, removing word stems, etc. The end result of the analysis is a series of tokens, which are then added to the index. Tokens, not the original text, are what is searched when you perform a search query.

Indexed fields are fields that undergo the analysis phase and are added to the index. If a field is not indexed, it cannot be searched on.
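A minimal sketch of the analysis phase, using Lucene's StandardAnalyzer directly (recent Lucene versions; the field name is arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // Run raw text through the analysis chain (tokenize, lower-case, ...).
        try (TokenStream stream = analyzer.tokenStream("description",
                new StringReader("Blazing-Fast Search Servers"))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken())
                System.out.println(term.toString()); // blazing, fast, search, servers
            stream.end();
        }
    }
}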

Solr Features

• Keyword Searching – queries of terms and boolean operators
• Ranked Retrieval – results sorted by relevancy score (descending order)
• Snippet Highlighting – matching terms emphasized in results
• Faceting – ability to apply filter queries based on matching fields
• Paging Navigation – limits fetch sizes to improve performance
• Result Sorting – sorts documents based on field values
• Spelling Correction – suggests corrected spellings of query terms
• Synonyms – expands queries based on a configurable definition list
• Auto-Suggestions – presents a list of possible query terms
• More Like This – identifies other documents that are similar to one in a result set
• Geo-Spatial Search – locates and sorts documents by distance
• Scalability – ability to break a large index into multiple shards and distribute indexing and query operations across a cluster of nodes
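A small sketch of several of these features (keyword search, faceting, highlighting, paging, sorting) through SolrJ; the core and field names are assumptions carried over from the earlier indexing sketch:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/store").build()) {
            SolrQuery query = new SolrQuery("wireless mouse");  // keyword search
            query.setFacet(true);
            query.addFacetField("manufacturer");                // faceting on a field
            query.setHighlight(true);
            query.addHighlightField("name");                    // snippet highlighting
            query.setRows(10);                                  // paging: fetch size
            query.setSort("price", SolrQuery.ORDER.asc);        // result sorting

            QueryResponse response = solr.query(query);
            response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("name")));
        }
    }
}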

A complete Architecture (diagram not shown)

Indexing Process (diagram not shown)

WebLogic Server Logging

Understanding WebLogic Logging Services

– WebLogic logging services provide facilities for writing, viewing, filtering, and listening for log messages.

– These log messages are generated by WebLogic Server instances, subsystems, and Java EE applications that run on WebLogic Server or in client JVMs.

• WebLogic Server subsystems use logging services to provide information about events.

– Ex. deployment of new applications or the failure of one or more subsystems

• Each WebLogic Server instance maintains a server log.

• Logging services collect messages that are generated on multiple server instances into a single, domain-wide message log.

• The domain log provides the overall status of the domain.
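Applications can write their own messages into the server log; here is a minimal sketch using WebLogic's NonCatalogLogger (the subsystem label and surrounding class are hypothetical, and the API details should be checked against your WebLogic version):

import weblogic.logging.NonCatalogLogger;

public class OrderService {
    // Messages logged this way appear in the server log under the given subsystem name.
    private static final NonCatalogLogger LOG = new NonCatalogLogger("OrderService");

    public void placeOrder(String orderId) {
        LOG.info("Placing order " + orderId);
        try {
            // ... business logic ...
        } catch (RuntimeException e) {
            LOG.error("Order " + orderId + " failed", e);
            throw e;
        }
    }
}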







Log Message Format

Here is an example of a message in the server log file:

####<Sept 22, 2004 10:46:51 AM EST> <Notice> <WebLogicServer> <MyComputer> <examplesServer> <main> <<WLS Kernel>> <> <null> <1080575211904> <BEA-000360> <Server started in RUNNING mode>

• In this example, the message attributes are: Locale-formatted Timestamp, Severity, Subsystem, Machine Name, Server Name, Thread ID, User ID, Transaction ID, Diagnostic Context ID, Raw Time Value, Message ID, and Message Text.




Time stamp: The time and date when the message originated. The Java Virtual Machine (JVM) that runs each WebLogic Server instance refers to the host computer's operating system for information about the local time zone and format.

Severity: Indicates the degree of impact or seriousness of the event reported by the message.

Subsystem: Indicates the subsystem of WebLogic Server that was the source of the message; for example, the Enterprise Java Bean (EJB) container or the Java Messaging Service (JMS).

Machine Name, Server Name, Thread ID: Identify the origins of the message. Server Name is the name of the WebLogic Server instance, Machine Name is the DNS name of the computer, and Thread ID is the ID that the JVM assigns to the thread in which the message originated.

User ID: The user ID under which the associated event was executed.

Transaction ID: Present only for messages logged within the context of a transaction.

Diagnostic Context ID: Context information used to correlate messages coming from a specific request or application.

Raw Time Value: The timestamp in milliseconds.

Message ID: A unique six-digit identifier. All message IDs that WebLogic Server system messages generate start with BEA- and fall within the numerical range 0-499999.


Message Text: A description of the event or condition.
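As a toy illustration of this format, the sketch below splits the example line into its top-level angle-bracketed fields (purely illustrative; hypothetical class name, and real logs are better read via the WebLogic admin tools):

import java.util.ArrayList;
import java.util.List;

public class LogLineSketch {
    public static void main(String[] args) {
        String line = "####<Sept 22, 2004 10:46:51 AM EST> <Notice> <WebLogicServer> "
                + "<MyComputer> <examplesServer> <main> <<WLS Kernel>> <> <null> "
                + "<1080575211904> <BEA-000360> <Server started in RUNNING mode>";

        // Collect top-level <...> fields, honoring nesting such as <<WLS Kernel>>.
        List<String> fields = new ArrayList<>();
        int depth = 0, start = -1;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '<' && depth++ == 0) start = i + 1;
            else if (c == '>' && --depth == 0) fields.add(line.substring(start, i));
        }
        System.out.println("Timestamp:  " + fields.get(0));   // Sept 22, 2004 10:46:51 AM EST
        System.out.println("Severity:   " + fields.get(1));   // Notice
        System.out.println("Message ID: " + fields.get(10));  // BEA-000360
        System.out.println("Message:    " + fields.get(11));  // Server started in RUNNING mode
    }
}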