Thursday, October 1, 2015

Apache Solr Architecture In Real World

What is Solr?

Apache Solr is a fast, open-source search server written in Java. Solr lets you easily build search engines that search websites, databases, and files.

Solr is the popular, blazing-fast open-source enterprise search platform from the Apache Lucene project. Under the hood, Solr is powered by Lucene, a powerful open-source full-text search library.

• Doug Cutting created Lucene in 1999. It was recognized as a top-level Apache Software Foundation project in 2005.
• Yonik Seeley created Solr in 2004. It was recognized as a top-level Apache Software Foundation project in 2007.

Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.

Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty.

Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.
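
As a rough illustration of the HTTP API, the sketch below builds a query URL for Solr's /select request handler. The host, the port (8983 is Solr's usual default), and the core name "products" are assumptions made for this example, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr instance with a core named "products".
SOLR_BASE = "http://localhost:8983/solr/products"


def select_url(query, rows=10):
    """Build a URL for the /select handler that asks for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


url = select_url("name:laptop")
print(url)
```

Any HTTP client in any language could then issue a GET request against a URL of this shape and parse the JSON that comes back.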

Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture for when more advanced customization is required.

       


      

Consider an online store as an example. Solr makes it easy to add search capability to the store through the following steps:

• Define a schema. The schema tells Solr about the contents of the documents it will be indexing. In the online store example, the schema would define fields for the product name, description, price, manufacturer, and so on. Solr's schema is powerful and flexible and allows you to tailor Solr's behavior to your application. See Documents, Fields, and Schema Design for all the details.
• Deploy Solr to your application server.
• Feed Solr the documents for which your users will search.
• Expose search functionality in your application.
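
The "feed Solr the documents" step can be sketched as follows. The product records and the "id" field are hypothetical; the other field names follow the schema fields mentioned in the steps. The sketch only builds the JSON request body and does not contact a server.

```python
import json

# Hypothetical product records for the online store example.
products = [
    {"id": "p1", "name": "Trail Running Shoes", "price": 89.99, "manufacturer": "Acme"},
    {"id": "p2", "name": "Road Bike Helmet", "price": 54.50, "manufacturer": "Velo"},
]


def update_body(docs):
    """Serialize documents as a JSON array, the shape Solr's /update handler accepts."""
    return json.dumps(docs)


body = update_body(products)
# The body would be POSTed with Content-Type: application/json to
# <solr-base>/update?commit=true (the base URL depends on your deployment).
print(body)
```

Each dictionary becomes one Solr document, and each key must match a field defined in the schema.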

Solr achieves fast search responses because, instead of searching the text directly, it searches an index.

This is like retrieving pages in a book related to a keyword by scanning the index at the back of the book, as opposed to searching every word of every page.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).

Solr stores this index in a directory called index in the data directory.
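
To make the inversion concrete, here is a minimal sketch in plain Python. It uses toy data and is not how Lucene stores its index on disk; it only shows the page->words to word->pages transformation.

```python
from collections import defaultdict

# Page-centric structure: page number -> words on that page (toy data).
pages = {
    1: ["solr", "is", "a", "search", "server"],
    2: ["lucene", "is", "a", "search", "library"],
    3: ["solr", "uses", "lucene"],
}


def invert(pages):
    """Turn page -> words into word -> sorted list of pages."""
    index = defaultdict(set)
    for page, words in pages.items():
        for word in words:
            index[word].add(page)
    return {word: sorted(page_set) for word, page_set in index.items()}


index = invert(pages)
print(index["solr"])    # [1, 3]
print(index["search"])  # [1, 2]
```

Looking up a word is now a single dictionary access, rather than a scan over every page.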

How Solr represents data

In Solr, a Document is the unit of search and indexing.

An index consists of one or more Documents, and a Document consists of one or more Fields. In database terminology, a Document corresponds to a table row, and a Field corresponds to a table column.

When data is added to Solr, it goes through a series of transformations before being added to the index. This is called the analysis phase. Examples of transformations include lower-casing and stemming (reducing words to their root form). The end result of the analysis is a series of tokens, which are then added to the index. Tokens, not the original text, are what is searched when you perform a search query.
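
The analysis phase can be approximated in a few lines of Python. This is a toy sketch: the suffix list and the `stem` helper are made up for illustration and are much cruder than the tokenizers and stemming filters Solr ships with.

```python
import re

# Toy suffix list; a real stemmer (for example Porter) is far more careful.
SUFFIXES = ("ing", "es", "s")


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Split on non-alphanumerics, lower-case, then stem each token."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [stem(token.lower()) for token in tokens]


print(analyze("Searching Indexed Documents"))  # ['search', 'indexed', 'document']
print(analyze("searches"))                     # ['search']
```

Because the same analysis is applied to both the indexed text and the query, a query for "searches" can match a document that contains "Searching".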

Indexed fields are fields that undergo the analysis phase and are added to the index. If a field is not indexed, it cannot be searched on.
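
The effect of the indexed flag can be sketched with a small in-memory model. The schema dictionary, the "sku" field, and the helper functions are hypothetical stand-ins, not Solr's actual schema format or API.

```python
from collections import defaultdict

# Hypothetical schema: only fields flagged as indexed are searchable.
schema = {"name": {"indexed": True}, "sku": {"indexed": False}}

docs = [
    {"id": 1, "name": "road bike", "sku": "rb-100"},
    {"id": 2, "name": "mountain bike", "sku": "mb-200"},
]


def build_index(docs, schema):
    """Map (field, token) -> set of document ids, skipping non-indexed fields."""
    index = defaultdict(set)
    for doc in docs:
        for field, options in schema.items():
            if not options["indexed"]:
                continue
            for token in doc[field].lower().split():
                index[(field, token)].add(doc["id"])
    return index


def search(index, field, term):
    """Return the sorted ids of documents whose field contains the term."""
    return sorted(index.get((field, term.lower()), set()))


index = build_index(docs, schema)
print(search(index, "name", "bike"))   # [1, 2]
print(search(index, "sku", "rb-100"))  # []
```

The "sku" values never reach the index, so a search on that field finds nothing even though the value exists in a document.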

Solr Features

• Keyword Searching – queries of terms and boolean operators
• Ranked Retrieval – sorted by relevancy score (descending order)
• Snippet Highlighting – matching terms emphasized in results
• Faceting – ability to apply filter queries based on matching fields
• Paging Navigation – limits fetch sizes to improve performance
• Result Sorting – sort the documents based on field values
• Spelling Correction – suggest corrected spelling of query terms
• Synonyms – expand queries based on configurable definition list
• Auto-Suggestions – present list of possible query terms
• More Like This – identifies other documents that are similar to one in a result set
• Geo-Spatial Search – locate and sort documents by distance
• Scalability – ability to break a large index into multiple shards and distribute indexing and query operations across a cluster of nodes
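
Several of these features are switched on through request parameters. The sketch below maps paging, faceting, sorting, and highlighting choices onto Solr's common query parameters (q, start, rows, sort, facet, facet.field, hl, hl.fl). The function name and the field names "manufacturer", "price", and "description" are assumptions for illustration.

```python
from urllib.parse import urlencode


def search_params(query, page=1, rows=10, facet_field=None, sort=None,
                  highlight_field=None):
    """Translate UI-level choices into Solr request parameters."""
    # Paging: "start" is the zero-based offset of the first result to return.
    params = {"q": query, "start": (page - 1) * rows, "rows": rows}
    if facet_field:
        params["facet"] = "true"
        params["facet.field"] = facet_field
    if sort:
        params["sort"] = sort
    if highlight_field:
        params["hl"] = "true"
        params["hl.fl"] = highlight_field
    return params


params = search_params("laptop", page=3, facet_field="manufacturer",
                       sort="price asc", highlight_field="description")
print(params["start"])  # 20
print(urlencode(params))
```

The encoded string would be appended to a /select URL; features that are left unset simply add no parameters.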

A complete Architecture 





Indexing Process




