Full Text Search using Apache Lucene (Part-II)

In this post, I shall discuss what a Lucene index is and how to index and analyse data. This is a continuation of my earlier post, Full Text Search using Apache Lucene (Part-I).

 

Indexing Data

The first step in building a search application using Lucene is to index the data that needs to be searched. Any data available in textual format can be indexed using Lucene. Lucene can also be used with any data source, as long as textual information can be extracted from it.

Indexing is the process of converting textual data into a format that facilitates rapid searching. A simple example is the index at the end of a book, which points the reader to the pages on which a topic appears. Lucene stores the input data in a data structure called an inverted index, which is kept on the file system or in memory as a set of index files. It lets users perform fast keyword lookups.
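
To make the idea of an inverted index concrete, here is a minimal sketch in plain Java. It is not how Lucene implements its index files; the class name InvertedIndexSketch and the sample patent titles are made up for illustration. It only shows the core idea: each term maps to the set of documents that contain it, so a keyword lookup does not need to scan every document.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene or Hibernate Search.
class InvertedIndexSketch {
    // term -> sorted set of ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Splits the text on non-word characters and records docId under each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of all documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Wireless charging pad");
        index.add(2, "Wireless antenna design");
        index.add(3, "Solar charging station");
        System.out.println(index.lookup("wireless")); // prints [1, 2]
        System.out.println(index.lookup("charging")); // prints [1, 3]
    }
}
```

A lookup is a single map access regardless of how many documents were added, which is what makes keyword search on an inverted index fast.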

 

The following snippet configures Hibernate Search to store the index on the file system.

<prop key="hibernate.search.default.directory_provider">filesystem</prop>
<prop key="hibernate.search.default.indexBase">/var/gracular/index</prop>

 

The annotations below tell the Hibernate application to create a Lucene index for the database table patents.

@Entity
@Indexed(index="patent")
@Table(name="patents",catalog="gracular_2")

 

Internally, indexing data involves several steps.

  1. Analysis:

Analysis is the process of converting text data into the fundamental unit of searching, which is called a term. During analysis, the text data goes through multiple operations: extracting the words, removing common words, ignoring punctuation, reducing words to their root form, changing words to lowercase, etc. Analysis happens just before indexing and query parsing. Analysis converts text data into tokens, and these tokens are added as terms in the Lucene index.

Lucene comes with various built-in analyzers, such as SimpleAnalyzer, StandardAnalyzer, StopAnalyzer, SnowballAnalyzer, and more. These differ in the way they tokenize the text and apply filters. Because analysis removes words before indexing, it decreases the index size, but it can have a negative effect on query precision. In addition to the existing analyzers, users can define their own custom analyzers.
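
The operations listed above can be sketched in plain Java as follows. This is only an illustration of the idea, not the code of any Lucene analyzer: the class AnalysisSketch, its tiny stop-word list, and the crude "strip a trailing s" rule that stands in for real stemming are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class, not a Lucene analyzer.
class AnalysisSketch {
    // A tiny, made-up stop-word list; real analyzers ship much larger ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "for", "and"));

    // Turns raw text into the list of terms that would be added to the index.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Splitting on anything that is not a letter or digit drops punctuation.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue; // skip empty tokens and common words
            }
            terms.add(stem(term));
        }
        return terms;
    }

    // Crude stand-in for stemming: drops one trailing "s", but not from "ss".
    static String stem(String term) {
        boolean plural = term.length() > 3 && term.endsWith("s") && !term.endsWith("ss");
        return plural ? term.substring(0, term.length() - 1) : term;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Design of Antennas, for Wireless Devices!"));
        // prints [design, antenna, wireless, device]
    }
}
```

The same function has to be applied to the query text at search time; otherwise a query for "Antennas" would never match the indexed term "antenna".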

The code snippet below creates a custom analyzer that tokenizes terms on the delimiter ','.

@AnalyzerDef(
name="customanalyzer",
tokenizer=@TokenizerDef(factory=PatternTokenizerFactory.class,params=
@Parameter(name="pattern",value=","))
)

  2. Adding data to the index

There are two classes involved in adding text data to the index: Field and Document. Field represents a piece of data queried or retrieved in a search. The Field class encapsulates a field name and its value. Lucene provides options to specify if a field needs to be indexed or analyzed and if its value needs to be stored.

Document is a collection of fields. Lucene also supports boosting documents and fields, which is a useful feature if you want to give importance to some of the indexed data. Indexing a text file involves wrapping the text data in fields, creating a document, populating it with fields, and adding the document to the index using IndexWriter.

The code snippet below adds the patent name column to the index and boosts its relevance by a factor of 5.

@Column(name="patent_name",nullable=false,length=200)
@Index(name="patentName")
@Fields({
@Field,
@Field(name="patentName", analyze=Analyze.NO)
})
@Boost(value=5f)
private String patentName;

 

 

I shall discuss how to search indexed data in my next post.

Full Text Search using Apache Lucene (Part-III)

In this post, I shall discuss how to search indexed data. This is a continuation of my earlier posts, Full Text Search using Apache Lucene (Part-I) and Part-II.

Searching Indexed Data:
Searching is the process of looking up words in the index and finding the documents that contain those words. Hibernate Search provides API methods to perform different types of search on a given keyword.
The code snippet below searches the fields "patentName", "patentNumber" and "inventor" of the Patent entity for a matching keyword.
Query luceneQuery = queryBuilder
.keyword()
.onFields("patentName", "patentNumber", "inventor")
.matching(keyword)
.createQuery();
FullTextQuery hibernateQuery = fullTextSession.createFullTextQuery(luceneQuery, Patent.class);
Fuzziness
Often, a search application needs to tolerate typos and near-miss spellings (the kind of matches one might otherwise handle with Soundex) when retrieving search results. In Lucene this can be achieved with a fuzzy query. The query above can be modified as below to handle such cases.
luceneQuery = queryBuilder
.keyword()
.fuzzy()
.withThreshold(0.8f)
.onFields("patentName", "patentNumber", "inventor")
.matching(keyword)
.createQuery();
withThreshold() specifies the amount of fuzziness: it takes a value between 0 and 1, and a higher value requires a closer match.
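To give an intuition for what a similarity threshold such as 0.8 means, here is a small plain-Java sketch based on Levenshtein edit distance. It is not Lucene's exact scoring formula: the class FuzzySketch and the choice to normalise the distance by the length of the longer string are assumptions made for this illustration.

```java
// Hypothetical illustration class, not Lucene's FuzzyQuery implementation.
class FuzzySketch {
    // Classic Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    // A term matches when its similarity to the keyword reaches the threshold.
    static boolean fuzzyMatches(String term, String keyword, double threshold) {
        return similarity(term, keyword) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(fuzzyMatches("antenna", "antena", 0.8)); // prints true
        System.out.println(fuzzyMatches("antenna", "antler", 0.8)); // prints false
    }
}
```

Under this assumed formula, "antena" is one edit away from the seven-letter "antenna", giving a similarity of about 0.86, so it passes a threshold of 0.8, while "antler" does not.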
Displaying Search Results:
IndexSearcher returns an array of references to ranked search results, such as the documents that match a given query. Customized paging can be built on top of this, and a custom Web or desktop application can be used to display the results. In the snippet below, setFirstResult and setMaxResults select the window of results to return, here the first 50.

FullTextQuery hibernateQuery = fullTextSession.createFullTextQuery(luceneQuery,
Patent.class);
hibernateQuery.setFirstResult(0);
hibernateQuery.setMaxResults(50);
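
As a sketch of how paging can be layered on top of a first-result offset and a maximum result count, the plain-Java helper below maps a zero-based page number to an offset and slices an already ranked list. The class PagingSketch and its method names are hypothetical and only illustrate the arithmetic.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class for the paging arithmetic.
class PagingSketch {
    // Offset of the first result on a zero-based page; this is the value
    // one would pass as the "first result" of a paged query.
    static int firstResult(int pageNumber, int pageSize) {
        return pageNumber * pageSize;
    }

    // Returns the slice of an already ranked list belonging to the page,
    // or an empty list when the page lies beyond the last result.
    static <T> List<T> page(List<T> ranked, int pageNumber, int pageSize) {
        int from = Math.min(firstResult(pageNumber, pageSize), ranked.size());
        int to = Math.min(from + pageSize, ranked.size());
        return ranked.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(page(ranked, 0, 2)); // prints [a, b]
        System.out.println(page(ranked, 2, 2)); // prints [e]
    }
}
```

With a page size of 50, page 0 corresponds to a first result of 0 and a maximum of 50, and page 1 to a first result of 50.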

Sorting Search Results
Sort sort = new Sort(new SortField("patentName", SortField.STRING, false));
hibernateQuery.setSort(sort);
Here, the third argument is the reverse flag: false sorts in ascending order, true sorts in descending order.

Conclusion
Lucene, a very popular open source search library from Apache, provides powerful indexing and searching capabilities for applications. It provides a simple and easy-to-use API that requires minimal understanding of the internals of indexing and searching. In this article, you learned about Lucene architecture and its core APIs.
Lucene has powered various search applications being used by many well-known Web sites and organizations. It has been ported to many other programming languages. Lucene has a large and active technical user community. If you’re looking for an easy-to-use, scalable, and high performing open-source search library, Apache Lucene is a great choice.

Full Text Search using Apache Lucene (Part-I)

Introduction:

Lucene is an open source, highly scalable text search-engine library available from the Apache Software Foundation. Lucene's powerful APIs focus mainly on text indexing and searching. It can be used to build search capabilities for applications such as e-mail clients, mailing lists, Web searches, database search, etc. Web sites such as Wikipedia and LinkedIn have been powered by Lucene.

Lucene has many features. It:

  • Has powerful, accurate, and efficient search algorithms.
  • Calculates a score for each document that matches a given query and returns the most relevant documents ranked by the scores.
  • Supports many powerful query types, such as PhraseQuery, WildcardQuery, RangeQuery, FuzzyQuery, BooleanQuery, and more.
  • Supports parsing of human-entered rich query expressions.
  • Allows users to extend the searching behavior using custom sorting, filtering, and query expression parsing.
  • Uses a file-based locking mechanism to prevent concurrent index modifications.
  • Allows searching and indexing simultaneously.

 

Steps to build an Application using Apache Lucene:

The image below shows the stages involved in building an application using Lucene.

  1. Indexing data
  2. Analysing data
  3. Searching Indexed data.

 

jeevan

I will discuss how to index data to make it searchable, and how to search Lucene-indexed data, in subsequent posts.

 
