Apache Solr
- Search Engine
- Issues with traditional searching
- Performs only substring matching
- It doesn't understand linguistic variations (buy and buying should match)
- It doesn't understand synonyms (buying vs purchasing)
- It doesn't omit unimportant words (a, the, of, ...)
- There is no sense of relevancy in results
- It slows down as the data grows
- How does Solr solve the traditional search issues?
- Solr uses an inverted index, which maps terms (content) to the documents that contain them instead of mapping documents to their content
- The inverted index is at the heart of how search engines work
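To make the idea of mapping content to documents concrete, here is a minimal sketch of an inverted index in plain Python. It is not how Solr or Lucene implement it; the function names and the sample documents are made up for illustration, and terms are produced by a simple lowercase whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "buying a new home",
    2: "new home sales",
    3: "buying guide",
}
index = build_inverted_index(docs)
print(index["home"])              # document ids that contain "home"
print(search(index, "new home"))  # documents that contain both terms
```

A lookup only touches the entries for the query terms, so it does not need to scan every document's text, which is why this structure copes better with growing data than substring matching.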
Characteristics of a search engine
- Text Centric
- Read dominant
- Document Oriented
- Large amount of data
- Flexible schema (a relational database is not flexible, as every row requires the same structure)
Install and Startup
- Download the Solr release archive (.tgz or .zip) and extract it
- bin/solr start -p 8983
- Starts a Java web server listening on port 8983
- Solr is a web application that by default runs in the Jetty web server
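Once the server is started, a quick way to confirm it is listening is to request Solr's system info endpoint. The sketch below uses only the Python standard library; the helper names are hypothetical, and it assumes the default host and the port used above.

```python
import urllib.error
import urllib.request

def system_info_url(host="localhost", port=8983):
    """URL of Solr's system info endpoint on the given host and port."""
    return f"http://{host}:{port}/solr/admin/info/system?wt=json"

def is_solr_up(host="localhost", port=8983, timeout=2.0):
    """Return True if the server answers the system info request with HTTP 200."""
    try:
        with urllib.request.urlopen(system_info_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: treat as "not up".
        return False

print(system_info_url())
```

Calling `is_solr_up()` after `bin/solr start -p 8983` should return True once Jetty has finished starting; it returns False if nothing is listening on that port.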
Solr
- Stores data in documents
- Documents are more flexible than rows in an RDBMS
- Documents can be hierarchical; an RDBMS needs separate tables and rows for that
- Indexing process
Solr Core
- Single physical index
- directory structure
- server
- solr
- core1
- conf -> contains managed-schema.xml & solrconfig.xml
- data -> contains the index files
Solr Document
- Basic unit of information in Solr
- JSON object with key-value pairs
- It is similar to a row in a DBMS table but more flexible
- It can be hierarchical
- It can have an array of values
- It can have objects as values
- It is a denormalized document: all data belonging to an entity is in the same document
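The properties above can be illustrated with a small example document. This is only a sketch: the field names and values are hypothetical, and how Solr actually indexes nested objects depends on the schema and nested-document configuration of the core.

```python
import json

# A hypothetical denormalized "tweet" document: everything about the entity
# lives in one document instead of being spread over several tables.
doc = {
    "id": "tweet-1",
    "text": "Buying a new home this year",
    "hashtags": ["home", "realestate"],          # array of values
    "user": {"id": "user-42", "name": "Alice"},  # object as a value (hierarchical)
}

print(json.dumps(doc, indent=2))
```

In a relational model the user and the hashtags would typically live in separate tables joined by keys; here they travel with the tweet.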
Indexing Process
- The document is converted into a format Solr understands (JSON)
- Create a sub-directory in the configsets directory
- Copy the configuration from the _default directory
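The steps above can be sketched in Python. The helper names, the `tweets` configset and core name, and the base URL are assumptions for illustration. The demo copies a throwaway directory that mimics the configsets layout, and it only builds the HTTP request for Solr's JSON update handler without sending it.

```python
import json
import shutil
import tempfile
import urllib.request
from pathlib import Path

def create_configset(configsets_dir, name):
    """Copy the _default configset into a new sub-directory called `name`."""
    source = Path(configsets_dir) / "_default"
    target = Path(configsets_dir) / name
    shutil.copytree(source, target)  # raises FileExistsError if target already exists
    return target

def build_update_request(base_url, core, docs):
    """Build (but do not send) a POST request carrying the documents as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/solr/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Demo against a temporary directory standing in for the configsets directory.
with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "_default" / "conf"
    conf.mkdir(parents=True)
    (conf / "solrconfig.xml").write_text("<config/>")
    new_set = create_configset(tmp, "tweets")
    copied = sorted(p.name for p in (new_set / "conf").iterdir())
    print(copied)

request = build_update_request("http://localhost:8983", "tweets", [{"id": "tweet-1"}])
print(request.full_url)
```

Against a running server, passing `request` to `urllib.request.urlopen` would send the documents; that call is left out here so the sketch runs without Solr.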
Twitter Search application
Elements of text analysis
- analyzer
- tokenizer
- chain of token filters
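How these three pieces fit together can be sketched as a toy analyzer: a tokenizer splits the text, then each token filter transforms the token list in turn. This is an illustration in plain Python, not Solr's analysis API; the stop-word list, the synonym table, and the naive suffix-stripping "stemmer" are simplified assumptions.

```python
import re

STOP_WORDS = {"a", "an", "the", "of"}
# Toy synonym table, keyed on the already-stemmed form (hypothetical data).
SYNONYMS = {"purchas": "buy"}

def tokenize(text):
    """Tokenizer: split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    """Drop unimportant words."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem_filter(tokens):
    """Strip a trailing 'ing' or 's'; a toy stand-in for a real stemmer."""
    out = []
    for t in tokens:
        if t.endswith("ing") and len(t) > 5:
            t = t[:-3]
        elif t.endswith("s") and len(t) > 3:
            t = t[:-1]
        out.append(t)
    return out

def synonym_filter(tokens):
    return [SYNONYMS.get(t, t) for t in tokens]

FILTER_CHAIN = (lowercase_filter, stop_filter, naive_stem_filter, synonym_filter)

def analyze(text, filters=FILTER_CHAIN):
    """Analyzer: run the tokenizer, then apply each token filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("Buying the house"))
print(analyze("Purchasing a House"))
```

Both inputs reduce to the same tokens, which is how analysis lets "buying" match "purchasing" and ignores words like "the". The order of the chain matters: the synonym lookup here only works because it runs after stemming.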
Faceted Search
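Faceting groups search results by the values of a field and reports a count per value; in Solr this is requested with the `facet=true` and `facet.field` query parameters. The sketch below only imitates the counting step in plain Python over hypothetical result documents.

```python
from collections import Counter

def facet_counts(docs, field):
    """Count how many documents carry each value of `field` (multi-valued fields allowed)."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        counts.update(set(values))  # count each value at most once per document
    return counts

# Hypothetical search results.
results = [
    {"id": "1", "lang": "en", "hashtags": ["solr", "search"]},
    {"id": "2", "lang": "en", "hashtags": ["solr"]},
    {"id": "3", "lang": "fr", "hashtags": ["lucene"]},
]
print(facet_counts(results, "lang").most_common())
print(facet_counts(results, "hashtags")["solr"])
```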
Other points
- If the use case needs fast writes, use a NoSQL database such as Cassandra