CSE 8337 SPRING 2011 PROJECT 3 Richa Arora
Agenda
- Tool Identified and Overview
- Schema.xml
- Tokenization, Stop Words, and Synonym Handling
- Indexing
- Data Import Handler
- Query Format and Matching Documents to a Query
- Function Queries
- Bibliography
TOOL IDENTIFIED & OVERVIEW
Tool Identified & Overview
- SOLR: open source enterprise search platform from the Apache Lucene project
- Purpose: to implement full-text search functionality in a web application
- Commercial websites using SOLR:
  - www.digg.com
  - http://www.whitehouse.gov/ (uses SOLR via Drupal for site search with highlighting and faceting)
  - http://beta.fcc.gov/
  - http://www.netflix.com/
SOLR Application
[Diagram labels: Web server, Database server, Web Application, Document Database, SOLR]
Features and Technology
- Features
  - Full-text search
  - Rich document handling (including MS Word, PDF, RTF, etc.)
  - HTML administration interface
  - Scalable
- Technology
  - Java programming language
  - Lucene Java search library
  - Runs as a search server within a servlet container such as Tomcat or Jetty
Functioning of SOLR
[Diagram labels: Browser-based web interface; Documents; Search Queries; Documents for indexing; Search Results; Solr Server (Searching, Indexing); schema.xml; solrconfig.xml; Index]
Operations in SOLR
- Documents form the basic unit of SOLR
- Documents are composed of fields. Examples:
  - Document for a person: fields such as name, height, age
  - Document for a recipe: fields such as origin, ingredients
- Documents are fed to SOLR
- SOLR extracts the information from the fields in the documents and makes it searchable
- Steps:
  - Field analysis
  - Tokenization
  - Filter application
  - Indexing
SCHEMA.XML
schema.xml
- Governs how SOLR builds indexes from input documents
- Defines the field types and the specific fields that documents can contain
- Describes how SOLR should handle those fields when adding documents to the index or when querying them
Elements of schema.xml
  <schema>
    <types>
    <fields>
    <uniqueKey>
    <defaultSearchField>
    <solrQueryParser defaultOperator>
    <copyField>
  </schema>
Analyzers
- Analyzers examine the text of fields and generate a token stream
- Indexing analyzers: the results of the analysis are added to the index, which defines the set of terms (with attributes such as positions and sizes) for a field
- Querying analyzers: the values being searched for are analyzed, and the resulting terms are matched against those stored in the field's index
  <fieldType name="nametext" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
TOKENIZATION, STOP WORDS, AND SYNONYM HANDLING
Tokenization
- Splits a stream of text into tokens
- Tokens are subsequences of the characters in the text
- A token carries metadata in addition to its text value, such as the location at which the token occurs in the field
  <fieldType name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>
Example
- Standard Tokenizer: treats whitespace and punctuation as delimiters
  Input: "Email: [email protected]"
  Output: "Email", "[email protected]"
- N-Gram Tokenizer: reads the field text and generates n-gram tokens of sizes in the given range (default minimum is 1 and maximum is 2)
  Input: "hello world"
  Output: "h", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d", "he", "el", "ll", "lo", "o ", " w", "wo", "or", "rl", "ld"
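The N-Gram example can be reproduced in a few lines of Python. This is a minimal sketch of the idea, not Solr's own tokenizer: the function name ngram_tokens and its parameters are illustrative assumptions, and it emits all grams of one size before moving to the next size.

```python
from typing import List


def ngram_tokens(text: str, min_gram: int = 1, max_gram: int = 2) -> List[str]:
    """Return every substring of length min_gram..max_gram, shortest sizes first."""
    tokens: List[str] = []
    for size in range(min_gram, max_gram + 1):
        # Slide a window of the current size across the whole text,
        # including spaces, which are not treated as delimiters here.
        for start in range(len(text) - size + 1):
            tokens.append(text[start:start + size])
    return tokens


if __name__ == "__main__":
    print(ngram_tokens("hello world"))
```

For an 11-character input this yields 11 one-character tokens followed by 10 two-character tokens.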
Filters
- Filters take tokens as input from the tokenizer and produce another stream of tokens as output
- Multiple filters can be applied one after the other
Example:
  <fieldType name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory"/>
    </analyzer>
  </fieldType>
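The chaining of a tokenizer and filters can be pictured as function composition. The sketch below is a simplified stand-in written for illustration: simple_tokenizer, lower_case_filter and analyze are hypothetical names, and the regular-expression split only approximates what a real Solr tokenizer does.

```python
import re
from typing import Callable, List

TokenFilter = Callable[[List[str]], List[str]]


def simple_tokenizer(text: str) -> List[str]:
    # Assumption: split on any run of characters that is not a letter or digit.
    return [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]


def lower_case_filter(tokens: List[str]) -> List[str]:
    # Mirrors the role of a lower-casing filter in the chain.
    return [token.lower() for token in tokens]


def analyze(text: str, filters: List[TokenFilter]) -> List[str]:
    """Tokenize once, then pass the token stream through each filter in order."""
    tokens = simple_tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    print(analyze("Welcome to SOLR", [lower_case_filter]))
```

The order of the filter list matters, just as the order of the filter elements does inside an analyzer definition.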
Types of Filters
Result of Filter application
Stop Words Handling
- Stop Filter: discards tokens that are on the given stop words list
- A standard stop words list for English-language text, named stopwords.txt, is included in the SOLR config directory
Example: using the standard stopwords.txt
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>
Tokenizer input: "welcome to the world of Solr"
Tokenizer output / filter input: "welcome"(1), "to"(2), "the"(3), "world"(4), "of"(5), "Solr"(6)
Filter output: "welcome"(1), "world"(2), "Solr"(3)
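The effect of the stop filter on the example sentence can be sketched as follows. The STOP_WORDS set is a hypothetical three-word subset chosen for this example, not the contents of the real stopwords.txt, and positions are simply renumbered after removal, as in the filter output shown on the slide.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical subset of a stop word list, for illustration only.
STOP_WORDS: Set[str] = {"to", "the", "of"}


def stop_filter(tokens: Iterable[str], stop_words: Set[str]) -> List[Tuple[str, int]]:
    """Drop stop words and number the surviving tokens from 1."""
    kept = [token for token in tokens if token.lower() not in stop_words]
    return [(token, position) for position, token in enumerate(kept, start=1)]


if __name__ == "__main__":
    print(stop_filter("welcome to the world of Solr".split(), STOP_WORDS))
```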
stopwords.txt
Synonym Handling
- Synonym Filter: finds synonyms at indexing time as well as at query time
- Tokens are looked up in the list of synonyms; if a match is found, the synonyms are put in place of the token
Example: define the synonyms in a file (test_synonyms.txt) and use it when comparing tokens
  home, dwelling, house
  shop => workshop, store
  teh => the
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="test_synonyms.txt"/>
  </analyzer>
Tokenizer input: "teh home shop"
Tokenizer output / filter input: "teh"(1), "home"(2), "shop"(3)
Filter output: "the"(1), "home"(2), "dwelling"(2), "house"(2), "workshop"(3), "store"(3)
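A rough sketch of synonym substitution is given below. It assumes one particular reading of the rules: "teh" is replaced by "the", "shop" is replaced by "workshop" and "store", and "home" is expanded to its whole group. It also assumes that every synonym takes the position of the token it came from. The SYNONYMS table and the synonym_filter function are hypothetical names, not part of Solr.

```python
from typing import Dict, List, Tuple

# Assumed synonym rules for this example: token -> terms emitted in its place.
SYNONYMS: Dict[str, List[str]] = {
    "teh": ["the"],
    "shop": ["workshop", "store"],
    "home": ["home", "dwelling", "house"],
}


def synonym_filter(tokens: List[str], synonyms: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Replace each token by its synonyms, all sharing the original position."""
    output: List[Tuple[str, int]] = []
    for position, token in enumerate(tokens, start=1):
        # Tokens without a rule pass through unchanged.
        for term in synonyms.get(token, [token]):
            output.append((term, position))
    return output


if __name__ == "__main__":
    print(synonym_filter(["teh", "home", "shop"], SYNONYMS))
```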
INDEXING
Indexing
- Refers to adding content to a SOLR index in order to make that content searchable
- Sources of data for indexing:
  - XML
  - CSV
  - Rich text formats (PDF, MS Word, MS Excel, text, etc.)
  - Data extracted from tables in a database
Posting Data to SOLR
- Uploading Data with SOLR Cell
  - Using ExtractingRequestHandler
  - With a POST
  - With SOLR Cell and SOLRJ
- Uploading Data with Index Handlers
  - XMLUpdateRequestHandler for XML-formatted Data
  - Using the CSVRequestHandler for CSV Content
  - Indexing Using SOLRJ
- Uploading Structured Data Store Data with the Data Import Handler
- Content Streams
cURL Utility
- curl posts and retrieves data over HTTP, FTP, and many other protocols
- In the example below, the Extracting Request Handler is called; it uploads the file tutorial.html and assigns it the unique ID doc1
  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@tutorial.html"
- literal.id provides a unique ID for the document uploaded to SOLR
- commit=true makes the document searchable after indexing
- The -F flag instructs curl to POST data using the Content-Type multipart/form-data and supports the uploading of binary files
- The @ symbol instructs curl to upload the attached file
- The argument myfile=@tutorial.html needs a valid file path
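When the extract URL is built programmatically, the query string can be assembled from a dictionary instead of being typed by hand. The sketch below only constructs the URL and sends nothing; the function name extract_url and the base URL are assumptions for illustration, and the parameter values are the ones used in the curl example.

```python
from urllib.parse import urlencode


def extract_url(base: str, doc_id: str) -> str:
    """Build the /update/extract URL for one document (no request is sent)."""
    params = {
        "literal.id": doc_id,            # unique ID assigned to the document
        "uprefix": "attr_",              # prefix for fields not defined in the schema
        "fmap.content": "attr_content",  # map the extracted content to this field
        "commit": "true",                # make the document searchable after indexing
    }
    return f"{base}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(extract_url("http://localhost:8983/solr", "doc1"))
```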
Example: XMLUpdateRequestHandler
Order of operation:
1. Modify the schema.xml file to add any fields that do not already exist in it, for example: authors, dd, isbn, yearpub, publisher
2. Modify the schema.xml file to copy the newly created fields to the text field, so that their content can be found by searches
3. Run the curl utility with the command for adding an XML document:
  curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary "
  <add>
    <doc>
      <field name='id'>doc26</field>
      <field name='authors'>Patrick Eagar</field>
      <field name='subject'>Sports</field>
      <field name='dd'>796.35</field>
      <field name='isbn'>0002166313</field>
      <field name='yearpub'>1982</field>
      <field name='publisher'>Collins</field>
    </doc>
    <commit waitFlush='false' waitSearcher='false'/>
  </add>"
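Hand-written XML is easy to get wrong, so the add message can also be generated from a dictionary of field values. This is a small sketch using Python's standard xml.etree.ElementTree module; build_add_xml is a hypothetical helper, it produces only the add/doc part of the message, and it does not post anything to SOLR.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_add_xml(fields: Dict[str, str]) -> str:
    """Serialize one document as an <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)  # special characters are escaped on serialization
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_xml({
        "id": "doc26",
        "authors": "Patrick Eagar",
        "subject": "Sports",
        "publisher": "Collins",
    }))
```

Generating the message this way ensures that characters such as & or < inside field values are escaped.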
Uploading Structured Data Store Data with the Data Import Handler
- Data is often stored in relational databases
- The Data Import Handler (DIH) provides a mechanism to import data from a database and index it
- DIH can also index content from RSS and ATOM feeds, e-mail repositories, and structured XML
Configuration
- The handler must be registered in the solrconfig.xml file:
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">${solr.config.dir:./solr/conf}/dataimporthandler/dataconfig.xml</str>
    </lst>
  </requestHandler>
- There can be multiple configuration files
DIH Example 1. Create a database in SQL Server 2005 2. The tables and the relationships in the database are shown below
DIH Example
3. Create an XML file called DIH_Test.xml for importing into SOLR
4. Modify the solrconfig.xml file to instruct SOLR to import data as per the file DIH_Test.xml
DIH Example
5. Do a full-import of the DIH from the browser using: http://localhost:8983/solr/dataimport?command=full-import
DIH Example
8. Run queries on the newly indexed data from the database
Example: http://localhost:8983/solr/select?q=ipad2
The above query returns the result. Executing queries on the original database returns similar results
DIH Example – Multiple Datasources
QUERY FORMAT AND MATCHING DOCUMENT TO A QUERY
Searching in SOLR
- qt: selects a Request Handler for a query using /select
- wt: selects a response writer for formatting the query response
- defType: selects a query parser for the query
- qf: selects which fields to query in the index
- fq: filters the query by applying an additional query to the initial query's results; caches the results
- rows: specifies the number of rows to be displayed at run time
- start: specifies an offset into the query results where the returned response should begin
[Diagram labels: Request Handler, Response Writer, Query Parser, Index]
Query Syntax and Parsing: The Standard Query Parser
- Advantage: enables the user to specify very precise queries
- Disadvantage: is less tolerant of syntax errors than the DisMax query parser
- Parameters supported:
  - Terms: wildcard characters, fuzzy searches, boosts, and ranges
  - Fields: identified by name followed by a colon
  - Boolean operators: AND, OR, NOT, &&, ||, !, +, -
  - Common query parameters: debugQuery, defType, explainOther, fl, fq, omitHeader, rows, sort, start, timeAllowed
  - Functions: abs, constant, div, fieldValue, log, linear, max, etc.
  - Faceting
  - Highlighting
  - MoreLikeThis (mlt)
Standard Query Parser Parameters
- q: defines a query using standard query syntax. This parameter is mandatory
- q.op: specifies the default operator for query expressions, overriding the default operator defined in schema.xml. Possible values are "AND" or "OR"
- df: specifies a default field, overriding the definition of a default field in schema.xml
- Default parameter values are specified in solrconfig.xml
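The parameters above are ordinary URL query parameters, so a search URL can be assembled from a dictionary. The sketch below only builds the URL; select_url is a hypothetical helper, and the option values passed in the usage example (including the default field name text) are assumptions chosen for illustration.

```python
from typing import Dict, Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(base: str, query: str, options: Optional[Dict[str, object]] = None) -> str:
    """Build a /select URL: q is mandatory, everything else is optional."""
    params: Dict[str, object] = {"q": query}
    params.update(options or {})
    # urlencode percent-encodes characters such as ':' and spaces.
    return f"{base}/select?{urlencode(params)}"


if __name__ == "__main__":
    url = select_url(
        "http://localhost:8983/solr",
        "id:6H500F0",
        {"q.op": "AND", "df": "text", "rows": 10, "wt": "json"},
    )
    print(url)
    # Decoding the query string recovers the original parameter values.
    print(parse_qs(urlsplit(url).query))
```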
Sample Responses
Example query: http://localhost:8983/solr/select?q=id:6H500F0&popularity=6
Term Modifiers: To Add Flexibility and Precision
Fuzzy searches
- Based on the Levenshtein distance (edit distance)
- E.g. tight~ will match terms like flight, slight, etc.
- An additional parameter specifies the required degree of similarity: tight~0.8 will match sight
- The closer this optional parameter is set to 1, the higher the similarity a term needs in order to be matched
- If the numerical parameter is omitted, the default value is 0.5
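The edit distance behind fuzzy matching can be computed with the standard dynamic-programming algorithm. The sketch below is illustrative only: the similarity function divides the distance by the length of the shorter term, which is an assumed normalization, and Lucene's actual scoring and threshold comparison may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus distance over the shorter length."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 1.0 if a == b else 0.0
    return 1.0 - levenshtein(a, b) / shorter


if __name__ == "__main__":
    for other in ("sight", "slight", "flight"):
        print(other, levenshtein("tight", other), similarity("tight", other))
```

Under this normalization, sight is one edit away from tight, while flight is two edits away and therefore scores lower.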
Term Modifiers: To Add Flexibility and Precision
Range searches
- Specify a range of values (with a lower and an upper bound) for a field
- Can be inclusive or exclusive of the lower and upper bounds: square brackets [ ] include the bounds, curly braces { } exclude them
- Query: http://localhost:8983/solr/select?q=popularity:{5 TO 7}
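The difference between inclusive and exclusive bounds comes down to which comparison operator is used. The sketch below is a plain Python illustration of that rule, with in_range as a hypothetical helper; it does not parse Solr query syntax.

```python
def in_range(value: float, lower: float, upper: float, inclusive: bool) -> bool:
    """Inclusive bounds behave like [lower TO upper], exclusive like {lower TO upper}."""
    if inclusive:
        return lower <= value <= upper
    return lower < value < upper


if __name__ == "__main__":
    # With exclusive bounds 5 and 7, only values strictly between them match.
    for popularity in (5, 6, 7):
        print(popularity, in_range(popularity, 5, 7, inclusive=False))
```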
Common Query Parameters
  Parameter     Description
  defType       Query parser to be used (DisMax or Standard Query Parser)
  sort          Sorts the response to a query in asc or desc order based on the response's score or another characteristic
  start         Offset into the responses at which SOLR should begin displaying content
  rows          Number of rows of responses displayed at a time
  fq            Filter query for search results
  fl            Limits responses to a listed set of fields
Common Query Parameters
  Parameter     Description
  debugQuery    Includes debugging information
  timeAllowed   Time allowed for a query to be processed. If the time elapses before the response is complete, partial information is returned
  omitHeader    Excludes header information from the returned results
  wt            Specifies the response writer
Function Queries
- Used to generate a relevancy score from the actual value of one or more numeric fields
- Functions available for function queries:
  - abs: abs(x); abs(-5)
  - constant: 1.5; _val_:1.5
  - div: div(1,y); div(sum(x,100), max(y,1))
  - linear: linear(x, m, c); linear(x, 2, 4) returns 2*x+4
  - log: log(x); log(sum(x,100))
- Ways to include a function query in a SOLR query:
  - With the _val_ keyword, e.g. _val_:myNumericField
  - As a parameter with an explicit type of FunctionQuery (e.g. the DisMax query parser's bf parameter)
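The arithmetic behind these functions can be checked with plain Python equivalents. The definitions below are illustrative re-implementations rather than Solr code; in particular, log is written here as a base-10 logarithm, and the sample field values x and y are made up for the example.

```python
import math


def linear(x: float, m: float, c: float) -> float:
    """linear(x, m, c) evaluates m*x + c."""
    return m * x + c


def div(x: float, y: float) -> float:
    """div(x, y) evaluates x / y."""
    return x / y


def log(x: float) -> float:
    """Base-10 logarithm."""
    return math.log10(x)


if __name__ == "__main__":
    x, y = 50.0, 0.0  # hypothetical field values
    print(abs(-5))                       # abs(-5)
    print(linear(x, 2, 4))               # linear(x, 2, 4) = 2*x + 4
    print(div(sum((x, 100)), max(y, 1)))  # max(y, 1) guards against division by zero
    print(log(sum((x, 50))))             # log of x + 50
```

The div(sum(x,100), max(y,1)) pattern is worth noting: wrapping the divisor in max(y,1) keeps the expression defined when the field value is zero.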
Function Query: Example
http://localhost:8983/solr/select/?q=cat:electronics _val_:"div(price,weight)"&fl=*,score
Response Writers
- Generate a formatted response for a search
- The wt parameter sets the response writer
- Response writers supported:
  - json
  - php
  - phps
  - python
  - ruby
  - xml
  - xslt
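With wt=json, the response can be consumed directly by client code. The sketch below parses a hand-written sample shaped like a Solr JSON response (a responseHeader object and a response object holding numFound, start and docs). The sample text and its field values are hypothetical, and doc_ids is an illustrative helper.

```python
import json
from typing import List

# Hypothetical sample shaped like a JSON response; not captured from a real server.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{"id": "doc26", "publisher": "Collins"}]
  }
}
"""


def doc_ids(raw_json: str) -> List[str]:
    """Return the id of every document in the response body."""
    body = json.loads(raw_json)
    return [doc["id"] for doc in body["response"]["docs"]]


if __name__ == "__main__":
    print(doc_ids(SAMPLE_RESPONSE))
```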
Bibliography
- http://wiki.apache.org/solr/FrontPage (link last accessed on 04/25/2011)
- Lucid Works SOLR Reference Guide 1.4, http://www.lucidimagination.com/user_download/certified/cdrg/lucidworks-solr-refguide-1.4.pdf (link last accessed on 04/25/2011)