Apache Solr

Transcript of Apache Solr (uploaded 11-Aug-2015)

  1. Apache Solr. Oberseminar, 12.06.2015. Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen. Péter Király, pkiraly@gwdg.de
  2. What is Apache Solr? Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene.
  3. History in one minute. 1999: Doug Cutting publishes Lucene; 2004: Yonik Seeley publishes Solr; 2006: Apache project (2007: top-level project); 2009: LucidWorks company; 2010: merge of the Lucene and Solr projects; 2011: 3.1; 2012: 4.0; 2015: 5.0.
  4. Sister projects. Nutch: web-scale search engine; Tika: document parser; Hadoop: distributed storage and data processing; Elasticsearch: alternative to Solr; forks/ports of Lucene; client libraries and tools (Luke index viewer).
  5. Main features I. Faceted navigation; hit highlighting; query language; schema-less mode and schema REST API; JSON, XML, PHP, Ruby, Python, XSLT, Velocity and custom Java binary output formats; HTML administration interface.
  6. Main features II. Replication to other Solr servers; distributed search through sharding; search-result clustering based on Carrot2; extensible through plugins; relevance boosting via functions; caching of queries, filters, and documents; embeddable in a Java application.
  7. Main features III. Geo-spatial search, including multiple points per document and polygons; automated management of large clusters through ZooKeeper; function queries; field collapsing and grouping; auto-suggest.
  8. Inverted index. Original documents:

     Doc # | Content field
     ------|-----------------------
     1     | A Fun Guide to Cooking
     2     | Decorating Your Home
     3     | How to Raise a Child
     4     | Buying a New Car
  9. Inverted index. Index structure:

     Term      | Doc1 | Doc2 | Doc3 | Doc4 | Doc5 | Doc6 | Doc7
     ----------|------|------|------|------|------|------|-----
     a         |  0   |  1   |  1   |  1   |  0   |  0   |  0
     becomming |  0   |  0   |  0   |  0   |  1   |  0   |  0
     beginners |  0   |  0   |  0   |  0   |  0   |  1   |  0
     buy       |  0   |  0   |  1   |  0   |  0   |  0   |  0

     Document columns are stored as bit vectors; terms are stored as references into a tree structure.
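The inverted-index idea above can be sketched in a few lines of Python; the four documents are taken from the previous slide, everything else (lowercasing, whitespace splitting) is a deliberately naive stand-in for a real analysis chain:

```python
from collections import defaultdict

docs = {
    1: "A Fun Guide to Cooking",
    2: "Decorating Your Home",
    3: "How to Raise a Child",
    4: "Buying a New Car",
}

# term -> set of document ids (the posting list)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["a"]))   # -> [1, 3, 4]
print(sorted(index["to"]))  # -> [1, 3]
```

Lucene stores the same information far more compactly: postings are compressed, and the term dictionary is a tree-like structure, as the slide notes.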
 10. Indexing. A document is roughly analogous to an RDBMS record. Fields (key-value structure): types (text, numeric, date, point, custom); attributes: indexed, stored, multiValued, required; field-name patterns (prefixes, suffixes, such as *_tx); special fields (identifier, _version_).
 11. Indexing. Formats: JSON, XML, binary, RDBMS, ...; connections: file, Data Import Handler, API; sharding (splitting the index into multiple parts); denormalized documents - (almost) no JOIN ;-(; copy field; catch-all field (contains everything).
 12. A document example (XML):

     <add>
       <doc>
         <field name="id">F8V7067-APL-KIT</field>                             <!-- string -->
         <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field> <!-- text -->
         <field name="cat">electronics</field>                                <!-- multiValued -->
         <field name="cat">connector</field>
         <field name="price">19.95</field>                                    <!-- float -->
         <field name="inStock">false</field>                                  <!-- boolean -->
         <field name="store">45.18014,-93.87741</field>                       <!-- geo point -->
         <field name="manufacturedate_dt">2005-08-01T16:30:25Z</field>        <!-- date -->
       </doc>
     </add>
 13. A document example (JSON):

     {
       "id": "F8V7067-APL-KIT",
       "name": "Belkin Mobile Power Cord for iPod w/ Dock",
       "cat": ["electronics", "connector"],
       "price": 19.95,
       "inStock": false,
       "store": "45.18014,-93.87741",
       "manufacturedate_dt": "2005-08-01T16:30:25Z"
     }
 14. A document example (SolrJ library):

     SolrServer solr = new HttpSolrServer("http://...");
     SolrInputDocument doc = new SolrInputDocument();
     doc.setField("id", "F8V7067-APL-KIT");
     doc.setField("name", "Belkin Mobile Power Cord for iPod w/ Dock");
     ...
     solr.add(doc);
     solr.commit(true, true); // waitFlush, waitSearcher
 15. Text analysis chain. 1) Character filters preprocess text: pattern replace, ASCII folding, HTML stripping. 2) Tokenizers split text into smaller units: whitespace, lowercase, word delimiter, standard. 3) Token filters examine, modify, or eliminate tokens: stemming, lowercase, stop words, ...
 16. Text analysis chain (diagram).
 17. Text analysis result. Input: "#Yummm :) Drinking a latte at Caffè Grecco in SF's historic North Beach. Learning text analysis". Output tokens: #yumm, drink, latte, caffe, grecco, sf/san francisco, historic north beach, learn, text, analysis.
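The three stages of the chain can be mimicked with plain Python; the regex and the tiny stop-word list below stand in for Solr's configurable filters, and there is no stemming, so tokens like "drinking" stay unstemmed (all names here are illustrative, not Solr APIs):

```python
import re

STOPWORDS = {"a", "at", "in", "the"}

def analyze(text):
    # 1) character filter: fold case and strip punctuation/emoticons
    text = re.sub(r"[^\w\s#]", " ", text.lower())
    # 2) tokenizer: split on whitespace
    tokens = text.split()
    # 3) token filter: drop stop words (a real chain would also stem)
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("#Yummm :) Drinking a latte at Caffe Grecco"))
# -> ['#yummm', 'drinking', 'latte', 'caffe', 'grecco']
```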
 18. Performing queries: 1) the user enters a query (and specifies other components); 2) the query handler receives it; 3) analysis (a chain similar to the one used at indexing); 4) the search is run; 5) components (facets, highlighting, etc.) are added; 6) serialization (XML, JSON etc.).
 19. Lucene query language. *:* (matches everything); gwdg (search in the default field); name:gwdg; name:admin* (wildcard); h?ld (matches hold, held); name:administrator~ (fuzzy match, e.g. administrators, administration); name:Gesellschaft~0.6 (with a similarity threshold).
 20. Lucene query language. name:Max AND name:Planck; name:Max OR name:Planck; name:Max NOT name:Planck; name:"Max Planck" (phrase); name:(Max Planck OR Gesellschaft); "Max Planck"~3 (within 3 words, so it also matches Planck Max and Max Ludwig Planck).
 21. Lucene query language. max planck^10 (boosting); price:[10 TO 20] (inclusive range, 10..20); price:{10 TO 20} (exclusive range, 11..19); born:[1900-01-01T00:00:00Z TO 1949-12-31T23:59:59Z] (date range).
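Query strings like these are normally sent URL-encoded to Solr's /select handler; a minimal sketch using only the Python standard library (the host, core name, and parameter values are illustrative assumptions, not from the slides):

```python
from urllib.parse import urlencode

params = {
    "q": 'name:"Max Planck"~3',   # proximity query, as on the slide
    "fq": "price:[10 TO 20]",     # inclusive range as a filter query
    "wt": "json",                 # response serialization format
}
# hypothetical local Solr instance with a core named "demo"
url = "http://localhost:8983/solr/demo/select?" + urlencode(params)
print(url)
```

Filter queries (fq) are a good fit for ranges like this because their results are cached independently of the main query.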
 22. Date mathematics. Indexing with hour granularity: "born": "2012-05-22T09:30:22Z/HOUR". Searching by a relative time range, e.g. the last month: born:[NOW/DAY-1MONTH TO NOW/DAY]. Keywords: MINUTE, HOUR, DAY, WEEK, MONTH, YEAR.
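The NOW/DAY rounding can be mimicked with datetime; note that Solr's 1MONTH is calendar-aware, while the timedelta below is only a 30-day approximation, and the sample date is illustrative:

```python
from datetime import datetime, timedelta

def round_to_day(dt):
    # NOW/DAY: truncate to midnight, like Solr's date-math rounding
    return dt.replace(hour=0, minute=0, second=0, microsecond=0)

now = datetime(2015, 6, 12, 9, 30, 22)
start = round_to_day(now) - timedelta(days=30)  # rough stand-in for -1MONTH
query = f"born:[{start.isoformat()}Z TO {round_to_day(now).isoformat()}Z]"
print(query)  # -> born:[2015-05-13T00:00:00Z TO 2015-06-12T00:00:00Z]
```

In practice you would simply send the NOW/DAY-1MONTH expression to Solr and let the server do the arithmetic; rounding to DAY also makes the filter cacheable across requests within the same day.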
 23. Faceted search. Facets let users get an overview of the content and help them browse without entering search terms (search theorists consider browsing and searching equally important). Term/field facet: list terms and counts; query facet: run queries, return counts; range facet: split a range into pieces.
 24. Term facets. Request: &facet=true&facet.field=TYPE. Response:

     "facet_fields": {
       "TYPE": [
         "IMAGE", 25334764,
         "TEXT", 16990647,
         "VIDEO", 702787,
         "SOUND", 558825,
         "3D", 21303 ] }

     http://europeana.eu - Europeana portal
 25. Term facet. Additional parameters: limit, offset - for pagination; sort (by index or count) - alphabetically or by frequency; mincount - filter out less frequent terms; missing - the number of documents missing this field; prefix - such as http, to display URLs only. f.[field name].facet.[parameter] overrides the general setting for a single field.
 26. Query facets. Request:

     &facet=true&
     facet.query=price:[* TO 5}&
     facet.query=price:[5 TO 10}&
     facet.query=price:[10 TO 20}&
     facet.query=price:[20 TO 50}&
     facet.query=price:[50 TO *]

     Response:

     "facet_counts": {
       "facet_queries": {
         "price:[* TO 5}": 6,
         "price:[5 TO 10}": 5,
         "price:[10 TO 20}": 3,
         "price:[20 TO 50}": 6,
         "price:[50 TO *]": 0 },
 27. Query facets (zooming): from centuries to years. http://pcu.bage.es/ - Catálogo Colectivo de las Bibliotecas de la Administración General del Estado (Union Catalogue of the Libraries of the Spanish General State Administration).
 28. Range facet. Request:

     &facet=true&
     facet.range=price&
     facet.range.start=0&
     facet.range.end=50&
     facet.range.gap=5

     Response:

     "facet_ranges": {
       "price": {
         "counts": [
           "0.0", 6, "5.0", 5, "10.0", 0, "15.0", 3, "20.0", 2,
           "25.0", 2, "30.0", 1, "35.0", 0, "40.0", 0, "45.0", 1 ],
         "gap": 5.0, "start": 0.0, "end": 50.0 }}
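The bucketing behind facet.range is easy to emulate: Solr counts values per half-open [start, start+gap) interval. A sketch with made-up prices (the data is illustrative, not from the slide):

```python
def range_facet(values, start, end, gap):
    # one count per half-open [edge, edge + gap) bucket, Solr-style
    counts = {}
    edge = start
    while edge < end:
        counts[edge] = sum(1 for v in values if edge <= v < edge + gap)
        edge += gap
    return counts

prices = [3.0, 4.5, 11.99, 19.95, 27.0, 45.18]
print(range_facet(prices, 0, 50, 5))
# -> {0: 2, 5: 0, 10: 1, 15: 1, 20: 0, 25: 1, 30: 0, 35: 0, 40: 0, 45: 1}
```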
 29. Hit highlighting. Request: ?...&hl=true&hl.fl=name&hl.simple.pre=<em>&hl.simple.post=</em>. Response:

     "highlighting": {
       "SP2514N": {    <- document ID
         "name": [ "SpinPoint P120 SP2514N - <em>hard</em> <em>drive</em> - 250 GB - ATA-133" ] }
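The mechanics — wrap every query term in the pre/post markers — can be sketched in a few lines (a naive regex stand-in; real Solr highlights using the analyzed terms from the index, not raw string matching):

```python
import re

def highlight(text, terms, pre="<em>", post="</em>"):
    # wrap each whole-word, case-insensitive match in the markers
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(1) + post, text)

print(highlight("SpinPoint P120 SP2514N - hard drive - 250 GB", ["hard", "drive"]))
# -> SpinPoint P120 SP2514N - <em>hard</em> <em>drive</em> - 250 GB
```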
 30. More like this (similar documents). The mlt (more like this) handler takes: doc ID; fields; boost; limit; minimum term length and frequency. http://catalog.lib.kyushu-u.ac.jp/en/ - Kyushu University library catalog
 31. More like this (alternative solution): (DATA_PROVIDER:("NIOD")^0.2 OR what:("IMAGE" OR "Amerikaanse Strijdkrachten" OR "Luchtmacht" OR "Steden - Zie ook: Ruimtelijke ordening, Wederopbouw, Dorpen")^0.8) NOT europeana_id:"/2021622/11607"
 32. Multilingual search.
 33. Multilingual search strategies. Separate fields per language: title_en:horse OR title_de:horse OR title_hu:horse. Separate collections (core, shard) per language: each core has its own language settings and the same field names; /select?shards=.../english,.../spanish,.../french&q=title:horse. All languages in one field (from Solr 5.0): title:(es|escuela OR en,es,de|school OR school).
 34. Multilingual search with query translation. An API rewrites the query: horse -> (Hauspferd OR Ló OR Paard OR ...).
 35. Relevancy. The most important concepts: term frequency (tf) - how often a particular term appears in a matching document; inverse document frequency (idf) - how rare a search term is, the inverse of the document frequency (in how many documents in total the search term appears); field normalization factor (field norm) - a combination of factors describing the importance of a particular field on a per-document basis.
 36. Relevancy:

     score(q,d) = sum over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )
                  * coord(q,d) * queryNorm(q)

     where t = term, d = document, q = query, f = field

     tf(t in d)   = (number of term occurrences in the document)^(1/2)
     norm(t,d)    = d.getBoost() * lengthNorm(f) * f.getBoost()
     idf(t)       = 1 + log(numDocs / (docFreq + 1))
     coord(q,d)   = numTermsInDocumentFromQuery / numTermsInQuery
     queryNorm(q) = 1 / sumOfSquaredWeights^(1/2)
     sumOfSquaredWeights = q.getBoost()^2 * sum of (idf(t) * t.getBoost())^2

     see: Solr in Action, p. 67
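The tf and idf components of this formula can be checked against the numbers Solr reports in its debug/explain output (slide 38 shows freq=2, docFreq=2, maxDocs=32); a minimal sketch:

```python
import math

def tf(term_freq):
    # square root of the raw term frequency
    return math.sqrt(term_freq)

def idf(doc_freq, num_docs):
    # 1 + log(numDocs / (docFreq + 1)), natural log
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# the values reported in the explain output for "hard":
print(tf(2.0))     # ~1.4142135 = sqrt(2)
print(idf(2, 32))  # ~3.3671236 = 1 + ln(32/3)
```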
 37. Debug. ?...&debug=true

     "debug": {
       "rawquerystring": "hard drive",
       "querystring": "hard drive",
       "parsedquery": "text:hard text:drive",
       "parsedquery_toString": "text:hard text:drive",
 38. Debug (explain):

     "explain": {
       "6H500F0":
         1.209934 = (MATCH) sum of:
           0.6588537 = (MATCH) weight(text:hard in 2) [DefaultSimilarity], result of:
             0.6588537 = score(doc=2, freq=2.0), product of:
               0.73792744 = queryWeight, product of:
                 3.3671236 = idf(docFreq=2, maxDocs=32)
                 0.21915662 = queryNorm
               0.8928435 = fieldWeight in 2, product of:
                 1.4142135 = tf(freq=2.0), with freq of:
                   2.0 = termFreq=2.0
                 3.3671236 = idf(docFreq=2, maxDocs=32)
           ...
 39. References: http://lucene.apache.org/solr/; Grainger & Potter: Solr in Action; https://lucidworks.com/blog/; http://blog.sematext.com/; http://solr.pl/; https://www.packtpub.com/all?search=solr; http://www.slideshare.net/treygrainger
 40. Happy searching!