1 CS520 Web Programming Full Text SearchChengyu Sun California State University, Los Angeles
2 Search Text Web search Desktop search ApplicationsSearch posts in a web forum Search product descriptions at an online retailer …
3 Database Query Find the posts regarding “SSHD login errors”.select * from posts where content like ‘%SSHD login errors%’; Here are the steps to take to fix the SSHD login errors: … Please help! I got SSHD login errors!
4 Problems with Database QueriesPlease help! I got an error when I tried to login through SSHD! There a problem recently discovered regarding SSHD and login. The error message is usually … The solution for sshd/login errors: … And how about performance??
5 Full Text Search (FTS) More formally known as Information Retrieval (IR) Search LARGE amount of textual data
6 Characteristics of FTSVs. Databases “Fuzzy” query processing Relevancy ranking
7 Accuracy of FTS # of relevant documents retrieved Precision =# of documents retrieved # of relevant documents retrieved Recall = # of relevant documents
8 Stripping non-textual dataJourney of a Document document Stripping non-textual data tokenizing Removing stop words Stemming index Indexing
9 Document Original Text-only
The solution for
sshd/login errors: … The solution for sshd/login errors: …10 Convert Different Document Types to TextHTML CyberNekoHTML PDF Apache PDFBox MS Word Apache POI More at Lucence FAQ - Indexing Q28-32
11 Extract Text from PDF and MS Word DocumentsFTS Example Pdf2Txt Doc2Txt
12 Tokenizing [the] [solution] [for] [sshd] [login] [errors] …
13 Chinese Text Example Text: 今天天气不错。 Unigram: [今] [天] [天] [气] [不] [错]Bigram: [今天] [天天] [天气] [气不] [不错] Grammar-based: [今天] [天气] [不错]
14 Stop Words Words that do not help in search and retrievalFunction words: a, an, and, the, of, for … After stop words removal: [the] [solution] [for] [sshd] [login] [errors] … Problem of stop word removal??
15 Stemming Reduce a word to its stem or root form. Examples:connection, connections connected, connecting connective connect [solution] [sshd] [login] [errors] … [solve] [sshd] [login] [error] …
16 Inverted Index cat documents dog keywords buckets # of words positions22, 137 234 cat documents dog keywords buckets
17 Query Processing Query tokenizing Removing stop words StemmingSearching results Ranking
18 Relevancy (Similarity)How well the document matches the query E.g. weighted vector distance How “important” the document is E.g. based on ratings, citations, and links
19 Example: Lucene Similarity Score # of times a term appears in a document # of documents that contain the term # of query terms found length of a field boost factor - field and/or document …
20 FTS Implementations Databases Stand-alone IR librariesMySQL: MyISAM tables only PostgreSQL (since 8.3) Oracle, DB2, MS SQL Server, ... Stand-alone IR libraries Lucene, Egothor, Xapian, MG4J, …
21 FTS in PostgreSQL Since 8.3tsearch/tsearch2 module before 8.3
22 Text Search ConfigurationSpecify the options to transform a document to a tsvector – tokenization, stop words removal, stemming etc. psql commands \dF show default_text_search_config; set default_text_search_config=english; Change default text search configuration in $DATA/postgresql.conf
23 Sample Schema create table messages ( id serial primary key,subject varchar(4092), content text, author varchar(255) );
24 Basic Data Types and Functionstsvector tsquery Functions to_tsvector to_tsquery plainto_tsquery
25 Query Syntax plainto_tsquery to_tsquery full text search
26 The Match Operator @@ tsvector @@ tsquery tsquery @@ tsvectortext tsquery to_tsvector(text) tsquery text text to_tsvector(text) plainto_tsquery(text) Note that there is no tsquery text.
27 Query Examples Find the messages that contain “computer programs” in the content Find the messages that contain “computer programs” in either the subject or the content
28 Create an Index on Text Column(s)create index messages_content_index on messages using gin(to_tsvector('english',content)); Expression (function) index The language parameter is required in both index construction and query
29 Use a Separate Column for Text SearchCreate a tsvector column Use a trigger to update the column
30 Create an Index on the tsvector Columncreate index messages_tsv_index on messages using gin(tsv); The language parameter is no longer required
31 More Functions setweight(tsvector, ”char”) ts_rank(tsvector, tsquery)B: 0.4 C: 0.2 D: 0.1 ts_rank(tsvector, tsquery) ts_headline(text, tsquery)
32 Function Examples Set the weight of subject to be “A” and the weight of content to be “D” List the results by their relevancy scores and highlight the query terms in the results
33 Using Native SQL in JPA String sql = “select * from employees where id = ?”; entityManager.createNaiveQuery(sql, Employee.class) .setParameter(1, employeeId) .getResultList();
34 Named Query in Entity Class@Table( name=“employees” ) @NamedQueries({ @NamedQuery( name=“employee.findAll”, query=“select * from employees” ), @NamedQuery( name=“employee.findById”, query=“from Employee where id = :id” ) }) public class Employee { …. } A named query can be JPQL or SQL.
35 Named Query in Hibernate Mapping File
36 Using Named Query in DAOentityManager .createNamedQuery(“employee.findAll”, Employee.class) .getResultList(); entityManager .createNamedQuery(“employee.findById”, Employee.class) .setParameter( “id”, employeeId ) .getSingleResult();
37 Example: Course Search in CSNS2csns-create.sql CourseDao and CourseDaoImpl NamedQueries.hbm.xml