php and mysql full text search across multiple tables using a relevancy score

the setup

I am currently building a paltfrom to host my air applications on. this is plattform includes a feedback and bug tracking forum application where users can post their thoughts and problems. In order for this to be helpfull i needed to implement full text search. the forum constists of basically two tables.

  • 1.) threads (title,date,author_id,forum_id,threadComment_id)
  • 2.) comments (text, author, date,thread_id)

so the way this works is that a user can open up a thread , which stores a new thread and a comment in the database. the two are linked by adding the thread Id to the comment and the comments unique id to the thread. any futur comments that are posted to this thread will simply have the thread Id stored in them.  your table structure might look differnt but the general principle remains that tables are linked to each other by the unique ids of the tables rows.

full text search

the easiest way to search through text in mysql is by doing something like this

SELECT * FROM comments WHERE text LIKE %$SEARCH_QUERY%

this will give you all the comments that have a specific term in their text field. this will not help us much in our case since relevant information is spread out across two tables. I want to search the title and the text of the comment. so we will use this mySQL Statement

MATCH (field1,field2) AGAINST(query 1,query 2)

this uses a very sophisticated algorithm and ignores words like “so, with, and…”. check the mysql docs for details, but this is not the point here. We want to search across multiple tables and MATCH AGAINST only works on one table at a time. dam it I thought when I realized this. but there is a quite simple and fairly performant solution.

relevancy score

this concept makes it possible to assign a numeric value to each row of a table and then add up the linked table rows scores to a combined relevancy score that can be used to filter the search results. first you want to split up you search query. I will illustrate this on the setting I described above.

$tR=”MATCH(threads.title) AGAINST (‘$s’IN BOOLEAN MODE) “;

so first we create a score for the threads title by doing a simple match against. this will return me a numeric value. note: this is just a php string that will later be added to the final sql query.

$cR=”(SELECT SUM(MATCH(comments.text) AGAINST (‘$s’IN BOOLEAN MODE)) as icR FROM comments WHERE comments.threadId=threads.id)”;

now the magic happens. I go through the comments table and add the score of all those tables that match my current thread id. this chunk will also return me a numeric value.

$sql=”SELECT *,($tR*1.1+$cR) as R  FROM threads HAVING R>0.2 ORDER BY R DESC “;

this is not the final sql statment with everything put together. lets examine what happens:

($tR*1.1+$cR) as R

here I simply add the scores of the title and the scores of the comments of the thread to a single value R ( Relevancy), I can even weigh the scores differently by a factor. In my case I want the title to be more important than the comments.

HAVING R>0.2

by using this peace of code I will throw out all the records that have a score lower than 0.2 there by giving the user only results that are at least some what relevant to his search query.

conclusion

my example is a really simple one but it illustrates really nicely how to achieve a decent result with very little code. since the scores are calculated seperatly for each individual table I search in it can be used on way more than just two tables.

June 24 2009 02:07 pm | Tech Questions

Trackback URI | Comments RSS

Leave a Reply