Home > Back-end >  How could I provide more accurate search results?
How could I provide more accurate search results?

Time:07-09

I am trying to implement a search bar for a web application. What I currently have is an input that receives a String, which I then attempt to tokenize into an Array of String objects, like so:

function tokenize(string: string) {
  string = string.replace(/[\W_]/g, ' ');
  string = string.replace(/\s\s /g, ' ');

  const tokens: string[] = [];
  string.split(' ').forEach((element) => element && tokens.push(element));

  return tokens;
}

What I end up with is an Array that contains the search terms that the user had provided. I do not include any non-word characters, as they are generally unimportant in this scenario. These are some examples of what the tokenize function would return when given an input that contains non-word characters:

Input: "Zero 7 - Give It Away"
Output: [ "zero", "7", "give", "it", "away" ]

Input: "Thievery Corporation - Claridad (feat. Natalia Clavier)"
Output: [ "thievery", "corporation", "claridad", "feat", "natalia", "clavier" ]

My next course of action would be to iterate over each element of the return value and compare them to the titles of the available tracks (objects that contain track data), to see if the element is part of any title (probably through a regular expression).

Should that be true, then I would append that specific track to the search results. Otherwise, I would simply move on to compare the next element of the now tokenized Array of String objects, until every search term had been tested.

The issue I have with this approach is that it would not provide the accuracy that I would ideally want to have. Say for instance that there were multiple tracks by the artist Zero 7. At first, that would return any track created by them - which is exactly what one would expect, but what if the user provided even more search terms?

As an example, a user could type "zero 7 - give" into the search bar, which would return any track by Zero 7. Actually, it is not even that precise, since the word "zero" and the number "7" would be tested for other tracks as well - and so would the word "give".

This is an issue that I have trouble with, since I do not know how to apply my tokenized values in a way that provides more accurate search results. Note that I am not after any code, but instead a breakdown of the steps that would be necessary to facilitate accurate search results (at least better than those I can currently provide).

My objective is to be somewhat flexible, yet accurate when it comes to the search results. I would still want the search terms "zero" and "7" to return any track by that artist. However, a more precise search, like "zero", "7", and "give" should ideally only return tracks by the same artist that also have "give" in their title.

I would not want to match "give" to any other track if the artist can be determined - which is entirely possible, since each track does include the name of the artist that produced the track. This is kind of where my creativity has come to an impasse, as I cannot seem to puzzle the pieces together.

Any help would be highly appreciated. I have never created a more advanced search bar, but I would like to improve my knowledge in this area. I think that the data is all there, however, the approach is lacking.

CodePudding user response:

Basically, what you're looking for is relevance scoring (or a "ranking function"). The idea is, we assign each "document" (track name in your case) some numeric "score" based on keywords and display documents with sorted by their score.

To compute the simplest possible ranking function, we add 1 to the score for each keyword that occurs in a document:

for (doc of documents)
   doc.score = keywords.reduce((score, kw) => score   doc.includes(kw))
documents.sort(...by score, reversed...)

So if there are 5 keywords, the documents that contain all 5 will get the score 5, and documents that contain none will be scored 0.

There are many possible improvements to this method. For example, you can give very common words, like "the", lower priority than rare words. To do that, you compute the so-called IDF for each keyword, which is the total number of documents divided by the number of docs that contain this keyword, and add keywords' IDFs to the total score rather than just 1.

Another idea is to give advantage to documents that contain keywords in the specific order, e.g. when the search is "red socks", documents like "blue shirt and red socks" will be ranked higher than documents like "blue socks and red dress". To achieve that, you compute the score for "n-grams" (sequences of keywords) first and resort to individual keywords if nothing could be found.

Further info:

  • Related