Home > Blockchain >  How to build a search filter that validates even partially incorrect values?
How to build a search filter that validates even partially incorrect values?

Time:09-27

I've been using a SearchView in order to filter items in a RecyclerView, however I'm only validating strings containing the value of the string I filter with. What I want is to be able to validate strings even if there are some missclicks and 1/2 letters are incorrect (for example if I wrote "Hellp World" instead of "Hello World" I still wanna have the "Hello World" string to be valid).

I've been searching how to do that for a while, and the only thing I found was using a phonetic match for the strings, and then filter using that, but I couldn't find how to implement it in Kotlin, I only saw it used in PHP, and I don't even think it really does what I want.

I've had a hard time putting my need into words so I hope this is clear.

My current filter is taking a list of strings and filtering based on if the search value is contained in these strings, like so:

val filteredList = ArrayList<ShowModel>()
for(str in stringsList) {
    if(str.lowercase().contains(filterPattern)) {
        filteredList.add(str)
    }
}

How should I go about building a less picking search filter ?

CodePudding user response:

(this is a bit long but I'm basically talking about why this is a complex problem - there are some actual solutions you can use in the end bit!)

The problem is that computers are exact and functions like filters are based on meeting very specific conditions. They don't do fuzziness, so if you want that, you'll have to code that behaviour in yourself (unless you wanted to train a neural network until it seems like it does what you want, but that's a whole other thing)

So you need to be precise about what you want, so you can write code that does that specific thing. You've mentioned three things, but only vaguely:

  • handling "misclicks"
  • letting 1/2 of the characters be "incorrect"
  • phonetic matching

To you, a human, those things might be intuitively simple - you have an idea of what a "misclick" is, you know what words (or word fragments) sound like and how they could be confused with others, you have an idea of what it means for half of the letters to be "wrong". A computer has no idea about those things, so you need to tell it how to handle those scenarios, and exactly what it needs to do to identify matches correctly. This is not a simple thing!


Let's take the "half wrong" example. Even from the start, you and I might have different ideas of what that actually means - are we talking 50% of the characters can be completely different? Should a pair of swapped letters count as equally wrong as a pair of random letters, i.e. cmopuetr is just as wrong as cvzpuqrr? How would you score those two words as a match for computer? What specific steps would you take to do that?

How about partial matches, where you have letters missing from the end? Should plan be a closer match to plane than flan? I feel like it should be! But all these details make the comparison harder. If it were a simple "incorrect characters count" you could do something (naive) like

fun String.scoreMatch(other: String) =
    zip(other).count { it.first == it.second } / maxOf(length, other.length).toFloat()

to get the percentage of characters in the longer word that are in the correct place - but even that favours plane over flan (80% vs 75% because of the extra length), and it doesn't work for e.g. a missing letter at the start (every character is shifted, and therefore "wrong", similar to the issue of swapped characters). See how this can start to get complex real fast? And this is before we even start thinking about efficiency or anything!


I'm not telling you not to try this, just trying to give you a sense of the scale of the problem, and how vague the problem space actually is (which itself makes the problem hard to solve, since you don't know what you need to do!) But if you can relax your requirements, there are a bunch of ready-made solutions you could maybe use.

There are some common algorithms out there, like calculating the Levenshtein distance, which will give you some form of fuzzy matching - that might be a good solution for you, in that it doesn't do all the things you mentioned, but it gives you some kind of smart lookup, and maybe that's enough!

Here's a blog that explains some fuzzy matching algorithms, including a phonetic one you could look into if you really want that. The blog is trying to sell you their own AI matching solution, but it gives you a few things to look into!

The Levenshtein algorithm is something you could implement yourself if you want, or there'll be a bunch of libraries out there that do it - maybe there's a nice fast one. Or if you're happy to use the Apache Commons library they have a text.similarity package with a bunch of different distance and scoring algorithms you can use. There might be a specialised, more focused library out there though - have a look around! Here's a Java port of a Python library I've seen come up a bit.

Bear in mind this could be heavy processing, depending on how many strings you have to check, so you might want to implement some kind of rate limiter on the search - delay the search and reset that delay if the user keeps typing, that kind of thing

  • Related