Home > Mobile >  Java: alternative to String.contains that can return similarity
Java: alternative to String.contains that can return similarity

Time:11-28

I have three strings

String a = Hello, how are you doing?
String b = Can I as you something?
String c = Hello, how are you doing? Can I ask you something?

My goal is to evaluate if String c is a merge of String a and b. Note that there is a typo in String b where "as" should be "ask".

The current logic is (pesudo-code):

if 
  String c contains String a AND String b
then 
  merge = true

The problem I have is that if there is a slight change in string c during the merge, the String.contains() is no longer valid as it returns false while checking String b.

Is there a possibility / idea to use an alternative and valid my example?

I was trying with string similarity (Jaccard, etc.) but those are not working as the size of a, b and c can vary so it is in easy / possible to get the right similarity percentages.

CodePudding user response:

There isn't any built-in function (that i found) that does this, but i came up with something that hopefully suits what you need. You can obviously change this (i tried to make it as clean as possible)

Step one: we need a function that takes in two strings and returns the number of differences in the two. I came up with this very simple function:

public static int getNumberDifferences(String a, String b)
{
    int maxLength = Math.max(a.length(), b.length());
    int minLength = Math.min(a.length(), b.length());
    int result = maxLength - minLength;//the difference in length between the two

    for(int i = 0; i < minLength; i  )
    {
        if(a.charAt(i) != b.charAt(i)) //If the characters are different
            result  ; //Add one to the result
    }

    return  result;
}

So in short we iterate through the string and add one to the number of differences every time we encounter a difference. (Note that at the start i take the difference of the lengths of the two strings, so this also counts the difference in size)

Step 2: We need another function that takes in every word (in an array) and returns every difference it encounters. I came up with another super-simple function for this:

    public static int getNumberDifferences(String[] a, String[] b)
{
    int result = 0;

    for(int i = 0; i < Math.min(a.length, b.length); i  )
    {
        result  = getNumberDifferences(a[i], b[i]);
    }

    return result;
}

in this function we simply add all of the differences between every one of the words in the strings.

And finally, we display that:

    public static void main(String[] args)
{
    String a = "Hello, how are you doing?" ;
    String b = "Can I ask you something?";
    String c = "Hello, how are you doing? Can I ask you something?";

    int differences = getNumberDifferences(
            (a   " "   b) //Join the two strings with a space in the middle
                    .split(" "), //Split them to take every word
            c.split(" ")); //Split c as well

    System.out.println(differences);
}

So the final code is this:

public class Main {

public static void main(String[] args)
{
    String a = "Hello, how are you doing?" ;
    String b = "Can I ask you something?";
    String c = "Hello, how are you doing? Can I ask you something?";

    int differences = getNumberDifferences(
            (a   " "   b) //Join the two strings with a space in the middle
                    .split(" "), //Split them to take every word
            c.split(" ")); //Split c as well

    System.out.println(differences);
}

public static int getNumberDifferences(String[] a, String[] b)
{
    int result = 0;

    for(int i = 0; i < Math.min(a.length, b.length); i  )
    {
        result  = getNumberDifferences(a[i], b[i]);
    }

    return result;
}

public static int getNumberDifferences(String a, String b)
{
    int maxLength = Math.max(a.length(), b.length());
    int minLength = Math.min(a.length(), b.length());
    int result = maxLength - minLength; //the difference in length between the two

    for(int i = 0; i < minLength; i  )
    {
        if(a.charAt(i) != b.charAt(i)) //If the characters are different
            result  ; //Add one to the result
    }

    return  result;
}

}

Please let me know if this helped :)

CodePudding user response:

How correctly marks in comments, you must have comparing with Levenshtein distance.

You want to comparing 2 string with using similarity percentages, so we may correlate this percentages as relation distance between strings to length of reference string. So, if we'll require 100% similarity ours strings must be ab solutely equals and we will have distance between string as 0. And opposite: if we'll require 100% similarity ours strings must be absolutely different and we will have distance almost as length of reference string (or more).

I'm named similarity percentages as allowedDiscrepancy since it's more informative. So, my code has methods distance for calculating distance between reference string and another and compareWithDiscrepancy method, to correlation. Check this, it works.

public class StringUtils {
    public static void main(String[] args) {
        final String a = "Hello, how are you doing?";
        final String b = "Can I as you something?";
        final String c = "Hello, how are you doing? Can I ass you something?";

        // allowedDiscrepancy = 1.0 (100%) - strings might be absolutely different
        //So, we have 2 strings with little difference, so it must be return "true"
        assertTrue(compareWithDiscrepancy(c, String.format("%s %s", a, b), 1.0));
        // allowedDiscrepancy = 0.0 (0%) - strings must be absolutely equals
        //So, we have 2 strings with little difference, but more than 0, so it must be return "false"
        assertFalse(compareWithDiscrepancy(c, String.format("%s %s", a, b), 0.0));

        final String sameA = "Hello.";
        final String sameB = "How are you?";
        final String sameC = String.format("%s %s", sameA, sameB);

        // allowedDiscrepancy = 1.0 (100%) - strings might be absolutely different
        //So, we have 2 strings absolutely equals, so it must be return "true"
        assertTrue(compareWithDiscrepancy(sameA, String.format("%s %s", sameA, sameB), 1));
        // allowedDiscrepancy = 0.0 (0%) - strings must be absolutely equals
        //So, we have 2 strings absolutely equals, so it must be return "true" too
        assertTrue(compareWithDiscrepancy(sameC, String.format("%s %s", sameA, sameB), 0));

        final String differentA = "Part 1.";
        final String differentB = "Part 2.";
        final String differentC = "Absolutely different string";

        // allowedDiscrepancy = 1.0 (100%) - strings might be absolutely different
        //So, we have 2 absolutely different strings, so it must be return "true"
        assertTrue(compareWithDiscrepancy(differentC, String.format("%s %s", differentA, differentB), 1));
        // allowedDiscrepancy = 0.0 (0%) - strings must be absolutely equals
        //So, we have 2 absolutely different strings, so it must be return "false" too
        assertFalse(compareWithDiscrepancy(differentC, String.format("%s %s", differentA, differentB), 0));

        System.out.println("Done!");
    }

    public static boolean compareWithDiscrepancy(final String referenceString, final String testedString, double allowedDiscrepancy) {
        if (allowedDiscrepancy < 0) allowedDiscrepancy = 0;
        if (allowedDiscrepancy > 1) allowedDiscrepancy = 1;

        int distance = distance(referenceString, testedString);
        double realDiscrepancy = distance * 1.0 / referenceString.length();
        if (realDiscrepancy > 1) realDiscrepancy = 1;
        return allowedDiscrepancy >= realDiscrepancy;
    }

    static int distance(String x, String y) {
        int[][] dp = new int[x.length()   1][y.length()   1];

        for (int i = 0; i <= x.length(); i  ) {
            for (int j = 0; j <= y.length(); j  ) {
                if (i == 0) {
                    dp[i][j] = j;
                } else if (j == 0) {
                    dp[i][j] = i;
                } else {
                    dp[i][j] = min(dp[i - 1][j - 1]
                              cost(x.charAt(i - 1), y.charAt(j - 1)),
                        dp[i - 1][j]   1,
                        dp[i][j - 1]   1);
                }
            }
        }

        return dp[x.length()][y.length()];
    }

    public static int cost(char a, char b) {
        return a == b ? 0 : 1;
    }

    public static int min(int... numbers) {
        return Arrays.stream(numbers)
            .min().orElse(Integer.MAX_VALUE);
    }
}

  • Related