Count occurrence of multiple substrings-CodePudding

I want to count the occurrence of multiple substrings in a string. I am able to do so by using the following code:

int score = 0;
String text = "This is some random text. This is some random text.";
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
for (String word : words) {
    if(StringUtils.containsIgnoreCase(text, word)){
        score  = 1;
    }
}

My algorithm increases the score by 1 for each word from my "words" list that occurs in the text. In my example, the score would be 2 because "this" and "is" occur in the text.

However, my code has to loop through the text for each string in my list. Is there a faster way to do this?

CodePudding user response：

How about the following:

String text = "This is some random text. This is some random text.";
text = text.toLowerCase();
String[] tokens = text.split("\\PL ");
java.util.Set<String> source = new java.util.HashSet<>();
for (String token : tokens) {
    source.add(token);
}
java.util.List<String> words = java.util.Arrays.asList("this", "is", "for", "stackoverflow");
source.retainAll(words);
int score = source.size();

Split text into words.
Add the words to a Set so that each word appears only once. Hence Set will contain the word this only once despite the fact that the word this appears twice in text.
After calling method retainAll, the Set only contains words that are in the words list. Hence your score is the number of elements in the Set.

CodePudding user response：

The fastest way would be to map each words of the text.

Therefore for each word in the words that you are searching for, you just have to look up for the keys in the hashmap.

Given your text has n word and your words has m words The solution would take O(n m) instead of O(n*m)

CodePudding user response：

This is a case where regex is your friend:

public static Map<String, Integer> countTokens(String src, List<String> tokens) {
    Map<String, Integer> countMap = new HashMap<>();
    String end = "(\\s|$)"; //trap whitespace or the end of the string.
    String start = "(^|\\s)"; //trap for whitespace or the start of a the string
    //iterate through your tokens
    for (String token : tokens) {
        //create a pattern, note the case insensitive flag
        Pattern pattern = Pattern.compile(start token end, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(src);
        int cnt = 0;
        //count your matches.
        while(matcher.find()) {
            cnt  ;
        }
        
        countMap.put(token, cnt);
    }
    
    return countMap;
}


public static void main(String[] args) throws IOException {
    String text = "This is some random text. This is some random text.";
    List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
    
    for (Entry<String, Integer> entry : countTokens(text, words).entrySet()) {
        System.out.println(entry);
    }
}

If you want to find tokens within token, like "is" within "this", simply remove the start and end regex.

CodePudding user response：

You can use split method, to convert string to Array of Strings, sort it, and then you can binary search the elements of list in the array this has been implemented in the given code.

String[] wordsArr =  "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
int count = 0;
Arrays.sort(wordsArr);
for(String word: words)
    if(Arrays.binarySearch( wordsArr, word )>-1)
         count  ;

Another good approach can be to use a TreeSet, this one I got inspiration from @Abra

String[] wordsArr =  "This is some random text. This is some random text.".toLowerCase().split(" ");
List<String> words = Arrays.asList("this", "is", "for", "stackoverflow");
TreeSet<String> setOfWords = new TreeSet<String>(Arrays.asList(wordsArr));
int count = 0;
for(String word: words)
    if(setOfWord.contains(word))
        count  ;

Both these methods have a Time Complexity of O(Nlog(M)), N being the size of words array, M being the size of wordsArr or setOfWords, However do be careful, using this method since this does have one flaw, which is quite obvious, It doesn't account for periods, thus, if you were to search for "text", it won't be found, because, The set/array contains "text.". You can get around that by removing all the punctuations from your initial text and searched text, however, if you do want it to be accurate then, you can set the regex string in split() to be "[^a-zA-Z]" this will split your String around non alphabetical characters.