regex verification of two strings in stream

Time:10-14

I have two lists:

  1. A list of Strings called tableRows.

  2. A list of objects, each of which contains a list of String values (among other things).

I am trying to iterate over them via streams as follows:

tableRows.forEach(row -> objects.forEach(o -> o.getValues().forEach(v ->  {

The task is to find every value "v" and the table row it is used in (if any), and then to call a method that processes the table row, the value found in it, and the name of the object the value belongs to.

    if (doesContain(row.getText(), v.getValue())) {
        methodCall....
        }

The containment check is currently implemented with regex patterns:

boolean doesContain(String string, String substring) {
        String pattern = "^" + substring + " \\b|\\b " + substring + " \\b|\\b " + substring + "$";
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(string);
        return m.find();
}

The code works. However, the first list consists of 3000-6000 Strings (and more in the future), while the second list has around 100 objects, each with around 20 strings in its list to check against.

It currently takes around 1.3 minutes to process a list of 3000 strings, while a timeout is triggered after 1 minute.

The major bottleneck seems to be the use of Pattern and Matcher, which takes around 40 seconds. The word boundary is necessary, so I cannot use equals, contains or similar methods. My thought was to somehow use predicates directly in the stream, so far without success.

Is it possible to use regex directly in the stream, while preserving the "row" and "v" values that are connected to each other, and then call the method for each successful pair afterwards?

Or is there an alternative approach that could be faster?

Edit: a minimal example. Main class:

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class main {

    public static void main(String[] args) {


        ArrayList<String> tableRows = new ArrayList<>();
        tableRows.add("Text one");
        tableRows.add("Text two");
        tableRows.add("Text three");
        tableRows.add("Text four");
        tableRows.add("Text five");

        ArrayList<String>  valueList = new ArrayList<>();
        valueList.add("one");
        valueList.add("two");
        valueList.add("none");
        valueList.add("four");
        valueList.add("abc");
        ArrayList<testObject> objects = new ArrayList<testObject>();
        objects.add(new testObject("thirstName", valueList));
        objects.add(new testObject("secondName", valueList));

        String replacePattern = Pattern.quote(".") + "|" + Pattern.quote("?") + "|" + Pattern.quote("!") + "|" +
                Pattern.quote(",") + "|" + Pattern.quote(";");
        tableRows.forEach(row -> objects.forEach(o -> o.getValues().forEach(v -> {

            if (doesContain(row, v,replacePattern)) {
                System.out.println("Value :" + v + " of Object " + o.getName() + " is used in row: " + row);
            }
        })));

    }
    private static boolean doesContain(String string, String substring,String replacePattern) {

        string = string.toLowerCase().replaceAll(replacePattern, "");
        substring = substring.toLowerCase();
        String pattern = "^" + substring + " \\b|\\b " + substring + " \\b|\\b " + substring + "$";
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(string);
        return m.find();
    }
}

testObject class:

import java.util.ArrayList;
import java.util.List;

public class testObject {

    private String name;
    private ArrayList<String> values;


    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public ArrayList<String> getValues() {
        return values;
    }

    public void setValues(ArrayList<String> values) {
        this.values = values;
    }

    public testObject() {
        this.name = "testName";
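        // List.of returns an immutable list that is not an ArrayList, so copy it instead of casting
        this.values = new ArrayList<>(java.util.List.of("one","two"));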
    }

    public testObject(String name, ArrayList<String> values) {
        this.name = name;
        this.values = values;
    }
}

example output:

Value:  one  of Object: thirstName is used in row: Text one
Value:  one  of Object: secondName is used in row: Text one
Value:  two  of Object: thirstName is used in row: Text two
Value:  two  of Object: secondName is used in row: Text two
Value:  four  of Object: thirstName is used in row: Text four
Value:  four  of Object: secondName is used in row: Text four

CodePudding user response:

When you want to ensure that a match is a whole word, you should simply add the word boundary anchor to the word itself, instead of adding spaces and trying to apply the anchor to the previous and next word. This simplifies the pattern that matches "word" but not "words" to "\\bword\\b". Further, you can combine all words of a list into a single pattern like "\\b(one|two|none|four|abc)\\b" to search for the first occurrence of any of those words. This eliminates the need to iterate over the words. However, when more than one of these words occurs in the string, it may report a different word than the original loop did.
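To make the two pattern shapes above concrete, here is a minimal sketch. The class name WordBoundaryDemo and the helper wholeWordPattern are hypothetical, and the use of Pattern.quote is an added assumption to guard against values that contain regex metacharacters; the sample words are taken from the question.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordBoundaryDemo {

    // Hypothetical helper: builds \b(w1|w2|...)\b from a list of words.
    // Pattern.quote is an assumption added here so that characters such as
    // '.' or '(' inside a value are treated literally.
    static Pattern wholeWordPattern(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "\\b(", ")\\b"));
        return Pattern.compile(alternation);
    }

    public static void main(String[] args) {
        // Single word: matches "word" but not "words"
        Pattern single = Pattern.compile("\\bword\\b");
        System.out.println(single.matcher("a word here").find());     // true
        System.out.println(single.matcher("many words here").find()); // false

        // Combined pattern for the whole value list
        Pattern combined = wholeWordPattern(List.of("one", "two", "none", "four", "abc"));
        System.out.println(combined.matcher("Text one").find());   // true
        System.out.println(combined.matcher("Text three").find()); // false
    }
}
```

One pattern per value list means one scan per row instead of one scan per row and value.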

When you change the iteration logic to objects.forEach(o -> tableRows.forEach(row -> …)), you need to construct each pattern only once per object instead of repeatedly for each row. But this will print the results in a different order, of course.

Then, avoid the widespread anti-pattern of converting all strings to uppercase or lowercase to effectively perform a case-insensitive search. Java has dedicated methods for case-insensitive searching; I don't know why so many developers ignore them and prefer to perform expensive conversions instead. Mind that the search can stop at the first occurrence, whereas the toLowerCase() conversion has to process the entire string, potentially producing a new string, before the search has even started.
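As an illustration of searching without a prior case conversion, the following sketch uses the Pattern.CASE_INSENSITIVE flag along with two String methods that take case handling into account. The class and method names are hypothetical. Note that CASE_INSENSITIVE on its own only folds US-ASCII letters; for other alphabets it would have to be combined with Pattern.UNICODE_CASE.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveDemo {

    // Hypothetical helper: whole-word search that ignores case,
    // without calling toLowerCase() on either argument.
    static boolean containsWordIgnoreCase(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWordIgnoreCase("Text ONE", "one"));   // true
        System.out.println(containsWordIgnoreCase("Text Money", "one")); // false

        // Non-regex counterparts on java.lang.String
        System.out.println("ONE".equalsIgnoreCase("one"));                  // true
        System.out.println("Text ONE".regionMatches(true, 5, "one", 0, 3)); // true
    }
}
```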

Bringing these points together, you can do the operation like:

objects.forEach(o -> {
    Pattern p = Pattern.compile(
        o.getValues().stream().collect(Collectors.joining("|", "\\b(", ")\\b")),
        Pattern.CASE_INSENSITIVE);
    tableRows.forEach(row -> {
        Matcher m = p.matcher(row);
        if(m.find()) {
            System.out.println("Value: " + m.group()
                + " of Object " + o.getName() + " is used in row: " + row);
        }
    });
});

This constructs the combined pattern from the list returned by getValues() for each testObject instance. Since the pattern may match any of those words, we have to query the matcher (m.group()) to find out which word we found. Mind that there could be more occurrences. This match operation stops at the first word found in the string, whereas your original code stopped at the first word of the list that was found, even if it occurred later in the string than another word that comes later in the list.
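If every occurrence in a row is needed rather than only the first, the matcher can be queried repeatedly. A minimal sketch, with the hypothetical names AllMatchesDemo and findAll and an invented sample row:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllMatchesDemo {

    // Hypothetical helper: collects every match of the pattern in the row,
    // in the order in which the matches appear in the row.
    static List<String> findAll(Pattern p, String row) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(row);
        while (m.find()) {          // each call continues after the previous match
            found.add(m.group());   // the text as it appears in the row
        }
        return found;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\b(one|two|none|four|abc)\\b",
                Pattern.CASE_INSENSITIVE);
        System.out.println(findAll(p, "Four plus one is five")); // [Four, one]
    }
}
```

Because m.group() returns the matched text from the row, the reported value keeps the row's capitalization rather than that of the list entry.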

If you are performing this operation multiple times, you may even store the pattern within the testObject itself, e.g.

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TestObject {
    private String name;
    private List<String> values;
    private Pattern pattern;

    public TestObject() {
        this("testName", List.of("one","two"));
    }

    public TestObject(String name, List<String> values) {
        this.name = name;
        setValues(values);
    }

    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public List<String> getValues() {
        return values;
    }
    public void setValues(List<String> values) {
        this.values = List.copyOf(values);
        pattern = Pattern.compile(
            this.values.stream().collect(Collectors.joining("|", "\\b(", ")\\b")),
            Pattern.CASE_INSENSITIVE);
    }
    public Pattern getPattern() {
        return pattern;
    }
}

I took the opportunity to fix some other issues of this class, like not assuming that every List is an ArrayList. It also always uses an immutable list for its internal storage, to ensure that no-one can change it without a corresponding Pattern update.
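The effect of List.copyOf (available since Java 10) can be checked in isolation. The sketch below uses hypothetical names and only demonstrates the two properties relied on above: the stored copy does not follow later changes to the caller's list, and it rejects modification attempts.

```java
import java.util.ArrayList;
import java.util.List;

public class CopyOfDemo {

    // The stored copy is unaffected when the caller mutates its own list later.
    static boolean isIsolatedFromSource() {
        List<String> source = new ArrayList<>(List.of("one", "two"));
        List<String> stored = List.copyOf(source);
        source.add("three");
        return stored.size() == 2;
    }

    // The stored copy itself cannot be modified.
    static boolean rejectsMutation() {
        List<String> stored = List.copyOf(List.of("one", "two"));
        try {
            stored.add("three");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isIsolatedFromSource()); // true
        System.out.println(rejectsMutation());      // true
    }
}
```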

Besides keeping the patterns between different match operations, this also allows checking the objects in the same order as your original code, e.g.

tableRows.forEach(row -> objects.forEach(o -> {
    Matcher m = o.getPattern().matcher(row);
    if(m.find()) {
        System.out.println("Value: " + m.group()
            + " of Object " + o.getName() + " is used in row: " + row);
    }
}));