rlike regex not handling special characters-CodePudding

I am having difficulty in getting (java) spark's rlike method to handle special characters. Normally a simple escape \\ or quoting with \Q and \E would be sufficient for handling special characters in the regex but they don't seem to work.

Take the example code below. It should show rows 2 & 3 as matching one of the items in alist ignoring case. However the code below only shows row 2 as matching.

How can I filter the rows of the dataset based on the "animal" columns match to one of the items in alist?

    StructType schema = new StructType(new StructField[]{
            new StructField("row_id", DataTypes.IntegerType, false, Metadata.empty()),
            new StructField("animal", DataTypes.StringType, false, Metadata.empty())
    });

    Dataset<Row> dataset = spark.createDataFrame(
            Arrays.asList(
                    RowFactory.create(1, "Bat"),
                    RowFactory.create(2, "Dog"),
                    RowFactory.create(3, "Cat (Type Not Stated)"),
                    RowFactory.create(4, "Other.")
            ), schema);


    List<String> alist = Arrays.asList(
            "\\QDOG\\E",
            "\\QCat (Type Not Stated)\\E");

    dataset = dataset.filter(dataset.col("animal").rlike(
            "(?i)\\b(" String.join("|", alist) ")\\b"
    ));

    dataset.show(5, false);

CodePudding user response：

Your issue is not with the \\Q or \\E pattern, but with the \\b word boundary pattern.

If I change your final rlike regexp by removing \\b pattern as follow:

dataset = dataset.filter(dataset.col("animal").rlike(
  "(?i)(" String.join("|", alist) ")"
));

I get the two lines:

 ------ --------------------- 
|row_id|animal               |
 ------ --------------------- 
|2     |Dog                  |
|3     |Cat (Type Not Stated)|
 ------ ---------------------