Search in multidimensional array using regex in in Spark-CodePudding

I got the following table column named data with the following structure:

{
  "data":[
      {
        "id":"test_111",
        "sum":"10"
      },
      {
        "id":"skip",
        "sum":"74"
       }
    ]
  }

I want to extract all records that have at least 1 record in data object that has id which starts with test_{d} (where {d} is digits). Meaning that if i have another data record with the this value

 {
   "data":[
       {
          "id":"skip_this_too",
          "sum":"123"
        },
        {
          "id":"skip",
          "sum":"11"
         }
     ]
 }

I want to skip it. I have tried to use scala's array_contains function, but it cannot be implemented in multidimensional array. Also, if i try this condition:

WHERE data.data[0].id LIKE 'test_%'

It works, but i'm not sure how to implement the same condition to all of data elements. Any help will be appreciated!

CodePudding user response：

Consider using higher-order function exists along with rlike for what you need, as shown in the following example:

val json = """[
    {
        "row_id": 1,
        "data": [
            { "id": "test_111", "sum": "10" },
            { "id": "skip1", "sum": "74" }
        ]
    },
    {
        "row_id": 2,
        "data": [
            { "id": "test_XYZ", "sum": "20" },
            { "id": "skip2", "sum": "85" }
        ]
    },
    {
        "row_id": 3,
        "data": [
            { "id": "skip3", "sum": "96" },
            { "id": "test_2222", "sum": "30" }
        ]
    } 
  ]"""

val df = spark.read.option("multiline", "true").json(Seq(json).toDS)

df.show(false)
/*
 ------------------------------ ------ 
|data                          |row_id|
 ------------------------------ ------ 
|[{test_111, 10}, {skip1, 74}] |1     |
|[{test_XYZ, 20}, {skip2, 85}] |2     |
|[{skip3, 96}, {test_2222, 30}]|3     |
 ------------------------------ ------ 
*/

To filter the dataframe based on existence of any data.id with the test_[0-9] pattern:

df.
  where(expr("exists(data, d -> d.id rlike 'test_[0-9] ')")).
  show(false)
/*
 ------------------------------ ------ 
|data                          |row_id|
 ------------------------------ ------ 
|[{test_111, 10}, {skip1, 74}] |1     |
|[{skip3, 96}, {test_2222, 30}]|3     |
 ------------------------------ ------ 
*/