I got the following table column named data with the following structure:
{
"data":[
{
"id":"test_111",
"sum":"10"
},
{
"id":"skip",
"sum":"74"
}
]
}
I want to extract all records that have at least 1 record in data object that has id which starts with test_{d} (where {d} is digits). Meaning that if i have another data record with the this value
{
"data":[
{
"id":"skip_this_too",
"sum":"123"
},
{
"id":"skip",
"sum":"11"
}
]
}
I want to skip it. I have tried to use scala's array_contains function, but it cannot be implemented in multidimensional array. Also, if i try this condition:
WHERE data.data[0].id LIKE 'test_%'
It works, but i'm not sure how to implement the same condition to all of data elements. Any help will be appreciated!
CodePudding user response:
Consider using higher-order function exists
along with rlike
for what you need, as shown in the following example:
val json = """[
{
"row_id": 1,
"data": [
{ "id": "test_111", "sum": "10" },
{ "id": "skip1", "sum": "74" }
]
},
{
"row_id": 2,
"data": [
{ "id": "test_XYZ", "sum": "20" },
{ "id": "skip2", "sum": "85" }
]
},
{
"row_id": 3,
"data": [
{ "id": "skip3", "sum": "96" },
{ "id": "test_2222", "sum": "30" }
]
}
]"""
val df = spark.read.option("multiline", "true").json(Seq(json).toDS)
df.show(false)
/*
------------------------------ ------
|data |row_id|
------------------------------ ------
|[{test_111, 10}, {skip1, 74}] |1 |
|[{test_XYZ, 20}, {skip2, 85}] |2 |
|[{skip3, 96}, {test_2222, 30}]|3 |
------------------------------ ------
*/
To filter the dataframe based on existence of any data.id
with the test_[0-9]
pattern:
df.
where(expr("exists(data, d -> d.id rlike 'test_[0-9] ')")).
show(false)
/*
------------------------------ ------
|data |row_id|
------------------------------ ------
|[{test_111, 10}, {skip1, 74}] |1 |
|[{skip3, 96}, {test_2222, 30}]|3 |
------------------------------ ------
*/