I have a column in my DataFrame whose data type is:
testcolumn: array
 |-- element: struct
 |    |-- id: integer
 |    |-- configName: string
 |    |-- desc: string
 |    |-- configparam: array
 |    |    |-- element: map
 |    |    |    |-- key: string
 |    |    |    |-- value: string
testcolumn
Row1:
[{"id":1,"configName":"test1","desc":"Ram1","configparam":[{"removeit":"[]"}]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[{"removeit":"[]"}]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]
Row2:
[{"id":11,"configName":"test11","desc":"Ram11","configparam":[{"removeit":"[]"}]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[{"removeit":"[]"}]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]
I want to replace "configparam":[{"removeit":"[]"}] with "configparam":[] wherever it occurs.
Expected output:
outputcolumn
Row1:
[{"id":1,"configName":"test1","desc":"Ram1","configparam":[]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]
Row2:
[{"id":11,"configName":"test11","desc":"Ram11","configparam":[]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]
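To reproduce, here is a minimal snippet that builds a DataFrame with the schema and rows above (the SparkSession setup is assumed):
from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, IntegerType, MapType,
                               StringType, StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# Schema as described above: an array of structs, where configparam
# is an array of string-to-string maps.
schema = StructType([
    StructField("testcolumn", ArrayType(StructType([
        StructField("id", IntegerType()),
        StructField("configName", StringType()),
        StructField("desc", StringType()),
        StructField("configparam", ArrayType(MapType(StringType(), StringType()))),
    ])))
])

data = [
    ([(1, "test1", "Ram1", [{"removeit": "[]"}]),
      (2, "test2", "Ram2", [{"removeit": "[]"}]),
      (3, "test3", "Ram1", [{"paramId": "4", "paramvalue": "200"}])],),
    ([(11, "test11", "Ram11", [{"removeit": "[]"}]),
      (33, "test33", "Ram33", [{"paramId": "43", "paramvalue": "300"}]),
      (6, "test26", "Ram26", [{"removeit": "[]"}]),
      (93, "test93", "Ram93", [{"paramId": "93", "paramvalue": "3009"}])],),
]

df = spark.createDataFrame(data, schema)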
I have tried the following code, but it is not giving me the expected output:
test=df.withColumn('outputcolumn',F.expr("translate"(testcolumn,x-> replace(x,':[{"removeit":"[]"}]','[]')))
It would be really great if someone could help me.
CodePudding user response:
Your testcolumn is an array of structs, so you cannot apply a string operation to it directly.
You can do something like the following. It will empty configparam completely whenever it contains the key "removeit".
Example:
"configparam":[{"removeit":[], "otherparam": "value"}] -> "configparam": []
from pyspark.sql import functions as F

# Predicate: True when the map does NOT contain the key 'removeit',
# i.e. the configparam entry should be kept.
array_has_remove = lambda y: ~F.array_contains(F.map_keys(y), 'removeit')

df = df.withColumn(
    'outputcolumn',
    F.transform(
        'testcolumn',
        lambda x: x.withField('configparam', F.filter(x['configparam'], array_has_remove))
    )
)
Ref: withField, filter, array_contains, map_keys
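If the Python helpers above are not available (transform, filter and Column.withField were added to the Python API in Spark 3.1), the same idea can be expressed with SQL higher-order functions through F.expr. A sketch, assuming only the column names from the question:
from pyspark.sql import functions as F

# Rebuild each struct, keeping only the configparam maps that do not
# contain the key 'removeit'. Uses SQL higher-order functions (Spark 2.4+).
df = df.withColumn(
    "outputcolumn",
    F.expr("""
        transform(testcolumn, x -> named_struct(
            'id', x.id,
            'configName', x.configName,
            'desc', x.desc,
            'configparam', filter(x.configparam, m -> NOT array_contains(map_keys(m), 'removeit'))
        ))
    """)
)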
CodePudding user response:
You have to perform a chain of explode, filter and groupBy operations to achieve this.
First, explode the array/struct/map columns to reach the nested values:
df = df.withColumn("id", F.col("testcolumn")["id"])
df = df.withColumn("configName", F.col("testcolumn")["configName"])
df = df.withColumn("desc", F.col("testcolumn")["desc"])
df = df.withColumn("configparam_exploded", F.explode(F.col("testcolumn")["configparam"]))
df = df.select(df.columns + [F.explode(F.col("configparam_exploded"))])
+-----------------------------------------------------+--+----------+----+---------------------------------+----------+-----+
|testcolumn                                           |id|configName|desc|configparam_exploded             |key       |value|
+-----------------------------------------------------+--+----------+----+---------------------------------+----------+-----+
|{1, test1, Ram1, [{removeit -> []}]}                 |1 |test1     |Ram1|{removeit -> []}                 |removeit  |[]   |
|{2, test2, Ram2, [{removeit -> []}]}                 |2 |test2     |Ram2|{removeit -> []}                 |removeit  |[]   |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3     |Ram1|{paramId -> 4, paramvalue -> 200}|paramId   |4    |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3     |Ram1|{paramId -> 4, paramvalue -> 200}|paramvalue|200  |
+-----------------------------------------------------+--+----------+----+---------------------------------+----------+-----+
Then, filter data as required:
df = df.filter((F.col("key") != "removeit") | (F.col("value") != "[]"))
+-----------------------------------------------------+--+----------+----+---------------------------------+----------+-----+
|testcolumn                                           |id|configName|desc|configparam_exploded             |key       |value|
+-----------------------------------------------------+--+----------+----+---------------------------------+----------+-----+
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3     |Ram1|{paramId -> 4, paramvalue -> 200}|paramId   |4    |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3     |Ram1|{paramId -> 4, paramvalue -> 200}|paramvalue|200  |
+-----------------------------------------------------+--+----------+----+---------------------------------+----------+-----+
Finally, group the individual columns back into the original structure:
df = df.withColumn("configparam_map", F.map_from_entries(F.array(F.struct("key", "value"))))
df = df.groupBy(["id", "configName", "desc"]).agg(F.collect_list("configparam_map").alias("configparam"))
df = df.withColumn("testcolumn", F.struct("id", "configName", "desc", "configparam"))
df = df.drop("id", "configName", "desc", "configparam")
+-------------------------------------------------------+
|testcolumn                                             |
+-------------------------------------------------------+
|{3, test3, Ram1, [{paramId -> 4}, {paramvalue -> 200}]}|
+-------------------------------------------------------+
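If you prefer a single map per struct (as in the question's data) rather than a list of single-entry maps, the final aggregation could build the map directly. A sketch, assuming the DataFrame state right after the filter step above (map_from_entries requires Spark 2.4+):
from pyspark.sql import functions as F

# Collect (key, value) pairs per group and rebuild them as one map,
# instead of collecting single-entry maps.
df = df.groupBy("id", "configName", "desc").agg(
    F.array(F.map_from_entries(F.collect_list(F.struct("key", "value")))).alias("configparam")
)
df = df.withColumn("testcolumn", F.struct("id", "configName", "desc", "configparam"))
df = df.drop("id", "configName", "desc", "configparam")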
Sample dataset used to reproduce the problem:
from pyspark.sql import Row
from pyspark.sql.types import ArrayType, IntegerType, MapType, StringType, StructField, StructType

schema = StructType([StructField('testcolumn', StructType([
    StructField('id', IntegerType(), True),
    StructField('configName', StringType(), True),
    StructField('desc', StringType(), True),
    StructField('configparam', ArrayType(MapType(StringType(), StringType(), True), True), True)
]), True)])
data = [
    Row(Row(1, "test1", "Ram1", [{"removeit": "[]"}])),
    Row(Row(2, "test2", "Ram2", [{"removeit": "[]"}])),
    Row(Row(3, "test3", "Ram1", [{"paramId": "4", "paramvalue": "200"}]))
]
df = spark.createDataFrame(data=data, schema=schema)
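Note that this sample models testcolumn as a single struct per row; in the question it is an array of structs, so one extra explode would be needed before the steps above. A minimal sketch of that extra step, assuming the array-of-structs schema from the question:
from pyspark.sql import functions as F

# Flatten the array of structs first, so the remaining steps can
# operate on one struct per row.
df = df.withColumn("testcolumn", F.explode("testcolumn"))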