In PySpark, how do I filter a DataFrame that has a column containing a list of dictionaries, based on a specific dictionary key?
+-----------------------------+-------+
|foo_dic_list                 |text   |
+-----------------------------+-------+
|[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
|[{'2': [5,2,3] }]            |student|
|[{'4': [2,2,2]}]             |gamer  |
|[{'3': [3,3,3]}]             |robot  |
+-----------------------------+-------+
I want to select the rows below, which contain "4" among the keys of the foo_dic_list column.
+-----------------------------+-------+
|foo_dic_list                 |text   |
+-----------------------------+-------+
|[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
|[{'4': [2,2,2]}]             |gamer  |
+-----------------------------+-------+
CodePudding user response:
The easy way is to use locate() to find the position of the key inside the string column, and then filter on location > 0, like this:
from pyspark.sql.functions import col, locate

# Assumes an active SparkSession named `spark`.
d1 = [
("[{'1': [1,2,3],'4': [2,3,4]}]", "teacher"),
("[{'2': [5,2,3] }]", "student"),
("[{'4': [2,2,2]}]", "gamer"),
("[{'3': [3,3,3]}]", "robot"),
]
df1 = spark.createDataFrame(d1, ['foo_dic_list', 'text'])
df1.printSchema()
# root
# |-- foo_dic_list: string (nullable = true)
# |-- text: string (nullable = true)
df1.withColumn('location', locate("\'4\':", col('foo_dic_list'))).show(10, False)
# +-----------------------------+-------+--------+
# |foo_dic_list                 |text   |location|
# +-----------------------------+-------+--------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|16      |
# |[{'2': [5,2,3] }]            |student|0       |
# |[{'4': [2,2,2]}]             |gamer  |3       |
# |[{'3': [3,3,3]}]             |robot  |0       |
# +-----------------------------+-------+--------+
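The answer stops before the actual filter step. In PySpark that step would presumably be df1.filter(locate("'4':", col('foo_dic_list')) > 0). To make the logic checkable without a Spark session, here is a plain-Python sketch of the same idea. The locate function below is only a stand-in that mimics Spark's behaviour (1-based position, 0 when the substring is absent); it is not the Spark function itself.

```python
def locate(substr: str, s: str) -> int:
    """Plain-Python stand-in for Spark's locate: 1-based position, 0 if absent."""
    return s.find(substr) + 1


rows = [
    ("[{'1': [1,2,3],'4': [2,3,4]}]", "teacher"),
    ("[{'2': [5,2,3] }]", "student"),
    ("[{'4': [2,2,2]}]", "gamer"),
    ("[{'3': [3,3,3]}]", "robot"),
]

# Keep only the rows where the key pattern occurs somewhere in the string.
matches = [(dic_str, text) for dic_str, text in rows if locate("'4':", dic_str) > 0]

for row in matches:
    print(row)
```

Note that this is a substring match on the raw text, so it depends on the exact quoting and spacing of the stored string.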
CodePudding user response:
This may not be the best way, but we can use a UDF to get the list of keys and then use array_contains() on it to filter. The code below only works if there is just one dictionary within the array.
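from pyspark.sql import functions as func
from pyspark.sql.types import ArrayType, StringType

# Assumes an active SparkSession named `spark`.
data_ls = [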
(['''{'1': [1,2,3],'4': [2,3,4]}'''], 'teacher'),
(['''{'2': [5,2,3] }'''], 'student'),
(['''{'4': [2,2,2]}'''], 'gamer')
]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['foo_dic_list', 'text'])
# +-----------------------------+-------+
# |foo_dic_list                 |text   |
# +-----------------------------+-------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
# |[{'2': [5,2,3] }]            |student|
# |[{'4': [2,2,2]}]             |gamer  |
# +-----------------------------+-------+
# root
#  |-- foo_dic_list: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- text: string (nullable = true)
Create a function that parses the string as JSON (after swapping the single quotes for double quotes), resulting in a dictionary. Then fetch the list of keys using dict.keys().
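import json

def getDictKeys(json_str):
    # The sample strings use single quotes; swap them so json.loads accepts them.
    json_dict = json.loads(json_str.replace("'", '"'))
    # Return the keys as a plain list so Spark can map it to array<string>.
    return list(json_dict.keys())
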
getDictKeys_udf = func.udf(getDictKeys, ArrayType(StringType()))
data_sdf. \
withColumn('arr_element', func.col('foo_dic_list').getItem(0)). \
withColumn('keys_arr', getDictKeys_udf(func.col('arr_element'))). \
filter(func.array_contains('keys_arr', '4')). \
select('foo_dic_list', 'text'). \
show(truncate=False)
# +-----------------------------+-------+
# |foo_dic_list                 |text   |
# +-----------------------------+-------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
# |[{'4': [2,2,2]}]             |gamer  |
# +-----------------------------+-------+
The keys_arr field looks like the following:
data_sdf. \
withColumn('arr_element', func.col('foo_dic_list').getItem(0)). \
withColumn('keys_arr', getDictKeys_udf(func.col('arr_element'))). \
show(truncate=False)
# +-----------------------------+-------+---------------------------+--------+
# |foo_dic_list                 |text   |arr_element                |keys_arr|
# +-----------------------------+-------+---------------------------+--------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|{'1': [1,2,3],'4': [2,3,4]}|[1, 4]  |
# |[{'2': [5,2,3] }]            |student|{'2': [5,2,3] }            |[2]     |
# |[{'4': [2,2,2]}]             |gamer  |{'4': [2,2,2]}             |[4]     |
# +-----------------------------+-------+---------------------------+--------+
# root
#  |-- foo_dic_list: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- text: string (nullable = true)
#  |-- arr_element: string (nullable = true)
#  |-- keys_arr: array (nullable = true)
#  |    |-- element: string (containsNull = true)
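Since the UDF above only looks at the first array element, one possible extension is a helper that collects the keys of every dictionary string in the list. The sketch below is plain Python; the name get_all_dict_keys is hypothetical, and it uses ast.literal_eval instead of the quote replacement, on the assumption that each element is a valid Python dict literal. It could presumably be registered the same way, with func.udf(get_all_dict_keys, ArrayType(StringType())), and applied to func.col('foo_dic_list') directly.

```python
import ast


def get_all_dict_keys(dict_strs):
    """Collect the keys of every dictionary string in the list, in order."""
    keys = []
    for dict_str in dict_strs or []:
        # literal_eval parses Python-literal syntax, so single quotes are fine.
        parsed = ast.literal_eval(dict_str)
        keys.extend(str(key) for key in parsed.keys())
    return keys


# Hypothetical input with two dictionaries in one array.
sample = ["{'1': [1,2,3],'4': [2,3,4]}", "{'7': [0]}"]
print(get_all_dict_keys(sample))
print("4" in get_all_dict_keys(sample))
```

The `or []` guard treats a null array as having no keys, so such rows would simply be filtered out by array_contains().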
The like() operator also worked with my sample, although it is only a substring match on the first array element.
data_sdf. \
filter(func.col('foo_dic_list').getItem(0).like("%'4':%")). \
show(truncate=False)
# +-----------------------------+-------+
# |foo_dic_list                 |text   |
# +-----------------------------+-------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
# |[{'4': [2,2,2]}]             |gamer  |
# +-----------------------------+-------+
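The like() pattern depends on the exact text '4': appearing in the string, so a space before the colon or double quotes would break it. A regular expression is a bit more tolerant. The sketch below checks such a pattern in plain Python; the pattern and the sample strings are illustrative assumptions, not taken from the original data. In Spark, the same pattern could presumably be passed to the column's rlike() method in place of like().

```python
import re

# Matches the key 4 in single or double quotes, with optional spaces before the colon.
KEY_4_PATTERN = re.compile(r"""['"]4['"]\s*:""")

# Hypothetical sample strings mapped to whether the key 4 should be found.
samples = {
    "{'1': [1,2,3],'4': [2,3,4]}": True,
    "{'4' : [2,2,2]}": True,
    '{"4": [2,2,2]}': True,
    "{'14': [2,2,2]}": False,
    "{'2': [5,2,3] }": False,
}

for text, expected in samples.items():
    found = KEY_4_PATTERN.search(text) is not None
    print(text, found)
```

This is still a text match rather than real parsing, so a value that happens to contain the same characters would also match.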