Filter dataframe by key in a list of dictionaries in pyspark

In pyspark, how do I filter a dataframe on a column that holds a list of dictionaries, keeping only the rows where a specific key is present in the dictionary?

+------------------------------------+---------------+
|foo_dic_list                        |text           |
+------------------------------------+---------------+
|[{'1': [1,2,3],'4': [2,3,4]}]       |teacher        |
|[{'2': [5,2,3] }]                   |student        |
|[{'4': [2,2,2]}]                    |gamer          |
|[{'3': [3,3,3]}]                    |robot          |
+------------------------------------+---------------+

I want to select the rows that contain "4" among the keys of the foo_dic_list column, like below.

+------------------------------------+---------------+
|foo_dic_list                        |text           |
+------------------------------------+---------------+
|[{'1': [1,2,3],'4': [2,3,4]}]       |teacher        |
|[{'4': [2,2,2]}]                    |gamer          |
+------------------------------------+---------------+

CodePudding user response：

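The easy way is to use locate(), which returns the 1-based position of the first match (0 if there is none), and then keep the rows where location > 0. This assumes foo_dic_list is stored as a string.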

from pyspark.sql.functions import col, locate

d1 = [
    ("[{'1': [1,2,3],'4': [2,3,4]}]", "teacher"),
    ("[{'2': [5,2,3] }]",             "student"),
    ("[{'4': [2,2,2]}]",              "gamer"),
    ("[{'3': [3,3,3]}]",              "robot"),
]

df1 = spark.createDataFrame(d1, ['foo_dic_list', 'text'])
df1.printSchema()
# root
#  |-- foo_dic_list: string (nullable = true)
#  |-- text: string (nullable = true)
# locate() gives the 1-based position of "'4':" in the string, or 0 if it is absent
df1.withColumn('location', locate("'4':", col('foo_dic_list'))).show(10, False)
# +-----------------------------+-------+--------+
# |foo_dic_list                 |text   |location|
# +-----------------------------+-------+--------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|16      |
# |[{'2': [5,2,3] }]            |student|0       |
# |[{'4': [2,2,2]}]             |gamer  |3       |
# |[{'3': [3,3,3]}]             |robot  |0       |
# +-----------------------------+-------+--------+
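
The filtering step itself is not shown in the snippet. As a minimal sketch of the logic, the plain-Python helper below mirrors how locate() reports positions: the 1-based index of the first match, or 0 when there is none. locate_1based is a hypothetical name for illustration, not a Spark function, and the closing comment is the PySpark filter I would expect to correspond, assuming df1, locate and col as used in the answer.

```python
def locate_1based(substr, s):
    """Return the 1-based position of substr in s, or 0 if it is absent."""
    # str.find() returns -1 when there is no match, so adding 1 yields 0.
    return s.find(substr) + 1


rows = [
    ("[{'1': [1,2,3],'4': [2,3,4]}]", "teacher"),
    ("[{'2': [5,2,3] }]", "student"),
    ("[{'4': [2,2,2]}]", "gamer"),
    ("[{'3': [3,3,3]}]", "robot"),
]

# Keep only the rows whose string contains the key marker '4':
matches = [(s, text) for s, text in rows if locate_1based("'4':", s) > 0]
print([text for _, text in matches])

# Assumed PySpark equivalent (not run here):
# df1.filter(locate("'4':", col('foo_dic_list')) > 0).show(10, False)
```

Note that this is a substring match on the raw text, so it depends on the key being written exactly as '4': with no space before the colon.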

CodePudding user response：

This may not be the best way, but we can use a UDF to get the list of keys and then use array_contains() on that list to filter. The code below only inspects the first element of the array, so it only works if there is exactly one dictionary in the array.

data_ls = [
    (['''{'1': [1,2,3],'4': [2,3,4]}'''], 'teacher'),
    (['''{'2': [5,2,3] }'''], 'student'),
    (['''{'4': [2,2,2]}'''], 'gamer')
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['foo_dic_list', 'text'])

# +-----------------------------+-------+
# |foo_dic_list                 |text   |
# +-----------------------------+-------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
# |[{'2': [5,2,3] }]            |student|
# |[{'4': [2,2,2]}]             |gamer  |
# +-----------------------------+-------+

# root
#  |-- foo_dic_list: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- text: string (nullable = true)

Create a function that parses the string as JSON, which gives a Python dictionary, and then returns the list of its keys using dict.keys().

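import json

from pyspark.sql import functions as func
from pyspark.sql.types import ArrayType, StringType


def getDictKeys(json_str):
    # The stored string uses single quotes, so swap them for double quotes
    # to make it valid JSON before parsing.
    json_dict = json.loads(json_str.replace("'", '"'))
    # Return the dictionary's keys as a list of strings.
    return list(json_dict.keys())


# Register the function as a UDF that returns an array of strings.
getDictKeys_udf = func.udf(getDictKeys, ArrayType(StringType()))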

data_sdf. \
    withColumn('arr_element', func.col('foo_dic_list').getItem(0)). \
    withColumn('keys_arr', getDictKeys_udf(func.col('arr_element'))). \
    filter(func.array_contains('keys_arr', '4')). \
    select('foo_dic_list', 'text'). \
    show(truncate=False)

# +-----------------------------+-------+
# |foo_dic_list                 |text   |
# +-----------------------------+-------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
# |[{'4': [2,2,2]}]             |gamer  |
# +-----------------------------+-------+
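
If an array can hold more than one dictionary string, one option is to collect the keys from every element rather than only getItem(0). The sketch below is plain Python so it can be checked without Spark. getAllDictKeys is a hypothetical helper, and it uses ast.literal_eval in place of the quote replacement plus json.loads, on the assumption that each element is a valid Python dict literal.

```python
import ast


def getAllDictKeys(dict_strs):
    """Collect the keys of every dictionary string in the list."""
    keys = []
    # Treat a missing (None) array as empty so the function never raises on nulls.
    for dict_str in dict_strs or []:
        parsed = ast.literal_eval(dict_str)
        keys.extend(str(k) for k in parsed.keys())
    return keys


print(getAllDictKeys(["{'1': [1,2,3],'4': [2,3,4]}", "{'7': [0]}"]))

# Assumed wiring into Spark, reusing func, ArrayType and StringType (not run here):
# getAllDictKeys_udf = func.udf(getAllDictKeys, ArrayType(StringType()))
# data_sdf.filter(
#     func.array_contains(getAllDictKeys_udf(func.col('foo_dic_list')), '4')
# )
```

Passing the whole array column to the UDF removes the need for the intermediate arr_element column.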

The keys_arr field looks like the following:

data_sdf. \
    withColumn('arr_element', func.col('foo_dic_list').getItem(0)). \
    withColumn('keys_arr', getDictKeys_udf(func.col('arr_element'))). \
    show(truncate=False)

# +-----------------------------+-------+---------------------------+--------+
# |foo_dic_list                 |text   |arr_element                |keys_arr|
# +-----------------------------+-------+---------------------------+--------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|{'1': [1,2,3],'4': [2,3,4]}|[1, 4]  |
# |[{'2': [5,2,3] }]            |student|{'2': [5,2,3] }            |[2]     |
# |[{'4': [2,2,2]}]             |gamer  |{'4': [2,2,2]}             |[4]     |
# +-----------------------------+-------+---------------------------+--------+

# root
#  |-- foo_dic_list: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- text: string (nullable = true)
#  |-- arr_element: string (nullable = true)
#  |-- keys_arr: array (nullable = true)
#  |    |-- element: string (containsNull = true)

The like() operator also worked on my sample data, although it is only a substring match on the first array element.

data_sdf. \
    filter(func.col('foo_dic_list').getItem(0).like("%'4':%")). \
    show(truncate=False)

# +-----------------------------+-------+
# |foo_dic_list                 |text   |
# +-----------------------------+-------+
# |[{'1': [1,2,3],'4': [2,3,4]}]|teacher|
# |[{'4': [2,2,2]}]             |gamer  |
# +-----------------------------+-------+
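
The like() pattern needs the exact text '4': to appear. If the stored strings might have a space before the colon, or use double quotes, a regular expression is more forgiving. Here is a plain-Python sketch of that idea; has_key is a hypothetical helper. Spark columns have an rlike() method that takes a regular expression, so I would expect a similar pattern to work there, but I have not verified that against this data.

```python
import re

# Character class matching either quote style around the key.
QUOTE = "['\"]"


def has_key(dict_str, key):
    """True if dict_str appears to contain key as a quoted dictionary key."""
    # Quoted key, optional whitespace, then the colon that follows a key.
    pattern = QUOTE + re.escape(key) + QUOTE + r"\s*:"
    return re.search(pattern, dict_str) is not None


samples = [
    "{'1': [1,2,3],'4': [2,3,4]}",
    "{'2': [5,2,3] }",
    "{'4' : [2,2,2]}",   # space before the colon
    "{'14': [1]}",       # a different key that merely ends in 4
]
print([has_key(s, "4") for s in samples])
```

Requiring the opening quote directly before the key is what keeps a key such as '14' from matching.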