I have a DataFrame:
d1 = [({'the town': 1, 'County Council s': 2, 'email':5},2),
({'Mayor': 2, 'Indiana': 2}, 4),
({'Congress': 2, 'Justice': 2,'country': 2, 'veterans':1},6)
]
df1 = spark.createDataFrame(d1, ['dct', 'count'])
df1.show()
ignore_lst = ['County Council s', 'emal','Indiana']
filter_lst = ['Congress','town','Mayor', 'Indiana']
I want to write two functions:
the first filters out keys of the dct
column that are in ignore_lst,
and the second keeps only the keys that are in filter_lst.
The result should be two new columns containing dictionaries whose keys are filtered
by ignore_lst and filter_lst respectively.
CodePudding user response:
These two UDFs should be sufficient for your case:
from pyspark.sql.functions import col, udf
d1 = [({'the town': 1, 'County Council s': 2, 'email':5},2),
({'Mayor': 2, 'Indiana': 2}, 4),
({'Congress': 2, 'Justice': 2,'country': 2, 'veterans':1},6)
]
ignore_lst = ['County Council s', 'emal','Indiana']
filter_lst = ['Congress','town','Mayor', 'Indiana']
df1 = spark.createDataFrame(d1, ['dct', 'count'])
@udf
def apply_ignore_lst(dct):
    # drop every key that appears in ignore_lst
    return {k: v for k, v in dct.items() if k not in ignore_lst}

@udf
def apply_filter_lst(dct):
    # keep only the keys that appear in filter_lst
    return {k: v for k, v in dct.items() if k in filter_lst}
df1.withColumn("apply_ignore_lst", apply_ignore_lst(col("dct"))) \
   .withColumn("apply_filter_lst", apply_filter_lst(col("apply_ignore_lst"))) \
   .show(truncate=False)
+----------------------------------------------------------+-----+----------------------------------------------+----------------+
|dct                                                       |count|apply_ignore_lst                              |apply_filter_lst|
+----------------------------------------------------------+-----+----------------------------------------------+----------------+
|{the town -> 1, County Council s -> 2, email -> 5}        |2    |{the town=1, email=5}                         |{}              |
|{Indiana -> 2, Mayor -> 2}                                |4    |{Mayor=2}                                     |{Mayor=2}       |
|{Justice -> 2, Congress -> 2, country -> 2, veterans -> 1}|6    |{Congress=2, Justice=2, country=2, veterans=1}|{Congress=2}    |
+----------------------------------------------------------+-----+----------------------------------------------+----------------+
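Note that the core logic of both UDFs is just a dict comprehension, so you can unit-test it in plain Python without a Spark session. A minimal sketch using the sample data from the question (the `print` lines are illustrative, not part of the UDFs):

```python
# Plain-Python versions of the two filters, Spark-free for quick testing
ignore_lst = ['County Council s', 'emal', 'Indiana']
filter_lst = ['Congress', 'town', 'Mayor', 'Indiana']

def apply_ignore_lst(dct):
    # drop every key that appears in ignore_lst
    return {k: v for k, v in dct.items() if k not in ignore_lst}

def apply_filter_lst(dct):
    # keep only the keys that appear in filter_lst
    return {k: v for k, v in dct.items() if k in filter_lst}

sample = {'the town': 1, 'County Council s': 2, 'email': 5}
print(apply_ignore_lst(sample))   # 'email' survives: the list contains 'emal', not 'email'
print(apply_filter_lst(sample))   # empty: 'the town' is not the same key as 'town'
```

Chaining the two functions reproduces the last two columns of the output above, e.g. `apply_filter_lst(apply_ignore_lst({'Mayor': 2, 'Indiana': 2}))` keeps only `Mayor`.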