I am new to PySpark. I have a PySpark DataFrame as follows:
----------------
| C1 | C2 | C3 |
----------------
| A  | 0  | 1  |
| C  | 0  | 1  |
| A  | 1  | 0  |
| B  | 0  | 0  |
----------------
I also have a Python dictionary as follows:
my_dict = {"A" : "5", "B" : "10"}  # Note: there is no entry with key 'C' here
I would like to keep only those rows whose C1 value is present as a key in my_dict. The output should look like this:
----------------
| C1 | C2 | C3 |
----------------
| A  | 0  | 1  |
| A  | 1  | 0  |
| B  | 0  | 0  |
----------------
Edit: the C1 column entries are more complicated than depicted above. Each is still a string, but it contains quite a few special characters, something like this:
A : www.A.com || u-a : Mozilla/5.0 (iPhone; CPU iPhone OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12H321 [FBAN/FBIOS;FBAV/163.0.0.54.96;FBBV/96876057;FBDV/iPhone7,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/8.4.1;FBSS/3;FBCR/MEO;FBID/phone;FBLC/pt_PT;FBOP/5;FBRV/98697066] || C : none || accept-encoding : gzip, deflate, br || accept-language : en-US,en;q=0.9=223
The string above is used as a key in the dictionary as well.
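For reference, a hypothetical, heavily shortened sketch of how the dictionary looks with such keys (the real keys are much longer than this):

complicated_key = ("A : www.A.com || u-a : Mozilla/5.0 (iPhone; ...) "
                   "|| C : none || accept-encoding : gzip, deflate, br")
my_dict = {complicated_key: "5"}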
CodePudding user response:
You can try the following:
from pyspark.sql.functions import col

input_data = [['A', 0, 1], ['C', 0, 1], ['A', 1, 0], ['B', 0, 0]]
my_dict = {"A": "5", "B": "10"}

# Without an explicit schema the columns are named _1, _2, _3
data = spark.createDataFrame(input_data)

# Keep only rows whose first column appears among the dictionary keys
input_key_list = list(my_dict.keys())
data.where(col("_1").isin(input_key_list)).show()
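If you name the columns when creating the DataFrame, the same filter reads more naturally (a small variant of the snippet above, assuming the c1/c2/c3 names from your example):

data = spark.createDataFrame(input_data, schema=["c1", "c2", "c3"])
data.where(col("c1").isin(input_key_list)).show()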
Another approach could be:
from pyspark.sql.types import StringType

input_data = [['A', 0, 1], ['C', 0, 1], ['A', 1, 0], ['B', 0, 0]]
input_data_columns = ["c1", "c2", "c3"]
my_dict = {"A": "5", "B": "10"}
input_key_list = list(my_dict.keys())

# Single-column DataFrame of the dictionary keys; the column is named "value" by default
keys_data = spark.createDataFrame(input_key_list, StringType())
data = spark.createDataFrame(input_data, schema=input_data_columns)

# Inner join keeps only the rows whose c1 matches a dictionary key
keys_data.join(data, data.c1 == keys_data.value, "inner").select("c1", "c2", "c3").show(truncate=False)
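Regarding the edit: both isin and the join compare complete string values, so special characters in C1 do not matter as long as the dictionary key and the column value are exactly the same string. If the real data DataFrame is large and the key list is small, you could also hint a broadcast join (a sketch using pyspark.sql.functions.broadcast, applied to the snippet above):

from pyspark.sql.functions import broadcast

# Broadcasting the small keys DataFrame avoids shuffling the large side
data.join(broadcast(keys_data), data.c1 == keys_data.value, "inner").select("c1", "c2", "c3").show(truncate=False)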