Remove rows from a pyspark dataframe whose entries in a column are not present in a dictionary


I am new to pyspark. I have a pyspark dataframe as follows:

+---+---+---+
| C1| C2| c3|
+---+---+---+
|  A|  0|  1|
|  C|  0|  1|
|  A|  1|  0|
|  B|  0|  0|
+---+---+---+

I also have another python dictionary as follows:

my_dict = {"A" : "5", "B" : "10"}  # Note there is no entry with key 'C' here

I would like my dataframe to keep only those rows whose C1 values are present as keys in the dictionary my_dict. The output should look like this:

+---+---+---+
| C1| C2| c3|
+---+---+---+
|  A|  0|  1|
|  A|  1|  0|
|  B|  0|  0|
+---+---+---+

Edit: The C1 column entries are a little more complicated than depicted above. Each entry is a string, but one with quite a few special characters. Something like this:

A : www.A.com || u-a : Mozilla/5.0 (iPhone; CPU iPhone OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12H321 [FBAN/FBIOS;FBAV/163.0.0.54.96;FBBV/96876057;FBDV/iPhone7,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/8.4.1;FBSS/3;FBCR/MEO;FBID/phone;FBLC/pt_PT;FBOP/5;FBRV/98697066] || C : none || accept-encoding : gzip, deflate, br || accept-language : en-US,en;q=0.9=223

The string above is used as a key in the dictionary as well.
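To be concrete, the dictionary is keyed by these full strings; a shortened, hypothetical stand-in for one of them:

ua_key = "A : www.A.com || u-a : Mozilla/5.0 (...) || C : none"  # truncated stand-in, not a real key
my_dict = {ua_key: "5", "B": "10"}

# Dictionary lookups match on the exact full string, special characters and all
print(ua_key in my_dict)  # True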

CodePudding user response:

You can try the syntax below:

from pyspark.sql.functions import col

input_data = [['A', 0, 1], ['C', 0, 1], ['A', 1, 0], ['B', 0, 0]]
my_dict = {"A": "5", "B": "10"}

# Collect the dictionary keys into a plain list for the isin() filter
input_key_list = list(my_dict.keys())

data = spark.createDataFrame(input_data)

# Without an explicit schema the columns are named _1, _2, _3;
# keep only the rows whose _1 value appears among the dictionary keys
data.where(col("_1").isin(input_key_list)).show()
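Note that isin() compares for exact string equality, so the special characters in your real C1 values need no escaping (unlike like/rlike patterns). On the sample data this keeps only the A and B rows:

# Expected output (row order may vary):
# +---+---+---+
# | _1| _2| _3|
# +---+---+---+
# |  A|  0|  1|
# |  A|  1|  0|
# |  B|  0|  0|
# +---+---+---+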

Another approach could be:

from pyspark.sql.types import StringType

input_data = [['A', 0, 1], ['C', 0, 1], ['A', 1, 0], ['B', 0, 0]]
input_data_columns = ["c1", "c2", "c3"]
my_dict = {"A": "5", "B": "10"}
input_key_list = list(my_dict.keys())

# Build a one-column dataframe of keys; the column is named "value" by default
keys_data = spark.createDataFrame(input_key_list, StringType())
data = spark.createDataFrame(input_data, schema=input_data_columns)

# An inner join keeps only the rows whose c1 matches a dictionary key
keys_data.join(data, data.c1 == keys_data.value, "inner") \
    .select("c1", "c2", "c3") \
    .show(truncate=False)
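If the real dataframe is large, a broadcast hint on the small key dataframe avoids shuffling the large side; a minimal sketch reusing keys_data and data from above:

from pyspark.sql.functions import broadcast

# Ship the small keys dataframe to every executor instead of shuffling `data`;
# this assumes the key list fits comfortably in memory
data.join(broadcast(keys_data), data.c1 == keys_data.value, "inner") \
    .select("c1", "c2", "c3") \
    .show(truncate=False)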