Distinct records from the string column using pyspark

I have a dataframe like the following:

rdd = spark.sparkContext.parallelize([
    (22,'fl1.variant,fl2.variant,fl3.control','xxx','yyy','zzz'),
    (23,'fl1.variant,fl2.neither,fl3.control','xxx','yyy','zzz'),
    (24,'fl4.variant,fl2.variant,fl4.variant','xxx1','yyy1','zzz1'),
    (25,'fl3.control,fl3.control,fl3.variant','xxx1','yyy1','zzz1')
])

df = rdd.toDF(['Date','Type','Data1','Data2','Data3'])
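
(The snippet above assumes an active SparkSession named spark. A minimal, assumed setup is sketched below; the app name is arbitrary.)

from pyspark.sql import SparkSession

# Assumed setup, not shown in the original post: a local SparkSession named spark.
spark = SparkSession.builder.appName("distinct-type-values").getOrCreate()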

Sample data:

Date | Type                                | Data1 | Data2 | Data3
22   | fl1.variant,fl2.variant,fl3.control | xxx   | yyy   | zzz
23   | fl1.variant,fl2.neither,fl3.control | xxx   | yyy   | zzz
24   | fl4.variant,fl2.variant,fl4.variant | xxx1  | yyy1  | zzz1
25   | fl3.control,fl3.control,fl3.variant | xxx1  | yyy1  | zzz1

I need to identify the distinct values in the Type column based on the Data1, Data2 and Data3 columns. The Type column is a string of values separated by ",".

For each combination of Data1, Data2 and Data3, I need to combine all the rows and identify the unique values of the Type column.

Expected output:

Data1 | Data2 | Data3 | Type_list
xxx   | yyy   | zzz   | [fl1.variant, fl2.variant, fl3.control, fl2.neither]
xxx1  | yyy1  | zzz1  | [fl4.variant, fl2.variant, fl3.control, fl3.variant]

I tried the following, but could not get the expected distinct values:

import pyspark.sql.functions as func
from pyspark.sql.types import StringType

df1 = df.sort("Data1","Data2","Data3","Type"). \
    groupBy("Data1","Data2","Data3"). \
    agg(func.collect_set("Type").cast(StringType())). \
    withColumnRenamed("CAST(collect_set(Type) AS STRING)", "Type_list")

Data1 | Data2 | Data3 | Type_list
xxx   | yyy   | zzz   | [fl1.variant,fl2.variant,fl3.control, fl1.variant, fl2.neither,fl3.control]
xxx1  | yyy1  | zzz1  | [fl4.variant,fl2.variant,fl4.variant, fl3.control,fl3.control,fl3.variant]
df2 = df1.select("Data1","Data2","Data3",func.array_distinct(func.split("Type_list", ",")))

Data1 | Data2 | Data3 | array_distinct(split(Type_list, ,, -1))
xxx   | yyy   | zzz   | [[fl1.variant, fl2.variant, fl3.control, fl1.variant, fl2.neither, fl3.control]]
xxx1  | yyy1  | zzz1  | [[fl4.variant, fl2.variant, fl4.variant, fl3.control, fl3.control, fl3.variant]]
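
The issue is that collect_set de-duplicates whole Type strings, not the individual comma-separated tokens, and splitting the stringified set keeps the surrounding brackets and leading spaces, so array_distinct cannot match the duplicates (hence the nested [[...]]). One way to stay close to this approach is sketched below; it is not from the original post, and the Type_joined column name is made up. It joins the collected strings back into one comma-separated string, splits it once into tokens, and then de-duplicates.

# Sketch: join the collected Type strings into one comma-separated string,
# split it into individual tokens, then de-duplicate.
df_fixed = df.groupBy("Data1", "Data2", "Data3"). \
    agg(func.concat_ws(",", func.collect_list("Type")).alias("Type_joined")). \
    withColumn("Type_list", func.array_distinct(func.split("Type_joined", ","))). \
    drop("Type_joined")

df_fixed.show(truncate=False)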

CodePudding user response:

First split the Type string into an array, then apply array_distinct(flatten(collect_list())) per group:

# data_sdf is the question's dataframe (df); Spark column names are case-insensitive by default.
# Split each Type string into an array, collect those arrays per group,
# flatten them into a single array, and de-duplicate with array_distinct.
data_sdf. \
    withColumn('type_split', func.split('type', ',')). \
    groupBy('data1','data2','data3'). \
    agg(func.array_distinct(func.flatten(func.collect_list('type_split'))).alias('type_list')). \
    show(truncate=False)

# +-----+-----+-----+----------------------------------------------------+
# |data1|data2|data3|type_list                                           |
# +-----+-----+-----+----------------------------------------------------+
# |xxx  |yyy  |zzz  |[fl1.variant, fl2.variant, fl3.control, fl2.neither]|
# |xxx1 |yyy1 |zzz1 |[fl4.variant, fl2.variant, fl3.control, fl3.variant]|
# +-----+-----+-----+----------------------------------------------------+
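
An equivalent variant (a sketch, not part of the original answer) explodes the split tokens and lets collect_set do the de-duplication; note that collect_set does not guarantee the order of the resulting array.

# Explode each comma-separated Type value into one row per token,
# then collect the distinct tokens per (data1, data2, data3) group.
data_sdf. \
    withColumn('type_item', func.explode(func.split('type', ','))). \
    groupBy('data1','data2','data3'). \
    agg(func.collect_set('type_item').alias('type_list')). \
    show(truncate=False)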