I have a dataframe like the following:
rdd = spark.sparkContext.parallelize([
(22,'fl1.variant,fl2.variant,fl3.control','xxx','yyy','zzz'),
(23,'fl1.variant,fl2.neither,fl3.control','xxx','yyy','zzz'),
(24,'fl4.variant,fl2.variant,fl4.variant','xxx1','yyy1','zzz1'),
(25,'fl3.control,fl3.control,fl3.variant','xxx1','yyy1','zzz1')
])
df = rdd.toDF(['Date','Type','Data1','Data2','Data3'])
Sample data:
Date | Type | Data1 | Data2 | Data3 |
---|---|---|---|---|
22 | fl1.variant,fl2.variant,fl3.control | xxx | yyy | zzz |
23 | fl1.variant,fl2.neither,fl3.control | xxx | yyy | zzz |
24 | fl4.variant,fl2.variant,fl4.variant | xxx1 | yyy1 | zzz1 |
25 | fl3.control,fl3.control,fl3.variant | xxx1 | yyy1 | zzz1 |
I need to identify the distinct values in the Type column, grouped by the Data1, Data2 and Data3 columns. The Type column is a comma-separated string. For each combination of Data1, Data2 and Data3, I need to combine all the rows and collect the unique values from the Type column.
Expected output:
Data1 | Data2 | Data3 | Type_list |
---|---|---|---|
xxx | yyy | zzz | [fl1.variant, fl2.variant, fl3.control, fl2.neither] |
xxx1 | yyy1 | zzz1 | [fl4.variant, fl2.variant, fl3.control, fl3.variant] |
I tried the following, but could not get the expected distinct values:
from pyspark.sql import functions as func

df1 = df.sort("Data1", "Data2", "Data3", "Type"). \
    groupBy("Data1", "Data2", "Data3"). \
    agg(func.collect_set("Type").cast("string")). \
    withColumnRenamed("CAST(collect_set(Type) AS STRING)", "Type_list")
Data1 | Data2 | Data3 | Type_list |
---|---|---|---|
xxx | yyy | zzz | [fl1.variant,fl2.variant,fl3.control, fl1.variant, fl2.neither,fl3.control] |
xxx1 | yyy1 | zzz1 | [fl4.variant,fl2.variant,fl4.variant, fl3.control,fl3.control,fl3.variant] |
df2 = df1.select("Data1", "Data2", "Data3", func.array_distinct(func.split("Type_list", ",")))
Data1 | Data2 | Data3 | array_distinct(split(Type_list, ,, -1)) |
---|---|---|---|
xxx | yyy | zzz | [[fl1.variant, fl2.variant, fl3.control, fl1.variant, fl2.neither, fl3.control]] |
xxx1 | yyy1 | zzz1 | [[fl4.variant, fl2.variant, fl4.variant, fl3.control, fl3.control, fl3.variant]] |
CodePudding user response:
collect_set de-duplicates the whole Type strings, not the individual comma-separated values, so the duplicates survive in your result. Instead, first split the Type column into an array, then take array_distinct(flatten(collect_list())) per group:
df. \
    withColumn('type_split', func.split('Type', ',')). \
    groupBy('Data1', 'Data2', 'Data3'). \
    agg(func.array_distinct(func.flatten(func.collect_list('type_split'))).alias('Type_list')). \
    show(truncate=False)
# +-----+-----+-----+----------------------------------------------------+
# |Data1|Data2|Data3|Type_list                                           |
# +-----+-----+-----+----------------------------------------------------+
# |xxx  |yyy  |zzz  |[fl1.variant, fl2.variant, fl3.control, fl2.neither]|
# |xxx1 |yyy1 |zzz1 |[fl4.variant, fl2.variant, fl3.control, fl3.variant]|
# +-----+-----+-----+----------------------------------------------------+
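Note that flatten and array_distinct were only added in Spark 2.4. On older versions, a rough equivalent is to explode the split values so every flag lands on its own row and let collect_set handle the de-duplication. This is a sketch, assuming the same df and func as above and introducing a hypothetical type_single helper column:

# sketch for Spark < 2.4: explode each flag onto its own row,
# then collect_set de-duplicates within each group
df. \
    withColumn('type_single', func.explode(func.split('Type', ','))). \
    groupBy('Data1', 'Data2', 'Data3'). \
    agg(func.collect_set('type_single').alias('Type_list')). \
    show(truncate=False)

One caveat: collect_set does not guarantee element order, so wrap the aggregated column in func.sort_array(...) if a stable ordering matters to you.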