I have the query bellow:
finvResultOutput1
.withColumn("offersAtSite",when(col("activeIOsAtSite")==="Y",
finvResultOutput1.select(col("OfferSpecification_displayLabel"),col("Site_siteId").alias("siteid"))
.where(col("Site_siteId")===col("siteid"))
.drop("siteid")
.collect()
))
I want to add a new column to my data frame called offersAtSite
, this column should be added when another column called activeIOsAtSite
equals "Y"
.
The value of this new column should be a list without duplicates of the elements of column OfferSpecification_displayLabel
where the Site_siteId
equals the current Site_siteId
of the row we are iterating over.
Could anyone give me a better approach to do that, because the query that I made is taking too much time and I don't know if it is working
Bellow is an example of what I want to achieve:
CodePudding user response:
First, you can groupBy
then collect_set
(set does not contain duplicates) on your main table:
val grouped = df.groupBy("Site_siteId").agg(collect_set("OfferSpecifications_displayLabel").as("offerAtSite"))
We get:
----------- ------------------------
|Site_siteId|offerAtSite |
----------- ------------------------
|site_id_3 |[site3_DL_1, site3_DL_2]|
|site_id_4 |[site4_DL_1] |
|site_id_2 |[site2_DL_1] |
|site_id_1 |[site1_DL_1, site1_DL_2]|
----------- ------------------------
Then, we join our df
table with grouped
and overwrite offerAtSite
to only have values for Y
value:
df.join(grouped, Seq("Site_siteId"), "left")
.withColumn("offerAtSite", when(col("activeIOsAtSite").equalTo("Y"), col("offerAtSite")))
Final result:
----------- --------------- -------------------------------- ------------------------
|Site_siteId|activeIOsAtSite|OfferSpecifications_displayLabel|offerAtSite |
----------- --------------- -------------------------------- ------------------------
|site_id_3 |Y |site3_DL_1 |[site3_DL_1, site3_DL_2]|
|site_id_3 |Y |site3_DL_2 |[site3_DL_1, site3_DL_2]|
|site_id_4 |N |site4_DL_1 |null |
|site_id_2 |N |site2_DL_1 |null |
|site_id_1 |Y |site1_DL_1 |[site1_DL_1, site1_DL_2]|
|site_id_1 |Y |site1_DL_2 |[site1_DL_1, site1_DL_2]|
----------- --------------- -------------------------------- ------------------------
This should work better, good luck!