I have a table as below -
------ --------------------------------- --------------- ------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|cus_id|cus_nm |pur_region |purchase_dt |pur_details |
------ --------------------------------- --------------- ------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|0121 |Johnny |USA |2023-01-12 |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0137 |Kevin J Brown |USA |2022-05-31 |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0168 |Patrikson |UK |2022-11-08 |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0365 |Ryan Ray |USA |2021-10-12 |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|2600 |Jay |AUS |2022-11-11 |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}] |
------ --------------------------------- --------------- ------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The schema of this table is -
root
|-- cus_id: string (nullable = true)
|-- cus_nm: string (nullable = true)
|-- pur_region: string (nullable = true)
|-- purchase_dt: string (nullable = true)
|-- pur_details: string (nullable = true)
I would like to split the column pur_details and extract check and sale_price_gap as separate columns. Note that pur_details may or may not contain check and sale_price_gap, so if a key is not present in pur_details then the corresponding new column value should be null.
Sample expected output -
------ --------------------------------- --------------- ------------- ---------- --------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|cus_id|cus_nm |pur_region |purchase_dt |check |sale_price_gap |pur_details |
------ --------------------------------- --------------- ------------- ---------- --------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|0121 |Johnny |USA |2023-01-12 |sale_item |upto 30% on_sale |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0137 |Kevin J Brown |USA |2022-05-31 |sale_item |upto 30% on_sale |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0168 |Patrikson |UK |2022-11-08 |sale_item |upto 30% on_sale |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|0365 |Ryan Ray |USA |2021-10-12 |sale_item |upto 30% on_sale |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}] |
|2600 |Jay |AUS |2022-11-11 |sale_item |BOGO 20% off |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}] |
------ --------------------------------- --------------- ------------- ---------- --------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Can someone please help me with the best and most efficient way to do this, given that I have millions of rows?
Thanks in advance.
CodePudding user response:
With a few assumptions, such as there being no "," inside the values themselves (commas only separate the fields), I hope this code helps you get an idea of how to achieve what you want:
df.withColumn("pur_details_split", split(col("pur_details"), ","))
.withColumn("check", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("check")), 1), "="), 2))
.withColumn("sale_price_gap", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("sale_price_gap")), 1), "="), 2))
.select("check", "sale_price_gap")
.show(false)
+---------+----------------+
|check    |sale_price_gap  |
+---------+----------------+
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|null            |
+---------+----------------+
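If you want to reproduce your expected output (every original column plus the two new ones), the same expressions can be factored into a small helper and the intermediate array column skipped entirely. This is only a minimal sketch of that variant; extractValue is just a name I made up for the factored-out expression:

import org.apache.spark.sql.functions._

// value of the given key, or null when the key is absent from pur_details
def extractValue(key: String) =
  element_at(split(element_at(filter(split(col("pur_details"), ","), x => trim(x).startsWith(key)), 1), "="), 2)

df.withColumn("check", extractValue("check"))
  .withColumn("sale_price_gap", extractValue("sale_price_gap"))
  .show(false)   // the two new columns are appended after the original ones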
PS 1. I can't remember the Python equivalents of these Spark API functions, but I'm pretty sure they are very similar to the Scala ones.
PS 2. I removed sale_price_gap from the last record of the original dataframe to demonstrate the scenario where the value is not present.
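If a regex is acceptable, another option is to pull each value straight out of pur_details with regexp_extract, which avoids building the intermediate array and may matter at millions of rows. This is only a rough sketch: keyValue is just an illustrative helper name, the pattern assumes no other key ends with the same name, and note that regexp_extract returns an empty string rather than null on no match, so that has to be mapped to null explicitly:

import org.apache.spark.sql.functions._

// capture everything after "key=" up to the next comma or closing brace
def keyValue(key: String) = {
  val v = regexp_extract(col("pur_details"), key + "=([^,}]*)", 1)
  when(v =!= "", v)   // no otherwise(...) means the result is null when the key is absent
}

df.withColumn("check", keyValue("check"))
  .withColumn("sale_price_gap", keyValue("sale_price_gap"))
  .show(false)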