Home > Enterprise >  Split string column into multiple columns in pyspark
Split string column into multiple columns in pyspark

Time:01-24

I have a table as below -

 ------ --------------------------------- --------------- ------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|cus_id|cus_nm                           |pur_region     |purchase_dt  |pur_details                                                                                                                                                                                                                            |
 ------ --------------------------------- --------------- ------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|0121  |Johnny                           |USA            |2023-01-12   |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0137  |Kevin J Brown                    |USA            |2022-05-31   |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0168  |Patrikson                        |UK             |2022-11-08   |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0365  |Ryan Ray                         |USA            |2021-10-12   |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|2600  |Jay                              |AUS            |2022-11-11   |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}]                                         |
 ------ --------------------------------- --------------- ------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 

The schema of this table is -

root
 |-- cus_id: string (nullable = true)
 |-- cus_nm: string (nullable = true)
 |-- pur_region: string (nullable = true)
 |-- purchase_dt: string (nullable = true)
 |-- pur_details: string (nullable = true)

I would like to split the column pur_details and extract check and sale_price_gap as separate columns. Note that the pur_details may or may not have check and sale_price_gap, so if it's not present in pur_details then the new column values should be null.

Sample expected output -

 ------ --------------------------------- --------------- ------------- ---------- --------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|cus_id|cus_nm                           |pur_region     |purchase_dt  |check     |sale_price_gap       |pur_details                                                                                                                                                                                                                            |
 ------ --------------------------------- --------------- ------------- ---------- --------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
|0121  |Johnny                           |USA            |2023-01-12   |sale_item |upto 30% on_sale     |[{product_id=XA8096521JKAZ42F123, product_name=luxury_watch_collection_rolex_GZ, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0137  |Kevin J Brown                    |USA            |2022-05-31   |sale_item |upto 30% on_sale     |[{product_id=XA14567JKR700135126, product_name=luxury_watch_collection_rolex_LA, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0168  |Patrikson                        |UK             |2022-11-08   |sale_item |upto 30% on_sale     |[{product_id=XAHJYZK906423623571, product_name=luxury_watch_collection_gucci_09, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|0365  |Ryan Ray                         |USA            |2021-10-12   |sale_item |upto 30% on_sale     |[{product_id=XAOPLKR7520HJV00109, product_name=luxury_watch_collection_vancleef, description=mens watch round dail on sale, check=sale_item, tag=watch, sale_price_gap=upto 30% on_sale, sale_vendor=mrporter.com, action=entry}]      |
|2600  |Jay                              |AUS            |2022-11-11   |sale_item |BOGO 20% off         |[{product_id=XA096534987GGHJLRAC, product_name=sports_eyewear, description=athlete sports sun glasses, check=sale_item, sale_price_gap=BOGO 20% off, sale_vendor=mrporter.com, action=report}]                                         |
 ------ --------------------------------- --------------- ------------- ---------- --------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 

Can someone please help me with the best and efficient way when I have millions of rows.

Thanks in advance.

CodePudding user response:

With a few assumptions like there is no , on strings except those separate fields, I hope this code helps you to get an idea to achieve what you want:

df.withColumn("pur_details_split", split(col("pur_details"), ","))
    .withColumn("check", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("check")), 1), "="), 2))
    .withColumn("sale_price_gap", element_at(split(element_at(filter(col("pur_details_split"), x => trim(x).startsWith("sale_price_gap")), 1), "="), 2))
    .select("check", "sale_price_gap")
    .show(false)

 --------- ---------------- 
|check    |sale_price_gap  |
 --------- ---------------- 
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|upto 30% on_sale|
|sale_item|null            |
 --------- ---------------- 


Ps 1. I can't remember the python equivalencies of Spark API, but I'm pretty sure they are similar to Scala ones.

Ps 2. I removed sale_price_gap from the last record on the original dataframe to support the scenario of non-existent value.

  • Related