Home > database >  Split and extract substring from string
Split and extract substring from string

Time:07-29

I have a data frame like below in pyspark

df = spark.createDataFrame(
[
('14|100|00|123|567','A',25),
('13|200|20|323|467','B',24),
('15|300|30|223|667','A',20)],("rust", "name", "value"))

I want to achieve below

1) Extract substring from rust column between 1st and 2nd | as new column
2) Extract substring from rust column between 2nd and 3rd | as new column
3) Extract substring from rust column after 3rd | as new column

I have tried like below

df1 = df.select("*", f.substring_index(df.rust, '|', 2))
df1.show()
 ----------------- ---- ----- --------------------------- 
|             rust|name|value|substring_index(rust, |, 2)|
 ----------------- ---- ----- --------------------------- 
|14|100|00|123|567|   A|   25|                     14|100|
|13|200|20|323|467|   B|   24|                     13|200|
|15|300|30|223|667|   A|   20|                     15|300|
 ----------------- ---- ----- --------------------------- 

Expected_result is below

 ----------------- ---- ----- ------------- ------------ ----------- 
|             rust|name|value|second_string|third_string|last_string|
 ----------------- ---- ----- ------------- ------------ ----------- 
|14|100|00|123|567|   A|   25|          100|          00|    123|567|
|13|200|20|323|467|   B|   24|          200|          20|    323|467|
|15|300|30|223|667|   A|   20|          300|          30|    223|667|
 ----------------- ---- ----- ------------- ------------ ----------- 

How can I achieve what I want

CodePudding user response:

df = spark.createDataFrame(
[
('14|100|00|123|567','A',25),
('13|200|20|323|467','B',24),
('15|300|30|223|667','A',20)],("rust", "name", "value"))

(
    df
    .withColumn("rust_list", split(col('rust'), '\|')) # [14, 100, 00, 123, 567]
    .withColumn('second_string', col('rust_list')[1])
    .withColumn('third_string', col('rust_list')[2])
    .withColumn('last_string', concat_ws('|', col('rust_list')[3], col('rust_list')[4]))
    .drop('rust_list')
    .show(10, False)
)
#  ----------------- ---- ----- ------------- ------------ ----------- 
# |rust             |name|value|second_string|third_string|last_string|
#  ----------------- ---- ----- ------------- ------------ ----------- 
# |14|100|00|123|567|A   |25   |100          |00          |123|567    |
# |13|200|20|323|467|B   |24   |200          |20          |323|467    |
# |15|300|30|223|667|A   |20   |300          |30          |223|667    |
#  ----------------- ---- ----- ------------- ------------ ----------- 
  • Related