Home > database >  extract substring before first occurrence and substring after last occurrence of a delimiter in Pysp
extract substring before first occurrence and substring after last occurrence of a delimiter in Pysp

Time:02-16

I have a data frame like below in pyspark

df = spark.createDataFrame(
[
('14_100_00','A',25),
('13_100_00','B',24),
('15_100_00','A',20),
('150_100','C',21),
('16','A',20),
('1634_100_00_01','B',22),
('1_100_00','C',23),
('18_100_00','D',24)],("rust", "name", "value"))

df.show()
 -------------- ---- ----- 
|          rust|name|value|
 -------------- ---- ----- 
|     14_100_00|   A|   25|
|     13_100_00|   B|   24|
|     15_100_00|   A|   20|
|       150_100|   C|   21|
|            16|   A|   20|
|1634_100_00_01|   B|   22|
|      1_100_00|   C|   23|
|     18_100_00|   D|   24|
 -------------- ---- ----- 

I am trying to create a new column using the rust column using below conditions

1) extract anything before 1st underscore
2) extract anything after the last underscore
3) concatenate the above two values using tilda(~)

If no underscores in the column then have column as is

I have tried like below

from pyspark.sql.functions import substring_index
df1 = df.select("*", f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1)).alias("extract"))
df1.show()
 -------------- ---- ----- ------- 
|          rust|name|value|extract|
 -------------- ---- ----- ------- 
|     14_100_00|   A|   25|  14~00|
|     13_100_00|   B|   24|  13~00|
|     15_100_00|   A|   20|  15~00|
|       150_100|   C|   21|150~100|
|            16|   A|   20|  16~16|
|1634_100_00_01|   B|   22|1634~01|
|      1_100_00|   C|   23|   1~00|
|     18_100_00|   D|   24|  18~00|
 -------------- ---- ----- ------- 

expected result:

 -------------- ---- ----- ------- 
|          rust|name|value|extract|
 -------------- ---- ----- ------- 
|     14_100_00|   A|   25|  14~00|
|     13_100_00|   B|   24|  13~00|
|     15_100_00|   A|   20|  15~00|
|       150_100|   C|   21|150~100|
|            16|   A|   20|     16|
|1634_100_00_01|   B|   22|1634~01|
|      1_100_00|   C|   23|   1~00|
|     18_100_00|   D|   24|  18~00|
 -------------- ---- ----- ------- 

How can I achieve what I want

CodePudding user response:

Use the instr function to determine whether the rust column contains _, and then use the when function to process.

df1 = df.select("*",
                f.when(f.instr(df.rust, '_') > 0,
                       f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1))
                       )
                .otherwise(df.rust)
                .alias("extract")
                )
  • Related