I have a data frame like below in pyspark
df = spark.createDataFrame(
[
('14_100_00','A',25),
('13_100_00','B',24),
('15_100_00','A',20),
('150_100','C',21),
('16','A',20),
('1634_100_00_01','B',22),
('1_100_00','C',23),
('18_100_00','D',24)],("rust", "name", "value"))
df.show()
-------------- ---- -----
| rust|name|value|
-------------- ---- -----
| 14_100_00| A| 25|
| 13_100_00| B| 24|
| 15_100_00| A| 20|
| 150_100| C| 21|
| 16| A| 20|
|1634_100_00_01| B| 22|
| 1_100_00| C| 23|
| 18_100_00| D| 24|
-------------- ---- -----
I am trying to create a new column using the rust
column using below conditions
1) extract anything before 1st underscore
2) extract anything after the last underscore
3) concatenate the above two values using tilda(~)
If no underscores in the column then have column as is
I have tried like below
from pyspark.sql.functions import substring_index
df1 = df.select("*", f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1)).alias("extract"))
df1.show()
-------------- ---- ----- -------
| rust|name|value|extract|
-------------- ---- ----- -------
| 14_100_00| A| 25| 14~00|
| 13_100_00| B| 24| 13~00|
| 15_100_00| A| 20| 15~00|
| 150_100| C| 21|150~100|
| 16| A| 20| 16~16|
|1634_100_00_01| B| 22|1634~01|
| 1_100_00| C| 23| 1~00|
| 18_100_00| D| 24| 18~00|
-------------- ---- ----- -------
expected result:
-------------- ---- ----- -------
| rust|name|value|extract|
-------------- ---- ----- -------
| 14_100_00| A| 25| 14~00|
| 13_100_00| B| 24| 13~00|
| 15_100_00| A| 20| 15~00|
| 150_100| C| 21|150~100|
| 16| A| 20| 16|
|1634_100_00_01| B| 22|1634~01|
| 1_100_00| C| 23| 1~00|
| 18_100_00| D| 24| 18~00|
-------------- ---- ----- -------
How can I achieve what I want
CodePudding user response:
Use the instr
function to determine whether the rust
column contains _
, and then use the when
function to process.
df1 = df.select("*",
f.when(f.instr(df.rust, '_') > 0,
f.concat(f.substring_index(df.rust, '_', 1), f.lit('~'), f.substring_index(df.rust, '_', -1))
)
.otherwise(df.rust)
.alias("extract")
)