I have a data frame like below in pyspark
df = spark.createDataFrame(
[
('14|100|00|123|567','A',25),
('13|200|20|323|467','B',24),
('15|300|30|223|667','A',20)],("rust", "name", "value"))
I want to achieve below
1) Extract substring from rust column between 1st and 2nd | as new column
2) Extract substring from rust column between 2nd and 3rd | as new column
3) Extract substring from rust column after 3rd | as new column
I have tried like below
df1 = df.select("*", f.substring_index(df.rust, '|', 2))
df1.show()
----------------- ---- ----- ---------------------------
| rust|name|value|substring_index(rust, |, 2)|
----------------- ---- ----- ---------------------------
|14|100|00|123|567| A| 25| 14|100|
|13|200|20|323|467| B| 24| 13|200|
|15|300|30|223|667| A| 20| 15|300|
----------------- ---- ----- ---------------------------
Expected_result is below
----------------- ---- ----- ------------- ------------ -----------
| rust|name|value|second_string|third_string|last_string|
----------------- ---- ----- ------------- ------------ -----------
|14|100|00|123|567| A| 25| 100| 00| 123|567|
|13|200|20|323|467| B| 24| 200| 20| 323|467|
|15|300|30|223|667| A| 20| 300| 30| 223|667|
----------------- ---- ----- ------------- ------------ -----------
How can I achieve what I want
CodePudding user response:
df = spark.createDataFrame(
[
('14|100|00|123|567','A',25),
('13|200|20|323|467','B',24),
('15|300|30|223|667','A',20)],("rust", "name", "value"))
(
df
.withColumn("rust_list", split(col('rust'), '\|')) # [14, 100, 00, 123, 567]
.withColumn('second_string', col('rust_list')[1])
.withColumn('third_string', col('rust_list')[2])
.withColumn('last_string', concat_ws('|', col('rust_list')[3], col('rust_list')[4]))
.drop('rust_list')
.show(10, False)
)
# ----------------- ---- ----- ------------- ------------ -----------
# |rust |name|value|second_string|third_string|last_string|
# ----------------- ---- ----- ------------- ------------ -----------
# |14|100|00|123|567|A |25 |100 |00 |123|567 |
# |13|200|20|323|467|B |24 |200 |20 |323|467 |
# |15|300|30|223|667|A |20 |300 |30 |223|667 |
# ----------------- ---- ----- ------------- ------------ -----------