Using PySpark, I need to parse a single dataframe column into two columns.
Input data:
file name |
---|
/level1/level2/level3/file1.ext |
/level1/file1000.ext |
/level1/level2/file20.ext |
Output:
file name | path |
---|---|
file1.ext | /level1/level2/level3/ |
file1000.ext | /level1/ |
file20.ext | /level1/level2/ |
I know I could use substring with hard coded positions, but this is not a good case for hard coding as the length of the file name values may change from row to row, as shown in the example.
However, I know that I need to break the input string after the last slash (/). This is a rule to help avoid hard coding a specific position for splitting the input string.
CodePudding user response:
There are several ways to do it with regex functions, or with the split method.
from pyspark.sql.functions import split, element_at, regexp_extract
df \
.withColumn("file_name", element_at(split("raw", "/"), -1) ) \
.withColumn("file_name2", regexp_extract("raw", "(?<=/)[^/] $", 0)) \
.withColumn("path", regexp_extract("raw", "^.*/", 0)) \
.show(truncate=False)
------------------------------- ------------ ------------ ----------------------
|raw |file_name |file_name2 |path |
------------------------------- ------------ ------------ ----------------------
|/level1/level2/level3/file1.ext|file1.ext |file1.ext |/level1/level2/level3/|
|/level1/file1000.ext |file1000.ext|file1000.ext|/level1/ |
|/level1/level2/file20.ext |file20.ext |file20.ext |/level1/level2/ |
------------------------------- ------------ ------------ ----------------------
CodePudding user response:
A couple of other options:
from pyspark.sql import functions as F
df=spark.createDataFrame(
[('/level1/level2/level3/file1.ext',),
('/level1/file1000.ext',),
('/level1/level2/file20.ext',)],
['file_name']
)
df = df.withColumn('file', F.substring_index('file_name', '/', -1))
df = df.withColumn('path', F.expr('left(file_name, length(file_name) - length(file))'))
df.show(truncate=0)
# ------------------------------- ------------ ----------------------
# |file_name |file |path |
# ------------------------------- ------------ ----------------------
# |/level1/level2/level3/file1.ext|file1.ext |/level1/level2/level3/|
# |/level1/file1000.ext |file1000.ext|/level1/ |
# |/level1/level2/file20.ext |file20.ext |/level1/level2/ |
# ------------------------------- ------------ ----------------------