Home > Mobile >  Splitting a string column into into 2 in PySpark
Splitting a string column into into 2 in PySpark

Time:06-07

Using PySpark, I need to parse a single dataframe column into two columns.

Input data:

file name
/level1/level2/level3/file1.ext
/level1/file1000.ext
/level1/level2/file20.ext

Output:

file name path
file1.ext /level1/level2/level3/
file1000.ext /level1/
file20.ext /level1/level2/

I know I could use substring with hard coded positions, but this is not a good case for hard coding as the length of the file name values may change from row to row, as shown in the example.

However, I know that I need to break the input string after the last slash (/). This is a rule to help avoid hard coding a specific position for splitting the input string.

CodePudding user response:

There are several ways to do it with regex functions, or with the split method.

from pyspark.sql.functions import split, element_at, regexp_extract

df \

.withColumn("file_name", element_at(split("raw", "/"), -1) ) \
.withColumn("file_name2", regexp_extract("raw", "(?<=/)[^/] $", 0)) \
.withColumn("path", regexp_extract("raw", "^.*/", 0)) \
.show(truncate=False)

 ------------------------------- ------------ ------------ ---------------------- 
|raw                            |file_name   |file_name2  |path                  |
 ------------------------------- ------------ ------------ ---------------------- 
|/level1/level2/level3/file1.ext|file1.ext   |file1.ext   |/level1/level2/level3/|
|/level1/file1000.ext           |file1000.ext|file1000.ext|/level1/              |
|/level1/level2/file20.ext      |file20.ext  |file20.ext  |/level1/level2/       |
 ------------------------------- ------------ ------------ ---------------------- 

CodePudding user response:

A couple of other options:

from pyspark.sql import functions as F
df=spark.createDataFrame(
    [('/level1/level2/level3/file1.ext',),
     ('/level1/file1000.ext',),
     ('/level1/level2/file20.ext',)],
    ['file_name']
)
df = df.withColumn('file', F.substring_index('file_name', '/', -1))
df = df.withColumn('path', F.expr('left(file_name, length(file_name) - length(file))'))
df.show(truncate=0)
#  ------------------------------- ------------ ---------------------- 
# |file_name                      |file        |path                  |
#  ------------------------------- ------------ ---------------------- 
# |/level1/level2/level3/file1.ext|file1.ext   |/level1/level2/level3/|
# |/level1/file1000.ext           |file1000.ext|/level1/              |
# |/level1/level2/file20.ext      |file20.ext  |/level1/level2/       |
#  ------------------------------- ------------ ---------------------- 
  • Related