I have the following DF:
name
Shane
Judith
Rick Grimes
I want to generate the following one:

name          substr
Shane         hane
Judith        udith
Rick Grimes   ick Grimes
I tried:
F.substring(F.col('name'), 1)
F.substring(F.col('name'), 1, None)
F.substring(F.col('name'), 1, F.length(F.col('name')))
But all of those methods throw an error.
How can I get my desired output?
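For reference, this is roughly how the DataFrame can be reproduced (a sketch assuming a running SparkSession named spark):

from pyspark.sql import functions as F  # functions are referred to as F below

df = spark.createDataFrame(
    [('Shane',), ('Judith',), ('Rick Grimes',)],
    ['name'],
)
df.show()
# +-----------+
# |       name|
# +-----------+
# |      Shane|
# |     Judith|
# |Rick Grimes|
# +-----------+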
CodePudding user response:
The Spark documentation for substring states that the index is not zero-based:

"The position is not zero based, but 1 based index."

See pyspark.sql.functions.substring. So to drop the first character, start at position 2:
df.withColumn('sub_string', F.substring('name', 2, F.length('name')))
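A complete sketch of this approach on the example DataFrame (note: passing F.length(...) as the length argument is only accepted by recent PySpark releases; older ones require a plain int there, see the other answers for version-independent options):

from pyspark.sql import functions as F

# Start at position 2 (1-based) and keep up to length(name) characters.
# Passing a Column (F.length) for the length only works on recent PySpark versions.
df.withColumn('sub_string', F.substring('name', 2, F.length('name'))).show()
# +-----------+----------+
# |       name|sub_string|
# +-----------+----------+
# |      Shane|      hane|
# |     Judith|     udith|
# |Rick Grimes|ick Grimes|
# +-----------+----------+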
CodePudding user response:
You can use expr to get the desired output:

df.withColumn('substr', F.expr("substring(name, 2, length(name) - 1)"))
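Applied to the example DataFrame (a sketch assuming pyspark.sql.functions is imported as F), this produces the desired substr column:

from pyspark.sql import functions as F

# length(name) - 1 is exactly the number of characters after the first one.
df.withColumn('substr', F.expr("substring(name, 2, length(name) - 1)")).show()
# Produces: hane, udith, ick Grimes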
CodePudding user response:
F.substring takes plain integers for the position and length arguments (at least on older PySpark releases), so it only works if you pass integers:

F.substring('name', 2, 5)                  # works: both position and length are ints
F.substring('name', 2, F.length('name'))   # doesn't work: substring doesn't take a Column (F.length()) here
If you would like to pass a dynamic value, you can use either SQL's substring (via F.expr) or Column.substr.
SQL
F.expr('substring(name, 2, length(name))')
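For example, a sketch using this expression inside select on the example df:

from pyspark.sql import functions as F

# F.expr returns a Column, so it can be used anywhere a Column is expected.
df.select('name', F.expr('substring(name, 2, length(name))').alias('substr')).show()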
Column.substr

This takes either a Column (many PySpark functions return Column, including F.length) or an int. However, startPos and length have to be of the same type: if you need to pass a Column for the length, use F.lit for the startPos.
F.col('name').substr(F.lit(2), F.length('name'))
# Or, if you pass integers for both:
# F.col('name').substr(2, 5)
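Put together, a sketch of this on the example df:

from pyspark.sql import functions as F

# Both arguments are Columns here: F.lit(2) for startPos and F.length for length.
df.withColumn('substr', F.col('name').substr(F.lit(2), F.length('name'))).show()
# Produces: hane, udith, ick Grimes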