Get a substring from pyspark DF


I have the following DF

name
Shane
Judith
Rick Grimes

I want to generate the following one

name           substr
Shane          hane
Judith         udith
Rick Grimes    ick Grimes
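
For reference, a minimal sketch that recreates this sample DataFrame (assuming an active SparkSession; the variable names spark and df are just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('Shane',), ('Judith',), ('Rick Grimes',)], ['name'])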

I tried:

F.substring(F.col('name'), 1)
F.substring(F.col('name'), 1, None)
F.substring(F.col('name'), 1, F.length(F.col('name')))

But all of those attempts throw an error.

How can I get my desired output?

CodePudding user response:

The Spark documentation for substring states that the index is not zero-based:

The position is not zero based, but 1 based index.

pyspark.sql.functions.substring

df.withColumn('sub_string', F.substring('name', 2, F.length('name')))
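
Note: whether substring accepts F.length('name') as the len argument depends on your Spark version; older releases only take plain integers for pos and len (as pointed out in the last answer below), in which case use expr or Column.substr instead.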

CodePudding user response:

You can use expr to get the desired output:

df.withColumn('substr', F.expr("substring(name, 2, length(name) - 1)"))
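
Here length(name) - 1 is exactly the number of characters left after dropping the first one; passing length(name) also works, since substring simply stops at the end of the string.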

CodePudding user response:

F.substring takes integers for its pos and len arguments, so it only works if you pass plain integers.

# This works: both pos and len are plain integers.
F.substring('name', 2, 5)

# This doesn't work: substring doesn't accept a Column (such as F.length('name')) for len.
F.substring('name', 2, F.length('name'))

If you would like to pass a dynamic value, you can use either SQL's substring (via expr) or Column.substr.

SQL

F.expr('substring(name, 2, length(name))')

Column.substr(startPos, length)

This accepts either a Column (many PySpark functions return a Column, including F.length) or an int. However, startPos and length must be of the same type; e.g. if you need to pass a Column for length, use lit for startPos.

F.col('name').substr(F.lit(2), F.length('name'))

# If you pass integers for both:
# F.col('name').substr(2, 5)
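
Putting it together, a minimal end-to-end sketch (same assumptions as the setup above: an active SparkSession and the sample df) that produces the desired substr column with Column.substr, with the SQL-expression variant shown for comparison:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('Shane',), ('Judith',), ('Rick Grimes',)], ['name'])

# Start at 1-based position 2 and keep everything up to the end of the string.
df.withColumn('substr', F.col('name').substr(F.lit(2), F.length('name'))).show()

# +-----------+----------+
# |       name|    substr|
# +-----------+----------+
# |      Shane|      hane|
# |     Judith|     udith|
# |Rick Grimes|ick Grimes|
# +-----------+----------+

# The SQL form gives the same result:
# df.withColumn('substr', F.expr('substring(name, 2, length(name))'))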