Home > Net >  Finding index position of a character in a string and use it in substring functions in dataframe
Finding index position of a character in a string and use it in substring functions in dataframe

Time:08-24

data frame :

enter image description here

I need to truncate the String column value based on the # position. The result should be :

enter image description here

I am trying this code but it is throwing a TypeError :

enter image description here

Though I can achieve the desired result using SparkSql or by creating a function in Python, is there any way that it can be done in pyspark itself?

CodePudding user response:

Another way is to use locate within the substr function, but this can only be used with expr.

spark.sparkContext.parallelize([('WALGREENS #6411',), ('CVS/PHARMACY #08864',), ('CVS',)]).toDF(['acct']). \
    withColumn('acct_name', 
               func.when(func.col('acct').like('%#%') == False, func.col('acct')).
               otherwise(func.expr('substr(acct, 1, locate("#", acct)-2)'))
               ). \
    show()

#  ------------------- ------------ 
# |               acct|   acct_name|
#  ------------------- ------------ 
# |    WALGREENS #6411|   WALGREENS|
# |CVS/PHARMACY #08864|CVS/PHARMACY|
# |                CVS|         CVS|
#  ------------------- ------------ 

CodePudding user response:

You can use split() function to achieve this. I used split function with delimiter as # to get the required value and removed leading spaces with rtrim().

  • My input:
 --- ------------------- 
| id|             string|
 --- ------------------- 
|  1|    WALGREENS #6411|
|  2|CVS/PHARMACY #08864|
|  3|                CVS|
|  4|          WALGREENS|
|  5|         Test #1234|
 --- ------------------- 
  • Try using the following code:
from pyspark.sql.functions import split,col,rtrim

df = df.withColumn("New_string", split(col("string"), "#").getItem(0))
#you can also use substring_index() 
#df.withColumn("result", substring_index(df['string'], '#',1))

df = df.withColumn('New_string', rtrim(df['New_string']))

df.show()
  • Output: enter image description here
  • Related