I have the following data frame:
I need to truncate the value in the string column based on the position of the '#'. The result should be:
I am trying this code, but it throws a TypeError:
Though I can achieve the desired result with Spark SQL or by writing a plain Python function, is there any way to do it in PySpark itself?
CodePudding user response:
Another way is to use locate within the substr function, but this can only be used with expr.
from pyspark.sql import functions as func

# assumes an active SparkSession named `spark`, as in the pyspark shell
spark.sparkContext.parallelize([('WALGREENS #6411',), ('CVS/PHARMACY #08864',), ('CVS',)]).toDF(['acct']). \
    withColumn('acct_name',
               # rows without a '#' pass through unchanged
               func.when(func.col('acct').like('%#%') == False, func.col('acct')).
               # otherwise cut just before the space preceding the '#'
               otherwise(func.expr('substr(acct, 1, locate("#", acct)-2)'))
               ). \
    show()
# +-------------------+------------+
# |               acct|   acct_name|
# +-------------------+------------+
# |    WALGREENS #6411|   WALGREENS|
# |CVS/PHARMACY #08864|CVS/PHARMACY|
# |                CVS|         CVS|
# +-------------------+------------+
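If you are on Spark 3.5 or later, the same logic can be written without expr, since pyspark.sql.functions gained a substr variant that accepts Column arguments. A minimal sketch under that assumption:
from pyspark.sql import functions as func

spark.createDataFrame([('WALGREENS #6411',), ('CVS/PHARMACY #08864',), ('CVS',)], ['acct']). \
    withColumn('acct_name',
               # rows without a '#' pass through unchanged
               func.when(~func.col('acct').contains('#'), func.col('acct')).
               # substr/locate as Column expressions, no expr() needed
               otherwise(func.substr('acct', func.lit(1), func.locate('#', 'acct') - 2))
               ). \
    show()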
CodePudding user response:
You can use the split() function to achieve this. I used split with '#' as the delimiter to get the required value, then removed the trailing space with rtrim().
- My input:
+---+-------------------+
| id|             string|
+---+-------------------+
|  1|    WALGREENS #6411|
|  2|CVS/PHARMACY #08864|
|  3|                CVS|
|  4|          WALGREENS|
|  5|         Test #1234|
+---+-------------------+
- Try using the following code:
from pyspark.sql.functions import split, col, rtrim

# split on '#' and keep everything before it
df = df.withColumn("New_string", split(col("string"), "#").getItem(0))
# you can also use substring_index():
# df.withColumn("result", substring_index(df['string'], '#', 1))
# drop the trailing space left before the '#'
df = df.withColumn('New_string', rtrim(df['New_string']))
df.show()
- Output:
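+---+-------------------+------------+
| id|             string|  New_string|
+---+-------------------+------------+
|  1|    WALGREENS #6411|   WALGREENS|
|  2|CVS/PHARMACY #08864|CVS/PHARMACY|
|  3|                CVS|         CVS|
|  4|          WALGREENS|   WALGREENS|
|  5|         Test #1234|        Test|
+---+-------------------+------------+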