Apply wordninja.split() using pandas

I have a dataframe df with a string column sld whose values are words run together with no space or delimiter. One library that can split such strings is wordninja.

For example, wordninja.split('culturetosuccess') returns ['culture', 'to', 'success'].

Using pandas_udf, I have:

@pandas_udf(ArrayType(StringType()))
def split_word(x):
   splitted = wordninja.split(x)
   return splitted

However, it throws an error when I apply it on the column sld:

df1=df.withColumn('test', split_word(col('sld')))

TypeError: expected string or bytes-like object

What I tried:

I noticed a similar problem with the well-known split() function, where the workaround is to go through the Series .str accessor, as mentioned here. That workaround does not work for wordninja.split.

Is there a workaround for this issue?

Edit: I think, in a nutshell, the issue is that the pandas_udf input is a pd.Series, while wordninja.split expects a single string.

My df looks like this:

+-------------+
|sld          |
+-------------+
|"hellofriend"|
|"restinpeace"|
|"this"       |
|"that"       |
+-------------+

I want something like this:

+-------------+---------------------+
|    sld      |         test        |
+-------------+---------------------+
|"hellofriend"|["hello","friend"]   |
|"restinpeace"|["rest","in","peace"]|
|"this"       |["this"]             |
|"that"       |["that"]             |
+-------------+---------------------+

Answer:

Use .apply to run the function on each element of the pandas Series, something like this:

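import pandas as pd
import wordninja
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType

@pandas_udf(ArrayType(StringType()))
def split_word(x: pd.Series) -> pd.Series:
   return x.apply(wordninja.split)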
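
The difference between passing the whole Series and applying per element can be checked without Spark. The following is only a minimal sketch in plain pandas: toy_split is a hypothetical stand-in that is not part of wordninja and knows just the words in the sample data, but like wordninja.split it accepts one string at a time.

```python
import re
import pandas as pd

# Hypothetical stand-in for wordninja.split (an assumption for illustration):
# it recognises only this tiny vocabulary and takes ONE string.
_VOCAB = re.compile(r"hello|friend|rest|in|peace|this|that")

def toy_split(s):
    return _VOCAB.findall(s)

sld = pd.Series(["hellofriend", "restinpeace", "this", "that"])

# Passing the whole Series, as the original pandas_udf did, raises TypeError.
try:
    toy_split(sld)
    whole_series_failed = False
except TypeError:
    whole_series_failed = True

# Applying the function to each element yields one list per row.
result = sld.apply(toy_split)
```

Here whole_series_failed ends up True, and result holds one list of words per input row, which is the shape an ArrayType(StringType()) column expects.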

Answer:

As far as I know, a pandas_udf is handed a whole pandas Series, whereas wordninja.split works on a single string. I think the simplest way to do what you want is a regular udf, which is called once per row.

import wordninja
from pyspark.sql import functions as F
df = spark.createDataFrame([("hellofriend",), ("restinpeace",), ("this",), ("that",)], ['sld'])

@F.udf('array<string>')  # without a return type, udf produces a string column
def split_word(x):
   return wordninja.split(x)

df.withColumn('col2', split_word('sld')).show()
# +-----------+-----------------+
# |        sld|             col2|
# +-----------+-----------------+
# |hellofriend|  [hello, friend]|
# |restinpeace|[rest, in, peace]|
# |       this|           [this]|
# |       that|           [that]|
# +-----------+-----------------+