I have a dataframe df with the column sld of type string, which contains consecutive words with no space/delimiter. One of the libraries that can be used to split such strings is wordninja:
E.g. wordninja.split('culturetosuccess') outputs ['culture','to','success']
Using pandas_udf, I have:
@pandas_udf(ArrayType(StringType()))
def split_word(x):
    splitted = wordninja.split(x)
    return splitted
However, it throws an error when I apply it to the column sld:
df1 = df.withColumn('test', split_word(col('sld')))
TypeError: expected string or bytes-like object
What I tried:
I noticed that there is a similar problem with the well-known function split(), but the workaround there is to use string.str, as mentioned here. This doesn't work with wordninja.split.
Is there any workaround for this issue?
Edit: I think in a nutshell the issue is: the pandas_udf input is a pd.Series, while wordninja.split expects a string.
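The mismatch described in the edit can be reproduced without Spark. The sketch below is only an illustration: toy_split is a hypothetical regex-based stand-in for wordninja.split (it only knows the words in this question), used so the snippet runs with just pandas installed. It is an assumption that wordninja fails for the same reason, although the error message suggests a regex call on a non-string.

```python
import re

import pandas as pd

# Hypothetical stand-in for wordninja.split: a regex that only knows the
# words used in this question. Like wordninja.split, it expects ONE string.
_TOY_WORDS = re.compile(r"hello|friend|rest|in|peace|this|that")


def toy_split(s):
    """Split a single string into the known words it is made of."""
    return _TOY_WORDS.findall(s)


sld = pd.Series(["hellofriend", "restinpeace", "this", "that"])

# Passing the whole Series (which is what a pandas_udf receives) fails,
# because the regex engine wants a str, not a pd.Series.
try:
    toy_split(sld)
    whole_series_error = None
except TypeError as exc:
    whole_series_error = exc

# Calling the splitter once per element works.
per_element = sld.apply(toy_split)

print(type(whole_series_error).__name__)
print(per_element.tolist())
```

So the function has to be applied element by element inside the pandas_udf, rather than to the Series as a whole.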
My df looks like this:
+-------------+
|sld          |
+-------------+
|"hellofriend"|
|"restinpeace"|
|"this"       |
|"that"       |
+-------------+
I want something like this:
+-------------+---------------------+
|sld          |test                 |
+-------------+---------------------+
|"hellofriend"|["hello","friend"]   |
|"restinpeace"|["rest","in","peace"]|
|"this"       |["this"]             |
|"that"       |["that"]             |
+-------------+---------------------+
CodePudding user response:
Just use .apply to perform the computation on each element of the pandas Series, something like this:
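import pandas as pd
import wordninja
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType
@pandas_udf(ArrayType(StringType()))
def split_word(x: pd.Series) -> pd.Series:
    # wordninja.split expects a single string, so call it once per element
    return x.apply(wordninja.split)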
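One thing to watch with the element-wise approach: if sld contains nulls, they would most likely reach the function as None inside the Series, and a string-only splitter would raise the same TypeError again. A minimal sketch of a null-safe variant follows, written as a plain pandas function so it can be tried without Spark. The name split_series and the splitter parameter are hypothetical; in the real pandas_udf the splitter would be wordninja.split.

```python
import pandas as pd


def split_series(x: pd.Series, splitter) -> pd.Series:
    """Apply `splitter` (any str -> list function) to each string in x.

    Non-string values such as None are passed through as None instead of
    being handed to the splitter.
    """
    return x.apply(lambda s: splitter(s) if isinstance(s, str) else None)


# Demonstration with str.split as a placeholder splitter (an assumption for
# this example only; it splits on whitespace, unlike wordninja.split).
result = split_series(pd.Series(["rest in peace", None, "this"]), str.split)
print(result.tolist())
```

Inside the pandas_udf, the body would then be something like return split_series(x, wordninja.split), and the null rows come back as null in the array column.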
CodePudding user response:
As far as I know, a pandas_udf receives a whole pd.Series rather than individual strings, while wordninja.split works on one string at a time. I think the simplest way to do what you want is a plain udf, which is called once per value.
import wordninja
from pyspark.sql import functions as F
df = spark.createDataFrame([("hellofriend",), ("restinpeace",), ("this",), ("that",)], ['sld'])
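# Declare the return type; without it, F.udf defaults to a string column
@F.udf(returnType='array<string>')
def split_word(x):
    return wordninja.split(x)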
df.withColumn('col2', split_word('sld')).show()
# +-----------+-----------------+
# |        sld|             col2|
# +-----------+-----------------+
# |hellofriend|  [hello, friend]|
# |restinpeace|[rest, in, peace]|
# |       this|           [this]|
# |       that|           [that]|
# +-----------+-----------------+