Extract words from the text in Pyspark Dataframe

Time:07-09

I have a dataframe:

d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
    {'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20,31]}]
s = spark.createDataFrame(d)

+----------+----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text                                                                                                                        |
+----------+----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31]  |Mom called dad, and when he came home, he took moms car and drove to the store                                              |
+----------+----------------------------------------------------------------------------------------------------------------------------+

I need to extract a substring from the text column using the two indices in the begin_end array, like text[111:120+1]. In pandas, this can be done with zip:

df['new_col'] = [s[a:b+1] for s, (a, b) in zip(df['text'], df['begin_end'])]

result:

    begin_end     new_col
0   [111, 120]  jumps bad
1   [20, 31]    when he came

How can I rewrite this zip-based approach in PySpark to get new_col? Do I need to write a UDF for this?

CodePudding user response:

You can do this without a UDF by calling substr inside a SQL expression. It takes the string to slice, a 1-based starting position, and the length of the substring. An expression is needed because the substring function from pyspark.sql.functions doesn't take a column as starting position or length.

from pyspark.sql import functions as F

s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()

+----------+--------------------+------------+
| begin_end|                text|     new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...|   jumps bad|
|  [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+
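
To see where the two "+ 1" terms in the expression come from, here is a minimal pure-Python sketch of the index conversion. The helper names to_substr_args and sql_substr are made up for illustration and are not part of PySpark; sql_substr only imitates substr for a positive 1-based start. The first "+ 1" converts the 0-based begin index to a 1-based start, and the second makes the end index inclusive when computing the length.

```python
def to_substr_args(begin, end):
    """Convert a 0-based inclusive [begin, end] pair into the
    1-based (start, length) pair that SQL substr expects."""
    return begin + 1, end - begin + 1


def sql_substr(text, start, length):
    """Hypothetical stand-in for SQL substr, valid for start >= 1 only."""
    return text[start - 1:start - 1 + length]


rows = [
    ("They say that all cats land on their feet, but this does not apply "
     "to my cat. He not only often falls, but also jumps badly.", [111, 120]),
    ("Mom called dad, and when he came home, he took moms car and drove "
     "to the store", [20, 31]),
]

for text, (begin, end) in rows:
    start, length = to_substr_args(begin, end)
    # The substr-style call selects the same characters as the Python slice.
    assert sql_substr(text, start, length) == text[begin:end + 1]
    print(repr(sql_substr(text, start, length)))
```

If the check passes for your own rows, the same arithmetic can be dropped into the F.expr string unchanged.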