I have a dataframe:
d = [{'text': 'They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.', 'begin_end': [111, 120]},
{'text': 'Mom called dad, and when he came home, he took moms car and drove to the store', 'begin_end': [20,31]}]
s = spark.createDataFrame(d)
+----------+----------------------------------------------------------------------------------------------------------------------------+
|begin_end |text                                                                                                                        |
+----------+----------------------------------------------------------------------------------------------------------------------------+
|[111, 120]|They say that all cats land on their feet, but this does not apply to my cat. He not only often falls, but also jumps badly.|
|[20, 31]  |Mom called dad, and when he came home, he took moms car and drove to the store                                              |
+----------+----------------------------------------------------------------------------------------------------------------------------+
I need to extract a substring from the text
column using the begin_end
column array, like text[111:120+1]
. In pandas, this could be done via zip
:
df['new_col'] = [s[a:b+1] for s, (a, b) in zip(df['text'], df['begin_end'])]
result:
begin_end new_col
0 [111, 120] jumps bad
1 [20, 31] when he came
How can I rewrite the zip
approach in PySpark to get new_col
? Do I need to write a UDF for this?
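For reference, the pandas approach can be run end-to-end as below (a minimal sketch using the two sample rows above; the end index is treated as inclusive, hence the b + 1):

```python
import pandas as pd

df = pd.DataFrame({
    'text': [
        'They say that all cats land on their feet, but this does not '
        'apply to my cat. He not only often falls, but also jumps badly.',
        'Mom called dad, and when he came home, he took moms car and '
        'drove to the store',
    ],
    'begin_end': [[111, 120], [20, 31]],
})

# Slice each string by its own (begin, end) pair; end is inclusive, so b + 1.
df['new_col'] = [s[a:b + 1] for s, (a, b) in zip(df['text'], df['begin_end'])]
print(df[['begin_end', 'new_col']])
```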
CodePudding user response:
You can do so by using substr
in an expression. It expects the string you want to substring, a 1-based starting position, and the length of the substring. An expression is needed because the substring function from pyspark.sql.functions
doesn't take a column as the starting position or length.
from pyspark.sql import functions as F

s.withColumn('new_col', F.expr("substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)")).show()
+----------+--------------------+------------+
| begin_end|                text|     new_col|
+----------+--------------------+------------+
|[111, 120]|They say that all...|   jumps bad|
|  [20, 31]|Mom called dad, a...|when he came|
+----------+--------------------+------------+
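The only subtle part is the index arithmetic: SQL's substr is 1-based and takes a length rather than an end index. The conversion can be sanity-checked in plain Python with a hypothetical sql_substr helper that mimics the SQL semantics:

```python
def sql_substr(s: str, pos: int, length: int) -> str:
    """Mimic SQL substr(str, pos, len): pos is 1-based, len is a count."""
    return s[pos - 1 : pos - 1 + length]

text = 'Mom called dad, and when he came home, he took moms car and drove to the store'
begin, end = 20, 31
# Same arithmetic as the Spark expression:
# substr(text, begin_end[0] + 1, begin_end[1] - begin_end[0] + 1)
print(sql_substr(text, begin + 1, end - begin + 1))  # → when he came
```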