I have a data frame containing text in one column and specified windows of interest in a tuple in another column. Consider this example.
import pandas as pd
df = pd.DataFrame(columns=['date', 'name', 'text', 'tuple'],
                  data=[['2011-01-01', "Peter", "Das ist nicht vielversprechend.", (101, 0, 3)],
                        ['2012-01-01', "Michelle", "Du bist nicht misstrauisch.", (101, 1, 3)],
                        ['2013-01-01', "Michelle", "Das ist eine vertrauenserweckende Aussage.", (101, 0, 1)],
                        ['2014-01-01', "Peter", "Ich bin sehr nervös.", (101, 1, 3)]])
Ignoring the first entry of the tuple, I would now like to extract the word span defined by elements 1 and 2 of the tuple (zero-indexed, with the end index exclusive) from the column text and add it as a new column (words_of_interest).
For example, the first row should yield words 0-2 (up to but excluding word number 3). Expected output:
"Das ist nicht",
"bist nicht",
"Das",
"bin sehr"
I have tried several variations of .astype(str).str.split().str[i] for the strings and .str.get(1) for the span, to no avail. Can someone help me?
Thanks in advance!
CodePudding user response:
One approach:
df["result"] = [" ".join(text.split()[start:end]) for text, (_, start, end) in zip(df["text"], df["tuple"])]
print(df)
Output
date name ... tuple result
0 2011-01-01 Peter ... (101, 0, 3) Das ist nicht
1 2012-01-01 Michelle ... (101, 1, 3) bist nicht
2 2013-01-01 Michelle ... (101, 0, 1) Das
3 2014-01-01 Peter ... (101, 1, 3) bin sehr
[4 rows x 5 columns]
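If the text column might contain missing values, the same idea can be wrapped in a small helper. This is only a minimal sketch: extract_span is a hypothetical name, and returning None for non-string text is an assumption about the desired behaviour, not something the question specifies. The column is named words_of_interest here, as in the question.

```python
import pandas as pd

def extract_span(text, span):
    """Join the words text.split()[start:end]; span is (id, start, end)."""
    # Assumption: non-string text (e.g. NaN) should yield None.
    if not isinstance(text, str):
        return None
    _, start, end = span  # the first tuple entry is ignored
    return " ".join(text.split()[start:end])

df = pd.DataFrame(columns=['date', 'name', 'text', 'tuple'],
                  data=[['2011-01-01', "Peter", "Das ist nicht vielversprechend.", (101, 0, 3)],
                        ['2012-01-01', "Michelle", "Du bist nicht misstrauisch.", (101, 1, 3)],
                        ['2013-01-01', "Michelle", "Das ist eine vertrauenserweckende Aussage.", (101, 0, 1)],
                        ['2014-01-01', "Peter", "Ich bin sehr nervös.", (101, 1, 3)]])

df["words_of_interest"] = [extract_span(t, s) for t, s in zip(df["text"], df["tuple"])]
print(df["words_of_interest"].tolist())
# ['Das ist nicht', 'bist nicht', 'Das', 'bin sehr']
```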
CodePudding user response:
This is quite straightforward with apply:
df['out'] = df.apply(lambda x: ' '.join(x['text'].split()[x['tuple'][1]:x['tuple'][2]]),
axis=1)
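One thing to keep in mind: Python slicing does not raise an error when the end index exceeds the number of words, it just returns fewer words. If that matters, a flag column can mark the rows where the span fits. The following is a rough sketch on made-up sample rows; the column name span_fits and the second row are purely illustrative.

```python
import pandas as pd

# Illustrative data: the second text has only two words, but the span ends at 3.
df = pd.DataFrame({
    "text": ["Das ist nicht vielversprechend.", "Ich bin"],
    "tuple": [(101, 0, 3), (101, 1, 3)],
})

df["out"] = df.apply(lambda x: " ".join(x["text"].split()[x["tuple"][1]:x["tuple"][2]]),
                     axis=1)

# True where the text has at least as many words as the exclusive end index.
words = df["text"].str.split()
df["span_fits"] = [len(w) >= span[2] for w, span in zip(words, df["tuple"])]

print(df[["out", "span_fits"]])
```

Here the second row yields just "bin" and span_fits is False, because the text has two words while the span asks for words 1 and 2.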