extract words from column according to range defined in second column

Time:10-21

I have a data frame containing text in one column and specified windows of interest in a tuple in another column. Consider this example.

import pandas as pd

df = pd.DataFrame(columns=['date', 'name', 'text', 'tuple'],
                  data = [['2011-01-01', "Peter",    "Das ist nicht vielversprechend.",            (101, 0, 3)],
                          ['2012-01-01', "Michelle", "Du bist nicht misstrauisch.",                (101, 1, 3)],
                          ['2013-01-01', "Michelle", "Das ist eine vertrauenserweckende Aussage.", (101, 0, 1)],
                          ['2014-01-01', "Peter",    "Ich bin sehr nervös.",                       (101, 1, 3)]])

Ignoring the first entry of each tuple, I would like to extract the span of words defined by elements 1 and 2 of the tuple (zero-indexed, with the end index excluded) from the text column and add it as a new column (words_of_interest).

For example, the first row should yield words 0-2 (up to but excluding word number 3).

Expected output:

"Das ist nicht",
"bist nicht",
"Das",
"bin sehr"

I have tried several variations of .astype(str).str.split().str[i] for the strings and .str.get(1) for the span, to no avail. Can someone help me?

Thanks in advance!

CodePudding user response:

One approach:

df["result"] = [" ".join(text.split()[start:end]) for text, (_, start, end) in zip(df["text"], df["tuple"])]
print(df)

Output

         date      name  ...        tuple         result
0  2011-01-01     Peter  ...  (101, 0, 3)  Das ist nicht
1  2012-01-01  Michelle  ...  (101, 1, 3)     bist nicht
2  2013-01-01  Michelle  ...  (101, 0, 1)            Das
3  2014-01-01     Peter  ...  (101, 1, 3)       bin sehr

[4 rows x 5 columns]
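
The same idea can be wrapped in a small helper, which makes the half-open slice explicit. This is only a sketch: the function name extract_span is made up for illustration, and it assumes every tuple has exactly three elements and that words are separated by whitespace.

```python
import pandas as pd


def extract_span(text, span):
    """Return the words text.split()[start:end], joined by single spaces."""
    _, start, end = span  # the first tuple element is ignored
    return " ".join(text.split()[start:end])


# Reduced version of the question's sample data (only the columns needed here)
df = pd.DataFrame(
    columns=["text", "tuple"],
    data=[
        ["Das ist nicht vielversprechend.", (101, 0, 3)],
        ["Du bist nicht misstrauisch.", (101, 1, 3)],
        ["Das ist eine vertrauenserweckende Aussage.", (101, 0, 1)],
        ["Ich bin sehr nervös.", (101, 1, 3)],
    ],
)

# Pair each text with its tuple row by row and store the extracted span
df["words_of_interest"] = [
    extract_span(text, span) for text, span in zip(df["text"], df["tuple"])
]
print(df["words_of_interest"].tolist())
```

Keeping the slicing in a named function also gives one place to add handling for missing values later, should the real data need it.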

CodePudding user response:

Quite straightforward with apply:

df['out'] = df.apply(lambda x: ' '.join(x['text'].split()[x['tuple'][1]:x['tuple'][2]]),
                     axis=1)
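
One property both answers rely on: Python list slicing does not raise when an index runs past the end of the list, so a span that exceeds the number of words simply returns whatever words exist. A quick check in plain Python (the span values below are made up for illustration and do not come from the question's data):

```python
words = "Ich bin sehr nervös.".split()

# End index past the last word: no IndexError, the slice is just shorter
beyond_end = " ".join(words[1:10])
print(beyond_end)

# Start index past the last word: the slice is empty, so the result is ""
beyond_start = " ".join(words[7:9])
print(repr(beyond_start))
```

If out-of-range spans should be treated as errors instead, an explicit length check would have to be added before slicing.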