Extract all phrases from a pandas DataFrame based on multiple words in a list

I have a list, L:

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

I have a pandas DataFrame, DF:

Text
the objects are both before and after the person
the object is behind the person
the object in right is next to top left hand side of person

I would like to extract all words in L from the DF column 'Text' in such a manner:

Text	Extracted_Value
the objects are both before and after the person	before_after
the object is behind the person	behind
the object in right is next to top left hand side of person	right_top left hand side

For cases 1 and 2, my code works:

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
pattern = r"(?:^|\s+)(" + "|".join(L) + r")(?:\s+|$)"
df["Extracted_Value"] = (
    df['Text'].str.findall(pattern).str.join("_").replace({"": None})
)

For case 3, I get right_top_hand.

As in the third example, if the identified words are contiguous, they should be picked up as one phrase (a single extraction). So in "the object in right is next to top left hand side of person" there are two extractions: "right" and "top left hand side". Only these two extractions are separated by an _.

I am not sure how to get it to work!

CodePudding user response：

Try:

df["Extracted_Value"] = (
    df.Text.apply(
        lambda x: "|".join(w if w in L else "" for w in x.split()).strip("|")
    )
    .replace(r"\|{2,}", "_", regex=True)
    .str.replace("|", " ", regex=False)
)
print(df)

Prints:

                                                          Text           Extracted_Value
0             the objects are both before and after the person              before_after
1                              the object is behind the person                    behind
2  the object in right is next to top left hand side of person  right_top left hand side

EDIT: Adapting @Wiktor's answer to pandas:

pattern = fr"\b((?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*)\b"

df["Extracted_Value"] = (
    df["Text"].str.extractall(pattern).groupby(level=0).agg("_".join)
)
print(df)

CodePudding user response：

You need to use

pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"

The regex will look like

\b(?:top|left|behind|before|right|after|hand|side)(?:\s+(?:top|left|behind|before|right|after|hand|side))*\b

It will match

\b - a word boundary
(?:{'|'.join(L)}) - one of the words in L
(?:\s+(?:{'|'.join(L)}))* - zero or more repetitions of one or more whitespace characters followed by a word from the L list
\b - a word boundary.
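
To illustrate why the word boundaries matter, here is a minimal sketch using only the standard `re` module. It compares the question's pattern, where the surrounding whitespace is part of each match, with the `\b` version on the third sentence. The names `words`, `consuming`, `boundary`, `consumed_matches` and `boundary_matches` are made up for this illustration, and the `re.escape` call is an added precaution that is not part of the pattern above.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
text = "the object in right is next to top left hand side of person"

# re.escape is an assumption added here: it only matters if a word in L
# ever contains a regex metacharacter.
words = "|".join(map(re.escape, L))

# The question's pattern: leading/trailing whitespace is consumed by the match,
# so the space after "top" is no longer available to start a match on "left".
consuming = rf"(?:^|\s+)({words})(?:\s+|$)"

# The \b version: a word boundary is zero-width, so nothing is consumed
# and runs of adjacent list words are captured as one phrase.
boundary = rf"\b(?:{words})(?:\s+(?:{words}))*\b"

consumed_matches = re.findall(consuming, text)
boundary_matches = re.findall(boundary, text)

print("_".join(consumed_matches))  # right_top_hand
print("_".join(boundary_matches))  # right_top left hand side
```

Because the whitespace in the first pattern is consumed, every second word in a contiguous run is skipped, which is where right_top_hand comes from.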

Python demo:

import pandas as pd
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
df = pd.DataFrame({'Text':["the objects are both before and after the person","the object is behind the person", "the object in right is next to top left hand side of person"]})
pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"

Output:

>>> df['Text'].str.findall(pattern).str.join("_").replace({"": None})
0                before_after
1                      behind
2    right_top left hand side
Name: Text, dtype: object

CodePudding user response：

This works for me; it just compares each word of the phrase in each row with the items in the list.

import numpy as np
import pandas as pd

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

df = pd.DataFrame(
['the objects are both before and after the person',
'the object is behind the person',
'the object in right is next to top left hand side of person'], columns=['Text'])

df['Extracted_Value'] = df['Text'].str.split().apply(lambda x: '_'.join([m for m in x if m in L])).replace('',np.nan)

My output is,

    Text    Extracted_Value
0   the objects are both before and after the person    before_after
1   the object is behind the person                     behind
2   the object in right is next to top left hand s...   right_top_left_hand_side
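
Note that the output above joins every match with an _, so the third row comes out as right_top_left_hand_side rather than the requested right_top left hand side. The following is a minimal sketch of one way to keep the split-based approach while grouping adjacent matches, assuming plain whitespace tokenisation is good enough for the data. It swaps in `itertools.groupby` from the standard library; `WORDS` and `extract_phrases` are hypothetical names introduced for this illustration, not part of the answer.

```python
from itertools import groupby

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
WORDS = set(L)  # hypothetical name; a set makes membership tests cheap


def extract_phrases(text):
    """Join contiguous list words with spaces and separate runs with '_'."""
    tokens = text.split()
    phrases = [
        " ".join(run)
        # groupby yields consecutive runs of tokens sharing the same key,
        # i.e. runs of list words alternating with runs of other words.
        for is_match, run in groupby(tokens, key=lambda w: w in WORDS)
        if is_match
    ]
    # Return None when nothing matched, mirroring the NaN/None in the thread.
    return "_".join(phrases) if phrases else None


print(extract_phrases("the object in right is next to top left hand side of person"))
# right_top left hand side
```

With the DataFrame from this thread, the function could be applied as df["Extracted_Value"] = df["Text"].apply(extract_phrases).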