I have a list, L:
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
I have a pandas DataFrame, DF:
Text |
---|
the objects are both before and after the person |
the object is behind the person |
the object in right is next to top left hand side of person |
I would like to extract all words in L from the DF column 'Text' in such a manner:
Text | Extracted_Value |
---|---|
the objects are both before and after the person | before_after |
the object is behind the person | behind |
the object in right is next to top left hand side of person | right_top left hand side |
For cases 1 and 2, my code works:
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
pattern = r"(?:^|\s+)(" + "|".join(L) + r")(?:\s+|$)"
df["Extracted_Value"] = (
    df['Text'].str.findall(pattern).str.join("_").replace({"": None})
)
For case 3, I get right_top_hand.
As in the third example, if identified words are contiguous, they are to be picked up as a phrase (one extraction). So in "the object in right is next to top left hand side of person", there are two extractions: "right" and "top left hand side". Hence, only these two extractions are separated by an _.
I am not sure how to get it to work!
CodePudding user response:
Try:
df["Extracted_Value"] = (
    df.Text.apply(
        lambda x: "|".join(w if w in L else "" for w in x.split()).strip("|")
    )
    .replace(r"\|{2,}", "_", regex=True)
    .str.replace("|", " ", regex=False)
)
print(df)
Prints:
Text Extracted_Value
0 the objects are both before and after the person before_after
1 the object is behind the person behind
2 the object in right is next to top left hand side of person right_top left hand side
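To see why this works, it may help to trace the three steps on a single string. The following is a minimal sketch in plain Python (no pandas), using `re.sub` in place of the regex-based `Series.replace`; the variable names `marked`, `grouped`, and `result` are only illustrative.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
text = "the object in right is next to top left hand side of person"

# Step 1: keep words that are in L, blank out the rest, join with "|".
# Contiguous keywords end up separated by one "|", others by two or more.
marked = "|".join(w if w in L else "" for w in text.split()).strip("|")
print(marked)   # right||||top|left|hand|side

# Step 2: a run of two or more "|" marks a gap between extractions.
grouped = re.sub(r"\|{2,}", "_", marked)
print(grouped)  # right_top|left|hand|side

# Step 3: the remaining single "|" join words of the same phrase.
result = grouped.replace("|", " ")
print(result)   # right_top left hand side
```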
EDIT: Adapting @Wiktor's answer to pandas:
pattern = fr"\b((?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*)\b"
df["Extracted_Value"] = (
    df["Text"].str.extractall(pattern).groupby(level=0).agg("_".join)
)
print(df)
CodePudding user response:
You need to use
pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"
The regex will look like
\b(?:top|left|behind|before|right|after|hand|side)(?:\s+(?:top|left|behind|before|right|after|hand|side))*\b
See the regex demo.
It will match
\b - a word boundary
(?:{'|'.join(L)}) - one of the words in L
(?:\s+(?:{'|'.join(L)}))* - zero or more repetitions of one or more whitespace characters followed by a word from the L list
\b - a word boundary.
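As a quick check with the standard `re` module, the sketch below compares the pattern from the question (assuming it was meant to use `\s+`) with the one above. The names `old_pattern`, `new_pattern`, and `words` are only illustrative.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
words = "|".join(L)
text = "the object in right is next to top left hand side of person"

# Pattern from the question (assumed quantifiers: \s+). Each match consumes
# the whitespace after the word, so the very next word has no leading
# whitespace left to match and is skipped.
old_pattern = r"(?:^|\s+)(" + words + r")(?:\s+|$)"
print(re.findall(old_pattern, text))  # ['right', 'top', 'hand']

# Pattern from this answer: one keyword, then any number of
# "whitespace + keyword" repetitions, so a contiguous run is one match.
new_pattern = rf"\b(?:{words})(?:\s+(?:{words}))*\b"
print(re.findall(new_pattern, text))  # ['right', 'top left hand side']
```

Because the second pattern has no capturing group, `re.findall` returns the whole match, which is why the full phrase comes back as one item.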
Python demo:
import pandas as pd
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
df = pd.DataFrame({'Text':["the objects are both before and after the person","the object is behind the person", "the object in right is next to top left hand side of person"]})
pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"
Output:
>>> df['Text'].str.findall(pattern).str.join("_").replace({"": None})
0 before_after
1 behind
2 right_top left hand side
Name: Text, dtype: object
CodePudding user response:
This works for me; it just checks each word of the phrase in each row against the items in the list.
import numpy as np
import pandas as pd

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
df = pd.DataFrame(
    ['the objects are both before and after the person',
     'the object is behind the person',
     'the object in right is next to top left hand side of person'], columns=['Text'])
# Keep the words of each row that appear in L and join them with "_"
df['Extracted_Value'] = df['Text'].str.split().apply(lambda x: '_'.join([m for m in x if m in L])).replace('', np.nan)
My output is,
Text Extracted_Value
0 the objects are both before and after the person before_after
1 the object is behind the person behind
2 the object in right is next to top left hand s... right_top_left_hand_side
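Note that the third row above joins every matched word with `_`, whereas the question asks for contiguous words to stay together as one phrase. One possible way to keep this split-based approach and still group contiguous words is `itertools.groupby`. This is only a sketch: `extract_phrases` is a hypothetical helper name, not part of the answer above.

```python
from itertools import groupby

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

def extract_phrases(text, keywords=L):
    """Join contiguous keywords with spaces and separate runs with '_'."""
    wanted = set(keywords)
    # groupby splits the word list into consecutive runs of
    # "is a keyword" / "is not a keyword".
    runs = groupby(text.split(), key=lambda w: w in wanted)
    phrases = [" ".join(run) for is_match, run in runs if is_match]
    # Return None when nothing matched, mirroring the question's code.
    return "_".join(phrases) or None

print(extract_phrases("the object in right is next to top left hand side of person"))
# right_top left hand side
```

With pandas, this could be applied row by row as `df['Text'].apply(extract_phrases)`.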