I have a dataframe with a text column in this form:
column_description
"this section includes: animals: cats and dogs and vegetables but doesn’t include: plants and fruits: coco"
"this section includes the following: axis: x and y but doesn’t include: z, k and c"
"this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z"
I want to separate the text within the column and get two new columns like the following:
column_include
"animals: cats and dogs and vegetables"
"axis: x and y"
"letters: a, b and c "
column_exclude
"plants and fruits: coco"
"z, k and c"
"y and letter: z"
How can I achieve this with Python libraries? maybe using NLP techniques?
CodePudding user response:
Here is a generic regex that works on your 3 cases:
regex = r'this section includes[^:]*: (.*) (?:but|however it) doesn’t include: (.*)'
df[['column_include', 'column_exclude']] = \
df['column_description'].str.extract(regex)
Output:
column_description column_include column_exclude
0 this section includes: animals: cats and dogs ... animals: cats and dogs and vegetables plants and fruits: coco
1 this section includes the following: axis: x a... axis: x and y z, k and c
2 this section includes notably: letters: a, b a... letters: a, b and c y and letter: z
If you want to make the second part optional, use a non-greedy quantifier (*?
), an optional group ((?:...)
) and an end of line anchor ($
):
regex = r'^this section\s*(?:includes[^:]*: (.*?))?\s*(?:(?:but|however it)? doesn’t include: (.*))?$'
CodePudding user response:
You could make a filtering function and apply it on your column
import pandas as pd
lst = [
"this section includes: animals: cats and dogs and vegetables but doesn’t include: plants and fruits: coco",
"this section includes the following: axis: x and y but doesn’t include: z, k and c",
"this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z"
]
df = pd.DataFrame(lst, columns=['text_col'])
def filter_text(txt):
suffixes = ['but doesn’t', 'however it doesn’t']
for suffix in suffixes:
txt = txt.split(suffix)[0]
return ':'.join(txt.split(':')[1:])
df['text_col'] = df['text_col'].apply(filter_text)
print(df)
Input
text_col
0 this section includes: animals: cats and dogs ...
1 this section includes the following: axis: x a...
2 this section includes notably: letters: a, b a...
Output
text_col
0 animals: cats and dogs and vegetables
1 axis: x and y
2 letters: a, b and c