Split a DataFrame text column at given strings

I have a dataframe with a text column in this form:

column_description

"this section includes: animals: cats and dogs and vegetables but doesn’t include: plants and fruits: coco"
"this section includes the following: axis: x and y but doesn’t include: z, k and c"
"this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z"

I want to separate the text within the column and get two new columns like the following:

column_include

"animals: cats and dogs and vegetables"
"axis: x and y"
"letters: a, b and c"

column_exclude

"plants and fruits: coco"
"z, k and c"
"y and letter: z"

How can I achieve this with Python libraries, perhaps using NLP techniques?

CodePudding user response：

Here is a generic regex that works on your 3 cases:

regex = r'this section includes[^:]*: (.*) (?:but|however it) doesn’t include: (.*)'

df[['column_include', 'column_exclude']] = \
df['column_description'].str.extract(regex)

Output:

                                  column_description                         column_include           column_exclude
0  this section includes: animals: cats and dogs ...  animals: cats and dogs and vegetables  plants and fruits: coco
1  this section includes the following: axis: x a...                          axis: x and y               z, k and c
2  this section includes notably: letters: a, b a...                    letters: a, b and c          y and letter: z
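
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the three sample rows and uses named groups, so str.extract labels the new columns directly. The join step is my own choice rather than part of the snippet above. Rows that do not match the pattern come back as NaN in both columns.

```python
import pandas as pd

# The three sample rows from the question.
df = pd.DataFrame({'column_description': [
    "this section includes: animals: cats and dogs and vegetables but doesn’t include: plants and fruits: coco",
    "this section includes the following: axis: x and y but doesn’t include: z, k and c",
    "this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z",
]})

# Same pattern as above; the named groups become the column names.
regex = (r'this section includes[^:]*: (?P<column_include>.*) '
         r'(?:but|however it) doesn’t include: (?P<column_exclude>.*)')

# str.extract returns one column per capture group, aligned on the index.
parts = df['column_description'].str.extract(regex)
df = df.join(parts)

print(df[['column_include', 'column_exclude']])
```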

If you want to make the second part optional, use a non-greedy quantifier (*?), an optional non-capturing group ((?:...)?) and an end-of-line anchor ($):

regex = r'^this section\s*(?:includes[^:]*: (.*?))?\s*(?:(?:but|however it)? doesn’t include: (.*))?$'
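
To see what the optional version does, it can be tried on plain strings with the standard re module before applying it to the column. This is only a quick check: the second and third sample strings are made up for illustration and are not from the question.

```python
import re

regex = r'^this section\s*(?:includes[^:]*: (.*?))?\s*(?:(?:but|however it)? doesn’t include: (.*))?$'

samples = [
    # Row from the question: both parts present.
    "this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z",
    # Hypothetical row with no exclusion clause.
    "this section includes: animals: cats and dogs",
    # Hypothetical row with no inclusion clause.
    "this section doesn’t include: z",
]

# Each result is a (include, exclude) tuple; a missing part is None.
results = [re.match(regex, s).groups() for s in samples]
for groups in results:
    print(groups)
```

With str.extract, the groups that come back as None here show up as NaN in the corresponding column.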

CodePudding user response：

You could write a filtering function and apply it to your column. Note that this keeps only the included part:

import pandas as pd

lst = [
    "this section includes: animals: cats and dogs and vegetables but doesn’t include: plants and fruits: coco",
    "this section includes the following: axis: x and y but doesn’t include: z, k and c",
    "this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z",
]

df = pd.DataFrame(lst, columns=['text_col'])


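def filter_text(txt):
    # Phrases that introduce the excluded part; everything from them onward is dropped.
    suffixes = ['but doesn’t', 'however it doesn’t']
    for suffix in suffixes:
        txt = txt.split(suffix)[0]
    # Drop the lead-in up to the first colon, keeping any later colons.
    return ':'.join(txt.split(':')[1:])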


df['text_col'] = df['text_col'].apply(filter_text)
print(df)

Input

                                            text_col
0  this section includes: animals: cats and dogs ...
1  this section includes the following: axis: x a...
2  this section includes notably: letters: a, b a...

Output

                                  text_col
0   animals: cats and dogs and vegetables 
1                           axis: x and y 
2                     letters: a, b and c
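
Since the question asks for two columns, here is a rough sketch of how the same splitting idea could return both parts. The function name split_description and the marker pattern are assumptions of mine, based only on the three sample rows. It also strips the surrounding whitespace that the version above leaves in.

```python
import re

# Phrases that introduce the excluded part (assumed from the three sample rows).
EXCLUDE_MARKER = re.compile(r'\s*(?:but|however it) doesn’t include:\s*')


def split_description(txt):
    """Return an (included, excluded) pair for one description string."""
    # Split once, at the first exclusion marker if there is one.
    parts = EXCLUDE_MARKER.split(txt, maxsplit=1)
    head = parts[0]
    excluded = parts[1].strip() if len(parts) > 1 else None
    # Drop the lead-in up to the first colon, keeping any later colons.
    _, _, included = head.partition(':')
    return included.strip(), excluded


rows = [
    "this section includes: animals: cats and dogs and vegetables but doesn’t include: plants and fruits: coco",
    "this section includes the following: axis: x and y but doesn’t include: z, k and c",
    "this section includes notably: letters: a, b and c however it doesn’t include: y and letter: z",
]

for row in rows:
    print(split_description(row))
```

To fill the two columns, the list of pairs from mapping this function over the text column can be passed to pd.DataFrame together with the two column names and the original index.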