Extract words surrounding a word and inserting results in a dataframe column (additional)

Time:10-05

Following on from this post: Extract words surrounding a word and inserting results in a dataframe column

What if the word "products" appears multiple times in a text? How can you get the x words before and after every occurrence at once (including the word being searched for)? For instance:

company | year | text  
Apple   | 2016 |"The Company sells a lot of its products worldwide in order to make it accessible to the public often. Everyday multiple consumer purchase many of those products for their family members. That being said we need to make these milk products accessible to every family in the US. "  

The solution given:

pat = r'(?P<before>(?:\w+\W+){,3})products\W+(?P<after>(?:\w+\W+){,3})'
new = df.text.str.extract(pat, expand=True)

returns NaN for both before and after.

I want my results to look like this:

Output
"lot of its products worldwide in order, many of those products for their family, make these milk products accessible to every"    

CodePudding user response:

I'm not sure about the extract approach you propose. However, if you don't mind using a custom function, here is one solution that produces the output:

s = """The Company sells a lot of its products worldwide in order to make it accessible to the public often. Everyday multiple consumer purchase many of those products for their family members. That being said we need to make these milk products accessible to every family in the US. """

def parse(s, keyword, border=3):
    cleanword = lambda wrd: wrd.replace(" ", "").replace(".","").replace("!","").replace("?","")
    words = [cleanword(wrd) for wrd in s.split()]
    i = -1
    result = []
    while i < len(words):
        try:
            i = words.index(keyword, i + 1, len(words))
        except ValueError:
            break
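        # clamp the start at 0 so a keyword near the beginning does not produce a negative slice index
        sel = words[max(0, i - border): i + border + 1]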
        result.append(" ".join(sel))
    return ", ".join(result)

parse(s, 'products')
# 'lot of its products worldwide in order, many of those products for their family, make these milk products accessible to every'

You can map a dataframe column using this method:

import pandas as pd

df = pd.DataFrame({"input":[s]})
df['output'] = df.input.map(lambda x: parse(x, 'products'))

You can also combine the linked pattern recognition example with re.findall to get the desired output. Here is one way, though I'm sure there are many others:

import re

pat = r'(?P<before>(?:\w+\W+){,3})%s\W+(?P<after>(?:\w+\W+){,3})'
keywrd = "products"
output = [(before, keywrd, after) for before, after in re.findall(pat % keywrd, s)]
#output: 
#[('lot of its ', 'products', 'worldwide in order '),
# ('many of those ', 'products', 'for their family '),
# ('make these milk ', 'products', 'accessible to every ')]
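To get from those tuples to the single comma-separated string asked for in the question, the regex approach can be wrapped in a small helper and mapped over the column. The following is only a sketch: the helper name `context_phrases`, the `output` column name, and the added `\b` word boundary (so that the keyword is not matched in the middle of a longer word) are assumptions of mine, not part of the original posts.

```python
import re
import pandas as pd

text = ("The Company sells a lot of its products worldwide in order to make it "
        "accessible to the public often. Everyday multiple consumer purchase many "
        "of those products for their family members. That being said we need to "
        "make these milk products accessible to every family in the US. ")

df = pd.DataFrame({"company": ["Apple"], "year": [2016], "text": [text]})

def context_phrases(s, keyword, border=3):
    """Hypothetical helper: join every '<border words> keyword <border words>' hit."""
    # Up to `border` words before and after the keyword; \b is an added assumption.
    pat = r'((?:\w+\W+){,%d})\b%s\W+((?:\w+\W+){,%d})' % (
        border, re.escape(keyword), border)
    phrases = []
    for before, after in re.findall(pat, s):
        # Keep only the word characters so stray punctuation is dropped.
        words = re.findall(r'\w+', before) + [keyword] + re.findall(r'\w+', after)
        phrases.append(" ".join(words))
    return ", ".join(phrases)

df["output"] = df["text"].map(lambda x: context_phrases(x, "products"))
```

Because `re.findall` only returns non-overlapping matches, two occurrences of the keyword that sit within `border` words of each other would share context words and the second one may be missed; the `parse` function above does not have that limitation.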