str.findall returns all NA's-CodePudding

I have this df1 with a lot of different news articles. An example of a news article is this:

'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'

And I have this df2 with all the words from the news articles in the column "Word" with their corresponding LIWC category in the second column.

Data example:

data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}

What I'm trying to do is: To calculate for each article in df1 how many words occur of each category in df2. So I want to create a column for each category mentioned in df2["category"]. And it should look like this in the end:

 Content              | Achieve | Affiliation   | affect
article text here     | 6       | 2             | 2 
article text here     | 2       | 43            | 2
article text here     | 6       | 8             | 8 
article text here     | 2       | 13            | 7

I since it's all strings I tried str.findall but this returns all NA's for everything. This is what I tried:

from collections import Counter
liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
         .apply(lambda x: pd.Series(Counter(x), index=df2["category"].unique())) \
         .fillna(0).astype(int)

Both a pandas or r solution would be equally great.

CodePudding user response：

You can craft a custom regex with named capturing groups and use str.extractall.

With your dictionary the custom regex would be '(?P<Affect>\\bkilling\\b)|(?P<Adverb>\\beven\\b)|(?P<Anx>\\bguilty\\b)|(?P<Body>\\bbrain\\b)'

Then groupby max the notna results, convert to int and join to the original dataframe:

regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v,k  in zip(*data.values()))
(df1.join(df1['articles'].str.extractall(regex, flags=2) # re.IGNORECASE
             .notna().groupby(level=0).max()
             .astype(int)
         )
)

output:

                                         articles  Affect  Adverb  Anx  Body
0  Today is killing Aug. 17 the 230th day of 2020       1       0    0     0
1  Today is brain Aug. 17 the guilty day of 2020        0       0    1     1

CodePudding user response：

First flatten df2 values to dictionary, add word boundaries \b\b and pass to Series.str.extractall, so possible use Series.map and create DataFrame by reset_index, last pass to crosstab and append to original by DataFrame.join:

df1 = pd.DataFrame({'articles':['Today is killing Aug. 17 the 230th day of 2020',
                                'Today is brain Aug. 17 the guilty day of 2020 ']})

print (df1)
                                         articles
0  Today is killing Aug. 17 the 230th day of 2020
1  Today is brain Aug. 17 the guilty day of 2020

If list of values in Word column like in picture:

data = {'Word': [['killing'],['even'],['guilty'],['brain']], 
       'Category': ['Affect', 'Adverb', 'Anx','Body']} 
df2 = pd.DataFrame(data)
print (df2)
        Word Category
0  [killing]   Affect
1     [even]   Adverb
2   [guilty]      Anx
3    [brain]     Body


d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}

If df2 is different:

data = {'Word': ['killing','even','guilty','brain'],
        'Category': ['Affect', 'Adverb', 'Anx','Body']} 
df2 = pd.DataFrame(data)
print (df2)

0  killing   Affect
1     even   Adverb
2   guilty      Anx
3    brain     Body
    
d = dict(zip(df2['Word'], df2['Category']))
print (d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}

import re

#thank you for improve solution Wiktor Stribiżew
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))
df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')

df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print (df)
                                         articles  Affect  Anx  Body
0  Today is killing Aug. 17 the 230th day of 2020       1    0     0
1  Today is brain Aug. 17 the guilty day of 2020        0    1     1