Home > Back-end >  Python Pandas SUMIF excel equivalent
Python Pandas SUMIF excel equivalent

Time:05-15

I can't figure out how to achieve a certain task in my python script.

I have a dataframe that contains media coverage for a specific topic. One of my columns names the author of the specific article for example...

enter image description here

I am trying to create a pivot table to show the total count for each journalist in the column like so

datajournalist = company1_topline.pivot_table(index='AuthorUsername', values='ContentID', aggfunc= np.count_nonzero )

which would give me something like

AuthorUsername                                                 count_nonzero
Aaron Mehta                                                      1              
Aamer Madhani                                                    1               
Aamer Madhani ; Ben Fox                                          1      

What I am looking for is a way to have the pivot table also count the names that appear in multiple value cells, to get a true count of each author. So for example the row that has "Aamer Madhani ; Ben Fox" would also count towards the count for just "Aamer Madhani" so the row for just "Aamer Madhani" would have a count of 2 instead of 1 etc.... Is there a way to do this? In excel this can be achieved with a SUMIF, but I do not know how to do this with Python/Pandas.

Desired output

AuthorUsername                                                 count_nonzero
Aaron Mehta                                                      1              
Aamer Madhani                                                    2               
Aamer Madhani ; Ben Fox                                          1     

If anyone can point me in the right direction it would be much appreciated.

CodePudding user response:

With your DataFrame having the AuthorUsername column like this:

            AuthorUsername
0              Aaron Mehta
1            Aamer Madhani
2  Aamer Madhani ; Ben Fox

You can do:

import collections

# Remove leading and trailing spaces (if any).
df['AuthorUsername'] = df['AuthorUsername'].str.strip()

# Get unique authors and their counts.
authors_counts = collections.Counter(df['AuthorUsername'].str.split('\s*;\s*').sum())

# Add to new column.
real_counts = collections.defaultdict(lambda: 1, authors_counts)
df['count_nonzero'] = [real_counts[a] for a in df['AuthorUsername']]

print(df)

Result:

            AuthorUsername  count_nonzero
0              Aaron Mehta              1
1            Aamer Madhani              2
2  Aamer Madhani ; Ben Fox              1
  • Related