Count the values after an underscore in a Pandas Series


I have the following DataFrame:

ID  event_1_positive  event_2_negative  event_3_neutral  event_4_negative  event_5_positive
 1                 1                 1                0                 1                 0
 2                 1                 0                0                 0                 1
 3                 0                 1                1                 1                 0
 4                 1                 1                0                 0                 0
 5                 0                 0                1                 1                 1
 6                 0                 1                1                 0                 1
 7                 1                 1                1                 0                 0
 8                 0                 1                0                 0                 0
 9                 1                 1                1                 1                 0
10                 0                 0                0                 0                 1

What I'd like to do is group the events by their effect (positive, negative or neutral) and return the count for each effect.

In this case, my expected output would be:

0   negative   11
1   positive   9
2   neutral    5

CodePudding user response:

Use filter and groupby:

>>> df.filter(like='event') \
      .groupby(lambda x: x.rsplit('_', 1)[1], axis=1) \
      .sum().sum()

negative    11
neutral      5
positive     9
dtype: int64
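Note that groupby(..., axis=1) is deprecated in pandas 2.x. A sketch of an equivalent without it (the sample frame is retyped from the question): sum each event column first, then group the per-column totals by the suffix after the last underscore.

```python
import pandas as pd

# Sample data retyped from the question
df = pd.DataFrame({
    "ID": range(1, 11),
    "event_1_positive": [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "event_2_negative": [1, 0, 1, 1, 0, 1, 1, 1, 1, 0],
    "event_3_neutral":  [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    "event_4_negative": [1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    "event_5_positive": [0, 1, 0, 0, 1, 1, 0, 0, 0, 1],
})

# Sum each event column, then group the resulting Series
# by the effect suffix after the last underscore
counts = (df.filter(like="event")
            .sum()
            .groupby(lambda name: name.rsplit("_", 1)[1])
            .sum())
print(counts)
```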

CodePudding user response:

You could create new columns that add together all the columns with the same effect, e.g. df["negative"] = df["event_2_negative"] + df["event_4_negative"], then drop the original event columns and call df["negative"].sum() (or inspect df["negative"].value_counts()) to get the count for each effect.
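A sketch of that idea on the sample data (retyped from the question): combine the columns that share an effect, then sum each combined column.

```python
import pandas as pd

# Sample data retyped from the question
df = pd.DataFrame({
    "ID": range(1, 11),
    "event_1_positive": [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "event_2_negative": [1, 0, 1, 1, 0, 1, 1, 1, 1, 0],
    "event_3_neutral":  [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    "event_4_negative": [1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    "event_5_positive": [0, 1, 0, 0, 1, 1, 0, 0, 0, 1],
})

# Combine the columns that share an effect
df["negative"] = df["event_2_negative"] + df["event_4_negative"]
df["positive"] = df["event_1_positive"] + df["event_5_positive"]
df["neutral"] = df["event_3_neutral"]

# Summing each combined column counts the flagged events per effect
counts = df[["negative", "positive", "neutral"]].sum()
print(counts)
```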

CodePudding user response:

This is a rare case where summing in a loop seems practical, as there is a manageable number of columns:

import pandas as pd

# start a dictionary to collect the per-category totals
categories = {k: 0 for k in ("negative", "positive", "neutral")}

for colname in df:  # dataframes are iterable by-column name!
    if "_" not in colname:
        continue  # or consider alerting on this condition
    for category in categories.keys():
        if category in colname:  # substring compare
            categories[category] += df[colname].sum()
            break  # successfully found category -> next colname
    else:  # didn't find and break
        raise ValueError(f"no category for {colname}")

# create a collection of results
df_results = pd.Series(categories).to_frame().reset_index()

      index   0
0  negative  11
1  positive   9
2   neutral   5

CodePudding user response:

One option is to reshape the data so that all the negative values end up in one column, the positives in another, and the neutrals in a third. Your column names share a pattern (each ends in positive, negative, or neutral).

We can use that pattern in pivot_longer from pyjanitor to reshape the data, then aggregate with a sum:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
(df.pivot_longer(index='ID', 
                 names_to=('positive', 'negative', 'neutral'), 
                 names_pattern=('positive', 'negative', 'neutral'))
   .iloc[:, 1:]
   .sum()
) 
positive     9.0
negative    11.0
neutral      5.0
dtype: float64
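If you would rather stay in plain pandas, the same reshape-then-aggregate idea works with melt (a sketch; the frame is retyped from the question):

```python
import pandas as pd

# Sample data retyped from the question
df = pd.DataFrame({
    "ID": range(1, 11),
    "event_1_positive": [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "event_2_negative": [1, 0, 1, 1, 0, 1, 1, 1, 1, 0],
    "event_3_neutral":  [0, 0, 1, 0, 1, 1, 1, 0, 1, 0],
    "event_4_negative": [1, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    "event_5_positive": [0, 1, 0, 0, 1, 1, 0, 0, 0, 1],
})

# Reshape to long form: one row per (ID, event) pair
long = df.melt(id_vars="ID", var_name="event", value_name="flag")

# The effect is the part after the last underscore
long["effect"] = long["event"].str.rsplit("_", n=1).str[-1]

counts = long.groupby("effect")["flag"].sum().sort_values(ascending=False)
print(counts)
```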

CodePudding user response:

(df
 .drop("ID", axis=1)
 .rename(columns=lambda x: x.split("_")[2])
 .unstack()
 .droplevel(1)
 .reset_index()
 .groupby("index")
 .sum())

#           0
# index     
# negative  11
# neutral   5
# positive  9