How to group a DataFrame and divide rows?-CodePudding

I need to group a dataframe by "Day" and divide the values from column "Count" ("Clicked"/"Delivered")

Day	Status	Channel	Count
2022-11-08	Delivered	SMS	288322
2022-11-08	Clicked	SMS	1132
2022-11-13	Delivered	SMS	246203
2022-11-13	Clicked	SMS	4684
2022-11-19	Delivered	SMS	154872
2022-11-19	Clicked	SMS	1222
2022-11-20	Delivered	SMS	253716
2022-11-20	Clicked	SMS	2384

I used pandas .groupby() to group by "Day", but i don´t know how to specify to divide "Count". Like if I needed to sum values I'd use .sum(), but there isn´t a .divide() unfortunally.

CodePudding user response：

1. Using `.unstack()` (when one row of each type exists)

You don't need groupby for this if you only have 1 clicked and 1 delivered value per date (as your example suggests).

You can use .unstack() to get the result you need. It doesn't need any sorting of rows as well, only that there are 2 rows per date (one for Clicked, and other for Delivered). Here is how the 2-3 line code below works -

Set Day and Status as multiindex
Unstack the Status index to get 2 columns, Clicked and Delivered
Simply divide these 2 columns after that and reset index

s = df.set_index(['Day','Status'])['Count'].unstack() #<-- Unstack status index
s['Ratio'] = s['Clicked']/s['Delivered']              #<-- Divide clicked & delivered
s = s.reset_index()                                   #<-- Reset index
print(s)

Status         Day  Clicked  Delivered     Ratio
0       2022-11-08     1132     288322  0.003926
1       2022-11-13     4684     246203  0.019025
2       2022-11-19     1222     154872  0.007890
3       2022-11-20     2384     253716  0.009396

2. Using `.groupby()` `.unstack()` (when additional rows of each type exist)

Just to be on the safer side, if you do have multiple rows for "clicked" or "delivered" respectively for a given date (for example, 2 clicked rows and 4 delivered rows for the same date), you can sum the quantities to get a single clicked and a single delivered row, using groupby and then do the same process as above.

Group by Day and Status columns and take sum of counts
Unstack the Status index to get 2 columns, Clicked and Delivered
Simply divide these 2 columns after that and reset index

g = df.groupby(['Day','Status'])['Count'].sum().unstack() #<-- Unstack status index
g['Ratio'] = g['Clicked']/g['Delivered']                  #<-- Divide clicked & delivered
g = g.reset_index()                                       #<-- Reset index
print(g)

Status         Day  Clicked  Delivered     Ratio
0       2022-11-08     1132     288322  0.003926
1       2022-11-13     4684     246203  0.019025
2       2022-11-19     1222     154872  0.007890
3       2022-11-20     2384     253716  0.009396

CodePudding user response：

You may group by two columns at the same time.
As far as I understand you've already done smth like this:

df.groupby(df.Day)

Right?

You may do:

df.groupby([df.Day, df.Status])

df.groupby([df.Day, df.Status]).indices:

{('2022-11-08', 'Clicked'): array([1], dtype=int64),
 ('2022-11-08', 'Delivered'): array([0], dtype=int64),
 ('2022-11-13', 'Clicked'): array([3], dtype=int64),
 ('2022-11-13', 'Delivered'): array([2], dtype=int64),
 ('2022-11-19', 'Clicked'): array([5], dtype=int64),
 ('2022-11-19', 'Delivered'): array([4], dtype=int64),
 ('2022-11-20', 'Clicked'): array([7], dtype=int64),
 ('2022-11-20', 'Delivered'): array([6], dtype=int64)}

CodePudding user response：

Here is one way to do it with GroupBy.agg :

out = (
        df.groupby(["Day", "Channel"],
                   as_index=False, sort=False)
            ["Count"].agg(["first", "last"])
            .assign(result= lambda x: x.pop("last").div(x.pop("first")))
            # .assign(result= lambda x: x["last"].div(x["first"])) #if you need to preserve clicked/delivered values
            .reset_index()
)

# Output :

print(out)

          Day Channel    result
0  2022-11-08     SMS  0.003926
1  2022-11-13     SMS  0.019025
2  2022-11-19     SMS  0.007890
3  2022-11-20     SMS  0.009396

NB : You need to make sure that the rows are sorted with Delivered/Clicked.

1. Using .unstack() (when one row of each type exists)

2. Using .groupby() .unstack() (when additional rows of each type exist)

# Output :

1. Using `.unstack()` (when one row of each type exists)

2. Using `.groupby()` `.unstack()` (when additional rows of each type exist)