How can we divide by a count of multiple items in a dataframe?-CodePudding

I have a merged dataframe that looks like this.

import requests
import pandas as pd
from pandas import DataFrame
from matplotlib import pyplot as plt
import seaborn as sns

# Intitialise data of lists
data = [{'LocationA': '1001', 'YearA':'2013', 'Revenue':'55000', 'LocationB':'1001', 'YearB': '2013', 'Type':'Capital', 'Expense':'4000'}, 
       {'LocationA': '1001', 'YearA':'2014',  'Revenue':'195000', 'LocationB':'1001', 'YearB': '2014', 'Type':'Capital', 'Expense':'40000'},
       {'LocationA': '1001', 'YearA':'2015',  'Revenue':'10000', 'LocationB':'1001', 'YearB': '2015', 'Type':'Capital', 'Expense':'1000'}, 
       {'LocationA': '1001', 'YearA':'2015',  'Revenue':'10000', 'LocationB':'1001', 'YearB': '2015', 'Type':'Operating', 'Expense':'5000'},
       {'LocationA': '1001', 'YearA':'2016',  'Revenue':'25000', 'LocationB':'1001', 'YearB': '2016', 'Type':'Operating', 'Expense':'10000'}]
df = pd.DataFrame(data)
df

I tried to do the merge using inner, left, right, and outer. In all scenarios, I get Revenue = 10000 repeating 2x and, I guess, this happens because I have a Capital of 1000 and Operating of 5000.

My question is this...

How can I do a count of matches in multiple columns and divide by the count? In this scenario, LocationA is duped, YearA AND Revenue AND LocationB AND YearB, are dupes. These, combined, are counted 2x. I'd like to derive this 2, somehow, and divide Revenue by 2. How can I do that? Or, is the whole concept flawed?

CodePudding user response：

You can apply a custom function to get your desired result:

print (df.assign(Revenue=df["Revenue"].astype(int))
         .groupby(["LocationA", "YearA", "LocationB", "YearB"])["Revenue"]
         .apply(lambda d: d.mean()/d.count()))

LocationA  YearA  LocationB  YearB
1001       2013   1001       2013      55000.0
           2014   1001       2014     195000.0
           2015   1001       2015       5000.0
           2016   1001       2016      25000.0
Name: Revenue, dtype: float64

Note that mean was used but you can also use the first value like d.iat[0]/d.count() instead.