Home > Software engineering >  Aggregate by unique values & their counts using pandas
Aggregate by unique values & their counts using pandas

Time:12-01

I have a df:

# create generic df with 1 date column and 2 value columns
df = pd.DataFrame({'date': pd.date_range('2020-01-01', '2020-01-31', freq='D'), \
    'value1': np.random.randint(0, 10, 31), \
    'value2': np.random.randint(0, 100, 31),\
    'value3': np.random.randint(0, 1000, 31)})

I want to group by this df by date in W intervals, take the average of value2, count of value3 and distinct values of value1 & the count of those values in this or similar format :

{9:2, 4:1, 6:2, 5:1, 3:1}

[(9, 2), (4,1), (6,2), (5,1), (3,1)]

Basically this represent that in the first week there were 2 counts of value 9 in the column value1 and so on, similar to what df.groupby(pd.Grouper(key='date', freq='W')).value1.value_counts() returns, but trying

df.groupby(pd.Grouper(key='date', freq='W'))\
    .agg({'value1': 'mean', 'value2': 'mean', 'value3': pd.Series.value_counts()})\
    .reset_index()

Returns an error:

TypeError: value_counts() missing 1 required positional argument: 'self'

My desired output should look like this:

date        value2      value3      value_1
2020-01-05  62.600000   5           {1:5, 3:2}
2020-01-12  30.000000   7           {2:2, 3:3, 6:1}
2020-01-19  34.428571   7           {2:2, 3:3, 6:1}
2020-01-26  51.428571   7           {2:1, 4:3, 8:1}
2020-02-02  48.000000   5           {2:1, 3:5, 7:1}

The column value1 as mentioned above can have a different format, such as a list with tuples of values.

CodePudding user response:

Convert values to dictionaries with lambda function:

df = df.groupby(pd.Grouper(key='date', freq='W'))\
    .agg({'value1': 'mean', 'value2': 'mean', 
          'value3': lambda x: x.value_counts().to_dict()})\
    .reset_index()
print (df)
        date    value1     value2  \
0 2020-01-05  3.200000  41.000000   
1 2020-01-12  4.714286  58.714286   
2 2020-01-19  4.285714  65.285714   
3 2020-01-26  6.428571  68.857143   
4 2020-02-02  4.000000  36.600000   

                                              value3  
0           {984: 1, 920: 1, 853: 1, 660: 1, 101: 1}  
1  {421: 1, 726: 1, 23: 1, 408: 1, 398: 1, 493: 1...  
2  {176: 1, 209: 1, 180: 1, 566: 1, 280: 1, 570: ...  
3  {49: 1, 113: 1, 327: 1, 777: 1, 59: 1, 301: 1,...  
4           {113: 1, 983: 1, 181: 1, 239: 1, 839: 1}  

Or use collections.Counter:

from collections import Counter
        
df = df.groupby(pd.Grouper(key='date', freq='W'))\
    .agg({'value1': 'mean', 'value2': 'mean', 'value3': Counter})\
    .reset_index()
  • Related