Code optimisation for groupby in Python

Time:10-23

I'm looking to optimise some Python code, but I'm not sure how to approach the problem, since I've mainly used Python for analysing data and have limited programming skills. Any input is welcome.

My data looks like this:

X   Y           Stock           Number
A   10-20       id1             5
A   30-40       id2             7
A   0-10        id3             18
B   30-40       id4             3
B   10-20       id5             9
C   10-20       id6             11
C   0-10        id7             9

I use a groupby to analyse the data in this instance:

# Groupby
groupby = df.groupby(['x', 'y']).agg({'stock':'count','number':'mean'}).reset_index().persist()
groupby.columns=['x', 'y', 'total_stocks', 'mean_number']
# Calculate proportions
groupby['stock_sum'] = groupby.groupby('y')['total_stocks'].transform('sum')  # total_stocks exists only on the aggregated frame
groupby['proportion'] = groupby['total_stocks'] / groupby['stock_sum']

Now, I have several more variables like 'X' in the dataset (let's call them U, V, W, ...) for which I would like to repeat this groupby step. I know the basics of loops and functions, and I imagine I could make a list ['X', 'U', 'V', 'W'] and then use a function to do the groupby, but I'm struggling to see how to incorporate the list (and loop over its items) in the function.

CodePudding user response:

I hope I am understanding your question correctly. Here is a simple example that loops over your variables in place of X. The same pattern works for any further variables you add to the list.

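var_list = ['X', 'U', 'V', 'W'] # list of variables (must match the column names in df)
results = {} # one result table per variable, so each iteration is kept
for item in var_list:
  groupby = df.groupby([item,'y']).agg({'stock':'count','number':'mean'}).reset_index().persist()
  groupby.columns=[item, 'y', 'total_stocks', 'mean_number']
  # Calculate proportions on the aggregated frame (df has no 'total_stocks' column)
  groupby['stock_sum'] = groupby.groupby('y')['total_stocks'].transform('sum')
  groupby['proportion'] = groupby['total_stocks'] / groupby['stock_sum']
  results[item] = groupby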
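
Since the question asks about using a function, the loop above could also be wrapped in one. The following is only a sketch under a few assumptions: it uses plain pandas, so the .persist() call (which looks like Dask) is left out; the function name summarise_by and the extra column 'u' are made up for illustration; and the column names are assumed to be lowercase, as in the question's code.

```python
import pandas as pd

def summarise_by(df, var, bin_col="y"):
    """Count stocks and average `number` for each (var, bin_col) group."""
    out = (
        df.groupby([var, bin_col])
          .agg(total_stocks=("stock", "count"), mean_number=("number", "mean"))
          .reset_index()
    )
    # Share of each group within its bin: divide by the bin's total stock count.
    out["stock_sum"] = out.groupby(bin_col)["total_stocks"].transform("sum")
    out["proportion"] = out["total_stocks"] / out["stock_sum"]
    return out

# Sample data from the question; the 'u' column is hypothetical.
df = pd.DataFrame({
    "x": ["A", "A", "A", "B", "B", "C", "C"],
    "u": ["P", "Q", "P", "Q", "P", "Q", "P"],
    "y": ["10-20", "30-40", "0-10", "30-40", "10-20", "10-20", "0-10"],
    "stock": ["id1", "id2", "id3", "id4", "id5", "id6", "id7"],
    "number": [5, 7, 18, 3, 9, 11, 9],
})

var_list = ["x", "u"]
# One result table per variable, keyed by the variable name.
results = {var: summarise_by(df, var) for var in var_list}
print(results["x"])
```

Keeping the tables in a dictionary means results['x'], results['u'], and so on stay available after the loop instead of being overwritten on each pass.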