Here's an example of my dataframe:
d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'],
     'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1],
     'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)
df
   group  round  score
0      a      3    0.3
1      a      3    0.1
2      a      2    0.6
3      b      1    0.8
4      b      3    0.2
5      b      1    0.5
6      b      3    0.5
7      b      3    0.6
8      c      3    0.4
9      d      2    0.9
10     d      1    0.1
My actual dataframe has 6 columns and more than 1,000,000 rows. I'm trying to figure out the fastest way to do the following:
For each group, find the average of the scores for each of 3 rounds and perform a calculation with it. If a group has no scores for a round, write 'NA'.
I wasn't sure whether it would be faster to build a list of lists and then convert it into a dataframe, or to make a new dataframe and populate that, so I went with the list first:
def test_df(data):
    value_counts = data['group'].value_counts().to_dict()
    avgs = []
    for key, val in value_counts.items():
        row = data[data['group'] == key]
        x = [key]
        if val < 2:
            # single-row group: no averaging needed
            x.extend([10 * row['score'].values[0] + 1 if i == row['round'].values[0] else 'NA'
                      for i in range(1, 4)])
        else:
            x.extend([(10 * row[row['round'] == i]['score'].mean() + 1) if len(row[row['round'] == i]) > 0 else 'NA'
                      for i in range(1, 4)])
        avgs.append(x)
    return avgs
Here I created a separate case because about 80% of the groups in my data have only one row, so I figured that might speed things up.
This returns the correct results in the format [group, round 1, round 2, round 3]:
[['b', 7.5, 'NA', 5.333333333333333],
 ['a', 'NA', 7.0, 3.0],
 ['d', 2.0, 10.0, 'NA'],
 ['c', 'NA', 'NA', 5.0]]
But it's looking like it's going to take a really long time on the actual dataframe... Does anyone have any better ideas?
CodePudding user response:
It looks to me like you're basically doing a groupby/mean and a pivot.
import pandas as pd

d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'],
     'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1],
     'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)

# average the scores per (group, round), then apply the calculation
df = (df.groupby(['group', 'round'])['score'].mean() * 10 + 1).reset_index()
# pivot rounds into columns, filling missing (group, round) pairs with 'NA'
df.pivot_table(index='group', columns='round', values='score', fill_value='NA').reset_index().values
Output
array([['a', 'NA', 7.0, 3.0],
       ['b', 7.5, 'NA', 5.333333333333333],
       ['c', 'NA', 'NA', 5.0],
       ['d', 2.0, 10.0, 'NA']], dtype=object)
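Note that filling with the string 'NA' forces the whole array to object dtype. If you need the result to stay numeric for downstream math, one variant is to leave the gaps as real NaN values; here is a minimal sketch (the name out is just for illustration). After the groupby there is exactly one row per (group, round), so a plain pivot suffices:

import pandas as pd

d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'],
     'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1],
     'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)

# same groupby/mean and calculation as above, but keep NaN for missing rounds
out = ((df.groupby(['group', 'round'])['score'].mean() * 10 + 1)
       .reset_index()
       .pivot(index='group', columns='round', values='score'))
print(out)  # NaN marks rounds with no scores; use out.fillna('NA') only for display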
CodePudding user response:
An imbalanced dataset may show different results, but I tested with the scripts below and found that even with a pandas dataframe the performance looks okay. You can always compare it against a native Python data structure.
import random
import datetime
import pandas as pd

def generate_data():  # augmentation
    data = {'group': [], 'round': [], 'score': []}
    for index in range(10 ** 6):  # sample size
        data['group'].append(random.choice(['a', 'b', 'c', 'd']))
        data['round'].append(random.randrange(1, 4))
        data['score'].append(round(random.random(), 1))
    return data

def calc_with_native_ds(data):  # native python data structure
    pass

def calc_with_pandas_df(df):  # pandas dataframe
    return df.groupby(['group', 'round']).mean()

if __name__ == '__main__':
    data = generate_data()
    df = pd.DataFrame(data)
    print(df.shape)

    start_datetime = datetime.datetime.now()
    # calc_with_native_ds(data)
    calc_with_pandas_df(df)
    end_datetime = datetime.datetime.now()
    elapsed_time = round((end_datetime - start_datetime).total_seconds(), 5)
    print(f"elapsed_time: {elapsed_time}")