I have a large dataset on which I want to run various analyses, but first I need to transform it into matrices grouped by different variables.
For example, here is a toy dataset:
import pandas as pd

myData = pd.DataFrame({
    'dataset': ['cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'dog', 'bird', 'bird', 'bird', 'bird'],
    'category_1': ['orange', 'orange', 'white', 'white', 'black', 'brown', 'brown', 'black', 'red', 'green', 'red', 'green'],
    'category_2': ['this_cat', 'that_cat', 'this_cat', 'that_cat', 'this_dog', 'that_dog', 'this_dog', 'that_dog', 'this_bird', 'that_bird', 'this_bird', 'that_bird'],
    'values': ['1', '8', '9', '2', '5', '4', '3', '10', '0', '2', '7', '9']
})

# Build one pivoted matrix per value of the 'dataset' column
for i, animals in myData.groupby('dataset'):
    tuples = animals.groupby(['category_1', 'category_2'])['values'].mean().reset_index()
    tuples = pd.DataFrame(tuples)
    matrix = tuples.pivot(index='category_2', columns='category_1', values='values').reset_index()
    display(matrix)
Here I am grouping my data by animal (the 'dataset' column) and converting each group into a matrix. However, because the column names are not the same across the matrices, I am having trouble saving the output into an external empty list or dataframe.
For example, I'd like to save each matrix into a separate dataframe that is dynamically generated depending on the number of groups in my data:
output_dfs = {k: pd.DataFrame([]) for k in myData['dataset']}
The desired output in this case would be three separate dataframes that I can access by name (the values are based on the toy dataset):
dataset  category_1  category_2  green  red
bird     0           that_bird   14.5   NaN
bird     1           this_bird   NaN    3.5

dataset  category_1  category_2  orange  white
cat      0           that_cat    8.0     2.0
cat      1           this_cat    1.0     9.0

dataset  category_1  category_2  black  brown
dog      0           that_dog    10.0   4.0
dog      1           this_dog    5.0    3.0
CodePudding user response:
I'm not sure I fully understand the question. Is this the result you want to achieve? Each matrix is stored in a dict, keyed by its group name:
import pandas as pd

myData = pd.DataFrame({
    'dataset': ['cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'dog', 'bird', 'bird', 'bird', 'bird'],
    'category_1': ['orange', 'orange', 'white', 'white', 'black', 'brown', 'brown', 'black', 'red', 'green', 'red', 'green'],
    'category_2': ['this_cat', 'that_cat', 'this_cat', 'that_cat', 'this_dog', 'that_dog', 'this_dog', 'that_dog', 'this_bird', 'that_bird', 'this_bird', 'that_bird'],
    'values': ['1', '8', '9', '2', '5', '4', '3', '10', '0', '2', '7', '9']
})

# Collect one matrix per group; the dict key is the group name ('bird', 'cat', 'dog')
result = {}
for i, animals in myData.groupby('dataset'):
    tuples = animals.groupby(['category_1', 'category_2'])['values'].mean().reset_index()
    tuples = pd.DataFrame(tuples)
    matrix = tuples.pivot(index='category_2', columns='category_1', values='values').reset_index()
    result[i] = matrix
display(result)
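A more compact variant of the same idea, as a sketch only: it assumes the 'values' column is meant to be numeric (in the toy data it holds strings, so it is converted first) and uses pivot_table to aggregate and pivot in one step. The names numeric and matrices are placeholders chosen for this example.

```python
import pandas as pd

myData = pd.DataFrame({
    'dataset': ['cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'dog', 'bird', 'bird', 'bird', 'bird'],
    'category_1': ['orange', 'orange', 'white', 'white', 'black', 'brown', 'brown', 'black', 'red', 'green', 'red', 'green'],
    'category_2': ['this_cat', 'that_cat', 'this_cat', 'that_cat', 'this_dog', 'that_dog', 'this_dog', 'that_dog', 'this_bird', 'that_bird', 'this_bird', 'that_bird'],
    'values': ['1', '8', '9', '2', '5', '4', '3', '10', '0', '2', '7', '9']
})

# Assumption: the values should be treated as numbers, so convert the strings first.
numeric = myData.copy()
numeric['values'] = pd.to_numeric(numeric['values'])

# One pivoted frame per group; pivot_table averages duplicates and pivots in one call.
matrices = {
    name: group.pivot_table(index='category_2', columns='category_1',
                            values='values', aggfunc='mean').reset_index()
    for name, group in numeric.groupby('dataset')
}

# Each matrix can now be looked up by its group name.
print(matrices['bird'])
```

Because every frame lives under its own key, the differing column names do not matter. Note that with numeric values the green/that_bird cell is the mean of 2 and 9, i.e. 5.5, which may differ from figures computed on the original string column.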