I'm trying to combine three pandas DataFrames together
One of them (called major
) has a column category
where each row has a unique label :
major_df = pd.DataFrame(np.random.randint(0, 100, size=(3, 2)), columns=list("AB"))
major_df["category"] = pd.Series(["cat_A", "cat_B", "cat_C"])
A B category
0 90 17 cat_A
1 36 81 cat_B
2 90 67 cat_C
Two other dfs (called minor) contains multiple rows and have their own unique column names. Each df has a column
category` where each row has a value that is present in the major df category column :
minor_dfs = {}
for k, cols in zip(("1st", "2nd"), ("CD", "EF")):
minor_dfs[k] = pd.DataFrame(np.random.randint(0, 100, size=(8, 2)), columns=list(cols))
minor_dfs[k]["category"] = np.random.choice(["cat_A", "cat_B", "cat_C"], 8)
Here is an example of one of those minor dfs. The only difference between both is that first minor df has the columns C
and D
, where the second has columns E
and F
.
C D category
0 71 44 cat_C
1 5 88 cat_C
2 8 78 cat_C
3 31 27 cat_C
4 42 48 cat_B
5 18 18 cat_B
6 84 23 cat_A
7 94 23 cat_A
So, my goal is to compute the mean of the values in minor dfs based on the category column, so that at the end, I have the following dfs :
C D
cat_A 89.00 23.00
cat_B 30.00 33.00
cat_C 28.75 59.25
where each column contain the mean of the values that are in each category.
For that, I made the following code, where we create empty DataFrames with the column values of the minor dfs and indices from the different values of categories. I then fill this dataframe using a for loop where I iterate over every value of the index.
copy_dfs = {}
for k, min_df in minor_dfs.items():
# Get columns from minor df
# Get index from category of major df
col_names = min_df.columns.values
ind_values = major_df.category.values
# Create a df with columns and indices and set values to np.nan
copy_df = pd.DataFrame(np.nan, index=ind_values, columns=col_names)
copy_df = copy_df.drop("category", axis=1)
# For each category in the index of the dataframe
for maj_category in copy_df.index:
# Select rows in minor df where category is the same as major df category
minor_rows = min_df[min_df.category == maj_category]
minor_rows = minor_rows.drop("category", axis=1)
# Compute the mean values (by column) of the rows that were selected
# Add the mean values into copy_df, where the index corresponds to major df category
copy_df.loc[maj_category] = minor_rows.mean()
# Store into dict
copy_dfs[k] = copy_df
Yet, I think that this code could be optimized using vectorized operations, especially in the part where I iterate for each row. So I was wondering if there was a easier and clever way to accomplish what I'm trying to do ?
CodePudding user response:
This?
import pandas as pd
df = pd.read_excel('test.xlsx')
df1 = df.groupby(['category']).mean()
print(df)
print(df1)
output:
C D category
0 71 44 cat_C
1 5 88 cat_C
2 8 78 cat_C
3 31 27 cat_C
4 42 48 cat_B
5 18 18 cat_B
6 84 23 cat_A
7 94 23 cat_A
C D
category
cat_A 89.00 23.00
cat_B 30.00 33.00
cat_C 28.75 59.25