I have pandas dataframes that look similar to this:
import pandas as pd

df = pd.DataFrame({"week": [1, 1, 1, 1, 1],
                   "area1_code1": ["A", "A", "A", "A", "C"],
                   "area1_code2": ["A1", "A1", "A2", "A2", "C1"],
                   "area1_member": [10, 10, 8, 8, 2],
                   "area2_code1": ["B", "B", "B", "B", "D"],
                   "area2_code2": ["B1", "B2", "B1", "B2", "D1"],
                   "area2_member": [3, 3, 3, 3, 6]})
week | area1_code1 | area1_code2 | area1_member | area2_code1 | area2_code2 | area2_member |
---|---|---|---|---|---|---|
1 | A | A1 | 10 | B | B1 | 3 |
1 | A | A1 | 10 | B | B2 | 3 |
1 | A | A2 | 8 | B | B1 | 3 |
1 | A | A2 | 8 | B | B2 | 3 |
1 | C | C1 | 2 | D | D1 | 6 |
I want to group the data either by area1_code1 and area2_code1 or by week, then sum area1_member over the unique area1_code2 values and area2_member over the unique area2_code2 values in each group. For the A/B group, for example, that is 10 + 8 for A1 and A2 plus 3 + 3 for B1 and B2, giving 24.
Desired output by area1_code1 and area2_code1:
week | area1_code1 | area2_code1 | members |
---|---|---|---|
1 | A | B | 24 |
1 | C | D | 8 |
Desired output by week:
week | members |
---|---|
1 | 32 |
I have tried the following, which produced the correct result by week but not the desired result by area1_code1 and area2_code1: after the concat, each row carries only one of the two codes, so I can no longer group by the pair.
# Keep one row per unique (week, area1_code2) and (week, area2_code2),
# then stack both areas into a single set of columns
area1 = df[["week", "area1_code1", "area1_code2", "area1_member"]].drop_duplicates(["week", "area1_code2"])
area1.rename(columns={"area1_code1": "area_code1",
                      "area1_code2": "area_code2",
                      "area1_member": "area_member"}, inplace=True)
area2 = df[["week", "area2_code1", "area2_code2", "area2_member"]].drop_duplicates(["week", "area2_code2"])
area2.rename(columns={"area2_code1": "area_code1",
                      "area2_code2": "area_code2",
                      "area2_member": "area_member"}, inplace=True)
result = pd.concat([area1, area2]).drop_duplicates().reset_index(drop=True)
week | area_code1 | area_code2 | area_member |
---|---|---|---|
1 | A | A1 | 10 |
1 | A | A2 | 8 |
1 | C | C1 | 2 |
1 | B | B1 | 3 |
1 | B | B2 | 3 |
1 | D | D1 | 6 |
result_week = result.groupby("week")["area_member"].sum().reset_index()
week | area_member |
---|---|
1 | 32 |
I have to do this for hundreds of dataframes, each with more than a million rows. What would be the most efficient solution to this problem?
Thanks!
CodePudding user response:
Seems like you want to do this:
# Keep the first row for each unique area1_code2 within each week/code1 pair
df = df.drop_duplicates(subset=['week', 'area1_code1', 'area2_code1', 'area1_code2'])
# Sum the two member columns row-wise, then aggregate at both granularities
df['all_members'] = df['area1_member'] + df['area2_member']
result = df.groupby(['week', 'area1_code1', 'area2_code1'])['all_members'].sum().reset_index()
result_week = df.groupby(['week'])['all_members'].sum().reset_index()
Output:
week area1_code1 area2_code1 all_members
0 1 A B 24
1 1 C D 8
week all_members
0 1 32
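Since speed matters at your scale, one optional tweak: groupby sorts the group keys by default, so if you don't care about the order of the output, passing sort=False may save some time on large frames:
# sort=False skips sorting the group keys; groups appear in order of first occurrence
result = df.groupby(['week', 'area1_code1', 'area2_code1'], sort=False)['all_members'].sum().reset_index()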
CodePudding user response:
You can do:
member_cols = ["area1_member", "area2_member"]
g_cols = ["week", "area1_code1", "area2_code1"]
# Sum member_cols across axis 1, assign the result as a new column,
# and drop duplicate (group, member-sum) rows
t = df.assign(area_member=df[member_cols].sum(axis=1)).drop_duplicates(
    [*g_cols, "area_member"], ignore_index=True
)
# Group by the grouping columns and sum "area_member"
out_1 = t.groupby(g_cols, as_index=False)["area_member"].sum()
# Then group by week and sum again
out_2 = out_1.groupby("week", as_index=False)["area_member"].sum()
print(out_1)
week area1_code1 area2_code1 area_member
0 1 A B 24
1 1 C D 8
print(out_2)
week area_member
0 1 32
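Since you need to run this over hundreds of dataframes, you could wrap the steps above in a function and apply it to each frame. A minimal sketch, assuming the frames live in a list (the summarize name and the frames list are placeholders for your own code):
def summarize(df):
    member_cols = ["area1_member", "area2_member"]
    g_cols = ["week", "area1_code1", "area2_code1"]
    # Row-wise member sum, dedupe, then aggregate at both granularities
    t = df.assign(area_member=df[member_cols].sum(axis=1)).drop_duplicates(
        [*g_cols, "area_member"], ignore_index=True
    )
    by_group = t.groupby(g_cols, as_index=False)["area_member"].sum()
    by_week = by_group.groupby("week", as_index=False)["area_member"].sum()
    return by_group, by_week

# frames = [df1, df2, ...]  # your hundreds of dataframes
# results = [summarize(f) for f in frames]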