Home > Enterprise >  Sum values from two columns in the group if not duplicate
Sum values from two columns in the group if not duplicate

Time:01-07

I have pandas dataframes that look similar to this:

df = pd.DataFrame({"week": [1, 1, 1, 1, 1],
                   "area1_code1": ["A", "A", "A", "A", "C"],
                   "area1_code2": ["A1", "A1", "A2", "A2", "C1"],
                   "area1_member": [10, 10, 8, 8, 2],
                   "area2_code1": ["B", "B", "B", "B", "D"],
                   "area2_code2": ["B1", "B2", "B1", "B2", "D1"],
                   "area2_member": [3, 3, 3, 3, 6]})
week area1_code1 area1_code2 area1_member area2_code1 area2_code2 area2_member
1 A A1 10 B B1 3
1 A A1 10 B B2 3
1 A A2 8 B B1 3
1 A A2 8 B B2 3
1 C C1 2 D D1 6

I want to group the data by area1_code1 and area2_code1 or by week then find the sum of area1_member and area2_member of all unique area1_code2 and area2_code2 in the group.

Desired output by area1_code1 and area2_code1:

week area1_code1 area2_code1 members
1 A B 24
1 C D 8

Desired output by week:

week members
1 32

I have tried the following which did not give me the desired result by area1_code1 and area2_code1 though it produced the correct result by week.

area1 = df[["week", "area1_code1", "area1_code2", "area1_member"]].drop_duplicates(["week", "area1_code2"])
area1.rename(columns={"area1_code1": "area_code1",
                      "area1_code2": "area_code2",
                      "area1_member": "area_member"}, inplace=True)
area2 = df[["week", "area2_code1", "area2_code2", "area2_member"]].drop_duplicates(["week", "area2_code2"])
area2.rename(columns={"area2_code1": "area_code1",
                      "area2_code2": "area_code2",
                      "area2_member": "area_member"}, inplace=True)
result = pd.concat([area1, area2]).drop_duplicates().reset_index(drop=True)
week area_code1 area_code2 area_member
1 A A1 10
1 A A2 8
1 C C1 2
1 B B1 3
1 B B2 3
1 D D1 6
result_week = result.groupby("week")["area_member"].sum().reset_index()
week area_member
1 32

I have to do this for hundreds of dataframes with more than a million rows. What would be the most efficient solution to this problem?

Thanks!

CodePudding user response:

seems like you wanna do this:

df = df.drop_duplicates(subset=['week', 'area1_code1', 'area2_code1', 'area1_code2'])
df['all_members'] = df['area1_member']   df['area2_member']
result = df.groupby(['week', 'area1_code1', 'area2_code1'])['all_members'].sum().reset_index()
result_week = df.groupby(['week'])['all_members'].sum().reset_index()

output :

   week area1_code1 area2_code1  all_members
0     1           A           B           24
1     1           C           D            8

   week  all_members
0     1           32

CodePudding user response:

You can do:

member_cols = ["area1_member", "area2_member"]
g_cols = ["week", "area1_code1", "area2_code1"]

#Compute sum across axis 1 for member_cols and assign a column and drop duplicates
t = df.assign(area_member=df[member_cols].sum(1)).drop_duplicates(
    [*g_cols, "area_member"], ignore_index=True
)

#Then groupby the groupby cols and just sum the "area_member"
out_1 = t.groupby(g_cols, as_index=False)["area_member"].sum()

#Then groupby week and again sum
out_2 = out_1.groupby("week", as_index=False)["area_member"].sum()

print(out_1)
   week area1_code1 area2_code1  area_member
0     1           A           B           24
1     1           C           D            8

print(out_2)
    week  area_member
0     1           32
  • Related