Home > Software engineering >  Difference among two DataFrame Columns with group by
Difference among two DataFrame Columns with group by

Time:09-28

I have this problem I can't solve: I have two DataFrames: df1: with columns (ID1, timestamp, value1) df2: with columns (ID2, timestamp, value2, ID1) In df2 the relationship between ID1 and ID2 is one to many: a certain value of ID2 corresponds only to one value of ID1, but a certain value of ID1 might be related to many values of ID2. What I want to do is to add a column to df1 which is calculated for every ID1 as the difference between the sum of the df.value2 corresponding to ID1 and value1 (value2 - value1) over every timestamp. For example, let's suppose I have one row in df1 corresponding to ID1 = 0 with value1 = 5 and two rows in df2 corresponding to ID1 = 0, but with values of ID2 equals 'b' and 'c'with respectively value2 = 4 and value2 = 3. The expected result should be: (4 3)-5.

I thought of this solution: df1['calculated'] = df2.groupby(['ID2', pd.Grouper(key='timestamp',freq='H')], as_index=True, group_keys=False)['value2'].sum()-df1.groupby(['ID1', pd.Grouper(key='timestamp',freq='H')], as_index=True, group_keys=False).sum() The problem with this is that if, for example in df1 there are ID1 that doesn't exist in df2.ID1, it return an incompatible index error. So basically the difference has to be computed only for the ID1 that exist in df2.ID1.

Can someone help me please? Thanks in advance!

EDIT: here's some code for a runnable example: input:

data_1 = {'ID1':['0', '1'], 'timestamp':[pd.Timestamp('2021-01-01-00:00:00'), pd.Timestamp('2021-01-01-00:00:00')], 'value1':[5, 7]}
data_2 = {'ID2':['3', '4'], 'timestamp':[pd.Timestamp('2021-01-01-00:00:00'), pd.Timestamp('2021-01-01-00:00:00')], 'value2':[4, 3], 'ID1':['0', '0']}

df1 = pd.DataFrame(data=data_1)
df2 = pd.DataFrame(data=data_2)

output:

ID1  timestamp  value1
0 2021-01-01       5
1 2021-01-01       7
ID2  timestamp  value2 ID1
3 2021-01-01       4   0
4 2021-01-01       3   0

for easy of use I used only one value for the timestamp, but keep it it mind that if it's different the value should be computed for every timestamp (check the pd.Grouper in the question above) So the expected result should be a column like this:

ID1  timestamp  value1  calculated
0 2021-01-01       5        2
1 2021-01-01       7        0

the 0 in the second row is because there is no ID1 = '1' in df2.

CodePudding user response:

I hope I understand your question correctly. Try this solution.

#Calculate the sum in df2 for value2
df2_group = df2.groupby(by=["ID1", "timestamp"], as_index=False).agg({"value2":"sum"}).rename(columns={"value2":"df2_sum"})
df2_group

#Merge df1 and df2
df1 = pd.merge(df1, df2_group, on=["ID1", "timestamp"], how="left")

#Calculate value and drop df2_sum columns.
df1["calculated"] = df1["df2_sum"] - df1["value1"]
df1.drop(columns=["df2_sum"], inplace=True)

#Fill value 0 to ID1 not found in ID2 
df1.calculated = df1.calculated.fillna(0)
df1
Out[178]: 
  ID1  timestamp  value1  calculated
0   0 2021-01-01       5         2.0
1   1 2021-01-01       7         0.0
  • Related