Home > Software engineering >  Sum different rows in data frame based on two columns values
Sum different rows in data frame based on two columns values

Time:07-15

For the group where variable 1 is True. I want to sum the column Number in the row where variable 2 is false to the other two rows where variable 2 is True.

   Variable 1  Variable 2  Number
0        True        True      10
1        True        True      20
2        True       False      30
3       False       False       5

Expected output:

   Variable 1  Variable 2  Number
0        True        True      40
1        True        True      50
2        True       False      30
3       False       False       5

If it's possible I would like the code to work for multiple groups where variable 1 is true.

CodePudding user response:

No need for a loop or anything complex. Use simple boolean indexing for in place modification:

m1 = df['Variable 1']
m2 = df['Variable 2']

# add to the rows where m1 AND m2
# the sum of rows where m1 AND NOT m2
df.loc[m1&m2, 'Number']  = df.loc[m1&~m2, 'Number'].sum()

Output:

   Variable 1  Variable 2  Number
0        True        True      40
1        True        True      50
2        True       False      30
3       False       False       5

CodePudding user response:

Hope this what you are expecting, if you run the below code you will get the expected output.

import pandas as pd

data={'Variable 1':[True,True,True,False],'Variable 2':[True,True,False,False], 'Number': [10,20,30,5]}
stack = []
for i in range(len(data['Variable 1'])):
    if data['Variable 1'][i] and data['Variable 2'][i]:
        stack.append((i, data['Number'][i]))
    elif  data['Variable 1'][i] and not  data['Variable 2'][i]:
        while len(stack) > 0:
            a,b = stack.pop()
            data['Number'][a] = data['Number'][a]    data['Number'][i]  

df=pd.DataFrame(data)

print(df)

Please feel free to comment if this is not your case. Thanks and hope this helps.

CodePudding user response:

It was a little bit hard to understand what you want...
my solution is very explicity...
I Tried to guide you step by step...

My code

import pandas as pd

# your sample data as a dict
data={'Variable 1':[True,True,True,False],'Variable 2':[True,True,False,False], 'Number': [10,20,30,5]}

# create a dataframe from created dict
df = pd.DataFrame.from_dict(data)

# replace empty spaces from name columns so you can use DOT NOTATION
df.columns = df.columns.str.replace(' ', '_').str.lower()

# What to do in case at least one is false
df['both_false'] = 0   df.number[(df.variable_1 == False) | (df.variable_2 == False)]
df['both_false'] = df['both_false'].fillna(0)

# What to do in case Both True
df['both_true'] = 0   df.number[(df.variable_1 == True) & (df.variable_2 == True)].sum()

# Calc 1
df['result_1'] = df.loc[(df.variable_1 == True) & (df.variable_2 == True), 'number']   df.loc[(df.variable_1 == True) & (df.variable_2 == True), 'both_true']  
df['result_1'] = df['result_1'].fillna(0)

# Calc 2
df['result_2'] = df.loc[(df.variable_1 == False) | (df.variable_2 == False), 'number'] 
df['result_2'] = df['result_2'].fillna(0)

# Final result in both cases
df['result'] = df.result_1   df.result_2

# just drop unecessary series 
df.drop(['both_false', 'both_true', 'result_1', 'result_2'], axis=1, inplace=True)

Output

- variable_1 variable_2 number result
0 True True 10 40.0
1 True True 20 50.0
2 True False 30 30.0
3 False False 5 5.0

Hope that helps you

  • Related