I have the following dataframe in Pandas:
name | value | in | out |
---|---|---|---|
A | 50 | 1 | 0 |
A | -20 | 0 | 1 |
B | 150 | 1 | 0 |
C | 10 | 1 | 0 |
D | 500 | 1 | 0 |
D | -250 | 0 | 1 |
E | 800 | 1 | 0 |
There are maximally only 2 observations for each name
: one for in
and one for out
.
If there is only in
for a name
there is only one observation for it.
You can create this dataset with this code:
data = {
'name': ['A','A','B','C','D','D','E'],
'values': [50,-20,150,10,500,-250,800],
'in': [1,0,1,1,1,0,1],
'out': [0,1,0,0,0,1,0]
}
df = pd.DataFrame.from_dict(data)
I want to sum the value
column for each name
but only if name
has both in
and out
record. In other words, only when one unique name
has exactly 2 rows.
The result should look like this:
name | value |
---|---|
A | 30 |
D | 250 |
If I run the following code I got all the results without filtering based on in
and out
.
df.groupby('name').sum()
name | value |
---|---|
A | 30 |
B | 150 |
C | 10 |
D | 250 |
E | 800 |
How to add the beforementioned filtering based on columns?
CodePudding user response:
IIUC, you could filter before aggregation:
# check that we have exactly 1 in and 1 out per group
mask = df.groupby('name')[['in', 'out']].transform('sum').eq([1,1]).all(1)
# slice the correct groups and aggregate
out = df[mask].groupby('name', as_index=False)['values'].sum()
Or, you could filter afterwards (maybe less efficient if you have a lot of groups that would be filtered out):
(df.groupby('name', as_index=False).sum()
.loc[lambda d: d['in'].eq(1) & d['out'].eq(1), ['name', 'values']]
)
output:
name values
0 A 30
1 D 250
CodePudding user response:
Maybe you can try something with groupby
, agg
, and query
(like below):
df.groupby('name').agg({'name':'count', 'values': 'sum'}).query('name>1')[['values']]
Output:
values
name
A 30
D 250
You could also make .query('name==2')
in above if you like but assuming it can occur max at 2 .query('name>1')
would also return same.