I got two dataframes looking like this:
df1
:
omg | odds | outcome | |
---|---|---|---|
0 | 2625 | 27 | 1 |
1 | 2625 | 22 | 1 |
2 | 2626 | 10 | 0 |
3 | 2628 | 15 | 1 |
4 | 2628 | 20 | 1 |
df2
:
omgangar | |
---|---|
0 | 2625 |
1 | 2626 |
2 | 2627 |
3 | 2628 |
4 | 2629 |
In df2
I'd like to add a new column that counts the occurrences of df2[omgangar] in df1[omg]
together with some other conditions (odds between two values and outcome equal to 1).
I've tried this code but it throws an error:
df2['outcome'] =
sum((df1['omg'] == df2['omgangar']) & (df1['odds'].between(20,30)) & (df1['outcome'] == 1))
Error:
ValueError: Can only compare identically-labeled Series objects
The desired output of df2
is:
omgangar | outcome | |
---|---|---|
0 | 2625 | 2 |
1 | 2626 | 0 |
2 | 2627 | 0 |
3 | 2628 | 1 |
4 | 2629 | 0 |
CodePudding user response:
There is no need to use countifs, instead apply the filters first and then merge the dataframes.
First filter df1
with the two conditions, then groupby
on omg
and aggregate using count
as follows:
s = df1.loc[df['odds'].between(20,30) & (df['outcome'] == 1)].groupby('omg')['odds'].count()
This will return a series s
which looks like this:
omg
2625 2
2628 1
Name: odds, dtype: int64
Now, merge
s
with df2
and fill any resulting NaN values with 0
:
df2.merge(df, left_on='omgangar', right_index=True, how='left').fillna(0.0)
Result:
omgangar odds
0 2625 2.0
1 2626 0.0
2 2627 0.0
3 2628 1.0
4 2629 0.0
CodePudding user response:
My suggestion would be to use merge and groupby for this task:
import pandas as pd
df1 = pd.DataFrame({
'omg': [2625, 2625, 2626, 2628, 2628],
'odds': [27, 22, 10, 15, 20],
'outcome': [1, 1, 0, 1, 1]})
df2 = pd.DataFrame({'omgangar': list(range(2625, 2630))})
df = df2.merge(df1, left_on = 'omgangar', right_on = 'omg', how='left')
df['outcome'] = (df.outcome == 1) & (df.odds.between(20, 30))
df.groupby('omgangar')[['outcome']].agg('sum')
outcome
omgangar
2625 2
2626 0
2627 0
2628 1
2629 0
CodePudding user response:
Use DataFrame.loc
for filter by mask with column omg
and get counts by Series.value_counts
, for new column use Series.map
with replace missing not matched values to 0
:
s = df1.loc[df1['odds'].between(20,30) & (df1['outcome'] == 1), 'omg'].value_counts()
df2['outcome'] = df2['omgangar'].map(s).fillna(0).astype(int)
print (df2)
omgangar outcome
0 2625 2
1 2626 0
2 2627 0
3 2628 1
4 2629 0