How to count values in one dataframe matching the key from another?

I want to count values in one dataframe matching the key from another.

What I have:

df_a

import pandas as pd

df_a = pd.DataFrame(data = {'keys':['key1', 'key2', 'key3'], 'total':[0, 0, 0], '>5':''})
df_a

output:

    keys    total   >5
0   key1    0   
1   key2    0   
2   key3    0

df_b

df_b = pd.DataFrame(data = {'keys':['key1', 'key1', 'key2', 'key2', 'key3'], 'value':[3, 7, 8, 4, 10]})
df_b

output:

    keys    value
0   key1    3
1   key1    7
2   key2    8
3   key2    4
4   key3    10

What I expect:

I want to count how many values in df_b match each key from df_a. I also want to know what fraction of those values is greater than 5. As a result, I want to fill df_a like this:

    keys    total   >5
0   key1    2       0.5
1   key2    2       0.5
2   key3    1       1.0

What I have tried:

I tried to iterate over the keys column in df_a and the keys column in df_b, check whether they match, and then use the row index. But it ended with an error.

for key_a in df_a['keys']:
    for key_b in df_b['keys']:
        if key_a == key_b:
            row_index = df_b.iloc['keys'][key_b]
            df_a['total'][row_index]  = 1

output:

TypeError: Cannot index by location index with a non-integer key

I know this is a clumsy way of solving my problem. Can you help me, please? What should I do to make it work correctly?

CodePudding user response：

Create a helper column that flags values greater than 5, then aggregate it per key with size (the count) and mean (the fraction above 5), and join the result to df_a:

df = (df_b.assign(tmp = df_b['value'].gt(5))
          .groupby('keys')
          .agg(**{'total':('tmp','size'),'>5':('tmp','mean')}))
print (df)
      total   >5
keys            
key1      2  0.5
key2      2  0.5
key3      1  1.0


df = df_a[['keys']].join(df, on='keys')
print (df)
   keys  total   >5
0  key1      2  0.5
1  key2      2  0.5
2  key3      1  1.0
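
For convenience, the steps above can be put together into one self-contained script. This is only a sketch under some assumptions: the helper name summarize and its threshold parameter are hypothetical (not from the question), and df_a is reduced to its keys column before joining.

```python
import pandas as pd

df_a = pd.DataFrame({'keys': ['key1', 'key2', 'key3']})
df_b = pd.DataFrame({'keys': ['key1', 'key1', 'key2', 'key2', 'key3'],
                     'value': [3, 7, 8, 4, 10]})

def summarize(df_a, df_b, threshold=5):
    # Hypothetical helper: per key, count the rows and take the share above threshold.
    stats = (df_b.assign(tmp=df_b['value'].gt(threshold))   # boolean flag per row
                 .groupby('keys')
                 .agg(**{'total': ('tmp', 'size'),           # rows per key
                         '>5': ('tmp', 'mean')}))            # mean of booleans = fraction of True
    # join keeps every row of df_a; keys missing from df_b would get NaN
    return df_a[['keys']].join(stats, on='keys')

result = summarize(df_a, df_b)
print(result)
```

Because join keeps all rows of df_a, a key that never appears in df_b ends up with NaN in both columns; something like result['total'].fillna(0) could be used if zeros are preferred there.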

CodePudding user response：

Remove the '>5' and 'total' columns from df_a; they are unnecessary, because both are computed from df_b.

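df_a = df_a[['keys']]  # keep only the key column, as described above

# 1 where value > 5, otherwise NaN, so that 'count' only counts values above 5
df_b['>5'] = df_b['value'].apply(lambda x: 1 if x > 5 else None)

# per key: number of values and number of values above 5
aggs = {'keys': 'first', 'value': 'count', '>5': 'count'}
df_b_new = df_b.groupby('keys').agg(aggs).reset_index(drop=True)

df = pd.merge(df_a, df_b_new, on='keys', how='left')
df['>5'] = df['>5'] / df['value']  # fraction of values above 5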

OUTPUT:

    keys    value   >5
0   key1    2   0.5
1   key2    2   0.5
2   key3    1   1.0

EDIT: to rename the column from value to total:

df.rename(columns={'value': 'total'}, inplace=True)
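
As for the TypeError in the question: .iloc only accepts integer positions, so df_b.iloc['keys'] fails before any counting happens. If a loop is still wanted, a minimal sketch could select the matching rows with a boolean mask and write the results with .loc. It assumes the df_a and df_b from the question, except that the '>5' column starts as 0.0 instead of '' so that it stays numeric.

```python
import pandas as pd

# Same data as in the question; '>5' starts as 0.0 (an assumption) to keep a float column
df_a = pd.DataFrame({'keys': ['key1', 'key2', 'key3'], 'total': [0, 0, 0], '>5': 0.0})
df_b = pd.DataFrame({'keys': ['key1', 'key1', 'key2', 'key2', 'key3'],
                     'value': [3, 7, 8, 4, 10]})

for i, key in df_a['keys'].items():
    # boolean mask + .loc selects by condition and column label, unlike .iloc
    matches = df_b.loc[df_b['keys'] == key, 'value']
    df_a.loc[i, 'total'] = len(matches)          # how many values belong to this key
    df_a.loc[i, '>5'] = (matches > 5).mean()     # fraction of them greater than 5

print(df_a)
```

This scans df_b once per key, so the groupby-based answers are likely the better choice for larger data.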