Pandas: Generate column on groupby and value_counts


The goal is to generate a column 'pct', grouped by 'id', where pct = (first value of 'pts' in the group * 100) / (number of rows in that group where 'x' and 'y' are both NaN). For example, when id=1, pct = (5 * 100) / 2 = 250. This should be applied across the whole dataframe.

Sample df:

   id pts   x       y
0   1   5   NaN     NaN
1   1   5   1.0     NaN
2   1   5   NaN     NaN
3   2   8   NaN     NaN
4   2   8   2.0     1.0
5   3   7   NaN     NaN
6   3   7   NaN     5.0
7   3   7   NaN     NaN
8   3   7   NaN     NaN
9   4   1   NaN     NaN

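For reference, the sample frame above can be reconstructed like this (a sketch; the values are transcribed from the table):

```python
import numpy as np
import pandas as pd

# Reconstruction of the sample frame from the question.
df = pd.DataFrame({
    'id':  [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    'pts': [5, 5, 5, 8, 8, 7, 7, 7, 7, 1],
    'x':   [np.nan, 1.0, np.nan, np.nan, 2.0,
            np.nan, np.nan, np.nan, np.nan, np.nan],
    'y':   [np.nan, np.nan, np.nan, np.nan, 1.0,
            np.nan, 5.0, np.nan, np.nan, np.nan],
})
print(df)
```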
Expected Output:

   id pts   x       y       pct
0   1   5   NaN     NaN     250
1   1   5   1.0     NaN     250
2   1   5   NaN     NaN     250
3   2   8   NaN     NaN     800
4   2   8   2.0     1.0     800
5   3   7   NaN     NaN     233
6   3   7   NaN     5.0     233
7   3   7   NaN     NaN     233
8   3   7   NaN     NaN     233
9   4   1   NaN     NaN     100

I tried:

df['pct'] = df.groupby('id')['pts']/df.groupby('id')['x']['y'].count(axis=1)* 100

CodePudding user response:

This works:

df['pct'] = df['id'].map(df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum()))

Output:

>>> df
   id  pts    x    y  pct
0   1    5  NaN  NaN  250
1   1    5  1.0  NaN  250
2   1    5  NaN  NaN  250
3   2    8  NaN  NaN  800
4   2    8  2.0  1.0  800
5   3    7  NaN  NaN  233
6   3    7  NaN  5.0  233
7   3    7  NaN  NaN  233
8   3    7  NaN  NaN  233
9   4    1  NaN  NaN  100
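One thing worth noting about the solution: it uses `//` (floor division), which is why id=3 yields 233 rather than 233.33…. A quick check of the two operators:

```python
# Floor division truncates the quotient; true division keeps the fraction.
print(7 * 100 // 3)  # 233
print(7 * 100 / 3)   # ~233.33
```

Swap `//` for `/` (and round or cast as needed) if fractional percentages are wanted.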

Explanation

>>> df[['x', 'y']]
     x    y
0  NaN  NaN
1  1.0  NaN
2  NaN  NaN
3  NaN  NaN
4  2.0  1.0
5  NaN  NaN
6  NaN  5.0
7  NaN  NaN
8  NaN  NaN
9  NaN  NaN

First, we create a mask over the selected x and y columns where each value is True if it is NaN and False if it is not:

>>> df[['x', 'y']].isna()
0   True   True
1  False   True
2   True   True
3   True   True
4  False  False
5   True   True
6   True  False
7   True   True
8   True   True
9   True   True

Next, we count how many NaNs were in each row by summing horizontally. Since True is interpreted as 1 and False as 0, this works:

>>> df[['x', 'y']].isna().sum(axis=1)
0    2
1    1
2    2
3    2
4    0
5    2
6    1
7    2
8    2
9    2

Then, we mark the rows that had 2 NaN values (2 because x and y are 2 columns):

>>> df[['x', 'y']].isna().sum(axis=1).eq(2)
0     True
1    False
2     True
3     True
4    False
5     True
6    False
7     True
8     True
9     True

Finally, we count how many True values there were (a True value means that row contained only NaNs), by summing the True values again:

>>> df[['x', 'y']].isna().sum(axis=1).eq(2).sum()
7

In the actual solution this runs inside a .groupby(...).apply(...) call, so the code is executed once per id group rather than across the whole dataframe as in this walkthrough. But the concepts are identical:

>>> df.groupby('id').apply(lambda x: x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1    2
2    1
3    3
4    1
dtype: int64

So for id = 1, 2 rows have x and y NaN. For id = 2, 1 row has x and y NaN. And so on...


The first part of the code inside the groupby call:

x['pts'].iloc[0] * 100

For each group, it selects the 0th (first) value of 'pts' and multiplies it by 100:

>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100)
id
1    500
2    800
3    700
4    100
dtype: int64

Combined with the other code just explained:

>>> df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
id
1    250
2    800
3    233
4    100
dtype: int64

Finally, we map the values in id to the values we've just computed (notice in the above that the results are indexed by the values of id):

>>> df['id']
0    1
1    1
2    1
3    2
4    2
5    3
6    3
7    3
8    3
9    4
Name: id, dtype: int64

>>> computed = df.groupby('id').apply(lambda x: x['pts'].iloc[0] * 100 // x[['x', 'y']].isna().sum(axis=1).eq(2).sum())
>>> computed
id
1    250
2    800
3    233
4    100
dtype: int64

>>> df['id'].map(computed)
0    250
1    250
2    250
3    800
4    800
5    233
6    233
7    233
8    233
9    100
Name: id, dtype: int64
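As an aside, the same result can be computed without apply or map, using vectorized groupby(...).transform(...) calls, which broadcast per-group results back onto the original index. A sketch on the same sample frame (`both_nan`, `counts`, and `first_pts` are names introduced here):

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question.
df = pd.DataFrame({
    'id':  [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    'pts': [5, 5, 5, 8, 8, 7, 7, 7, 7, 1],
    'x':   [np.nan, 1.0, np.nan, np.nan, 2.0,
            np.nan, np.nan, np.nan, np.nan, np.nan],
    'y':   [np.nan, np.nan, np.nan, np.nan, 1.0,
            np.nan, 5.0, np.nan, np.nan, np.nan],
})

# True where both x and y are NaN in a row.
both_nan = df[['x', 'y']].isna().all(axis=1)

# Per-id count of such rows, broadcast back to every row of its group.
counts = both_nan.groupby(df['id']).transform('sum')

# First pts value per id, broadcast the same way.
first_pts = df.groupby('id')['pts'].transform('first')

df['pct'] = first_pts * 100 // counts
print(df)
```

Because transform already returns a Series aligned with df's index, no .map step is needed.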