Home > Software engineering >  Applying function after Conditional Group By in Python
Applying function after Conditional Group By in Python

Time:09-09

This is a simplified version of my problem: given this dataset

| link | category |
| ---- | -------- |
|  1   |     0    |
|  1   |     0    |
|  1   |     1    |
|  2   |     0    |
|  3   |     1    |
|  3   |     1    |

I would like to obtain the following:

| link | category counts |
| ---- | --------------- |
|  1   |      0 : 2      |
|      |      1 : 1      |
| ---- | --------------- |
|  3   |      0 : 0      |
|      |      1 : 2      |

Basically, i need to group data by link, then if the link occurs more than once, i need to apply the value_counts function to category, such as each level is displayed (e.g. for link n° 3 both levels of category are displayed, even if category 0 never occurs).

My original problem refers to a bigger dataset with many more links and categories, but solving this simplified version should help me a lot.

Here is the code for generating the simplified dataset:

df = pd.DataFrame({'link' : [1,1,1,2,3,3],
               'category' : [0,0,1,0,1,1]})
df

So far I used different strategies to reach my goal without success. The closest one seems to be this:

df.groupby('link')['category'].value_counts(sort=False)

That gives the following:

| link | category counts |
| ---- | --------------- |
|  1   |      0 : 2      |
|      |      1 : 1      |
| ---- | --------------- |
|  2   |      0 : 1      |
| ---- | --------------- |
|  3   |      1 : 2      |

Here two issues are visible: The row with link n° 2 shouldn't be displayed, and category 0 for link n° 3 should be present, even if its count is zero.

Thanks to all those who will try to help me.

Alberto

CodePudding user response:

You can use crosstab and stack:

(pd.crosstab(df['link'], df['category'])
   .reindex(columns=[0, 1], fill_value=0) # safer in case no link has 1 
   .stack().reset_index(name='counts')
)

output:

   link  category  counts
0     1         0       2
1     1         1       1
2     2         0       1
3     2         1       0
4     3         0       0
5     3         1       2

To keep only the ids with at least one 1:

(pd.crosstab(df['link'], df['category'])
   .reindex(columns=[0, 1], fill_value=0)
   .loc[lambda d: d[1].gt(0)]
   .stack().reset_index(name='counts')
)

output:

   link  category  counts
0     1         0       2
1     1         1       1
2     3         0       0
3     3         1       2

CodePudding user response:

here is another way to do it

df.groupby(['link', 'category'])['category'].agg(count='count').reset_index()

    link    category    count
0      1           0        2
1      1           1        1
2      2           0        1
3      3           1        2
  • Related