This is a simplified version of my problem: given this dataset
| link | category |
| ---- | -------- |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
| 3 | 1 |
I would like to obtain the following:
| link | category counts |
| ---- | --------------- |
| 1 | 0 : 2 |
| | 1 : 1 |
| ---- | --------------- |
| 3 | 0 : 0 |
| | 1 : 2 |
Basically, i need to group data by link
, then if the link occurs more than once, i need to apply the value_counts
function to category
, such as each level is displayed (e.g. for link n° 3 both levels of category are displayed, even if category 0 never occurs).
My original problem refers to a bigger dataset with many more links and categories, but solving this simplified version should help me a lot.
Here is the code for generating the simplified dataset:
df = pd.DataFrame({'link' : [1,1,1,2,3,3],
'category' : [0,0,1,0,1,1]})
df
So far I used different strategies to reach my goal without success. The closest one seems to be this:
df.groupby('link')['category'].value_counts(sort=False)
That gives the following:
| link | category counts |
| ---- | --------------- |
| 1 | 0 : 2 |
| | 1 : 1 |
| ---- | --------------- |
| 2 | 0 : 1 |
| ---- | --------------- |
| 3 | 1 : 2 |
Here two issues are visible: The row with link n° 2 shouldn't be displayed, and category 0 for link n° 3 should be present, even if its count is zero.
Thanks to all those who will try to help me.
Alberto
CodePudding user response:
You can use crosstab
and stack
:
(pd.crosstab(df['link'], df['category'])
.reindex(columns=[0, 1], fill_value=0) # safer in case no link has 1
.stack().reset_index(name='counts')
)
output:
link category counts
0 1 0 2
1 1 1 1
2 2 0 1
3 2 1 0
4 3 0 0
5 3 1 2
To keep only the ids with at least one 1
:
(pd.crosstab(df['link'], df['category'])
.reindex(columns=[0, 1], fill_value=0)
.loc[lambda d: d[1].gt(0)]
.stack().reset_index(name='counts')
)
output:
link category counts
0 1 0 2
1 1 1 1
2 3 0 0
3 3 1 2
CodePudding user response:
here is another way to do it
df.groupby(['link', 'category'])['category'].agg(count='count').reset_index()
link category count
0 1 0 2
1 1 1 1
2 2 0 1
3 3 1 2