Home > Software engineering >  Percentage of element within groupby
Percentage of element within groupby

Time:10-20

I have a CSV with the following data:

          Block_Number Block_Type  Segment_Duration Speaker_ID
0                1      Pause              1.21        FUZ
1                1      Pause              1.09        FUZ
2                1       AICF              2.06        FAN
3                1       AICF              1.23        FUZ
4                1       AICF              1.58        FAN
...            ...        ...               ...        ...
1280            32      Pause              1.03        FUZ
1281            32      Pause              1.16        SIL
1282            32      Pause              2.18        OLN
1283            32        AMF              0.98        FAN
1284            32        AMF              1.78        FAN

I have grouped them by Block Number, Block Type and Speaker ID with this code and filtered the result with the 'CHN' ones only, which gives:

     Block_Number Block_Type Speaker_ID  Count
1              29        CIC        CHN     98
6               9       AICF        CHN     30
7              22        CIC        CHN     29
10             19        CIC        CHN     21
14             13       AICF        CHN     15
28             27        CIC        CHN      9
...            ...        ...      ...      ...   

How would I be able to get the percentage of CHN elements within a block? For example, in block 29, there are 98 counts of 'CHN' out of 120 other labels in this block. So in block 29, there would be 81.67% of CHN. I would like to do this for each of the 32 blocks.

Thank you.

CodePudding user response:

You could do something like the following

category = "Block_Number"
col = "Speaker_ID"
value = "FUZ"

d = {'Block_Number': {'0': '1', '1': '1', '2': '1', '3': '1', '4': '1', '1280': '32', '1281': '32', '1282': '32', '1283': '32', '1284': '32'}, 'Block_Type': {'0': 'Pause', '1': 'Pause', '2': 'AICF', '3': 'AICF', '4': 'AICF', '1280': 'Pause', '1281': 'Pause', '1282': 'Pause', '1283': 'AMF', '1284': 'AMF'}, 'Segment_Duration': {'0': '1.21', '1': '1.09', '2': '2.06', '3': '1.23', '4': '1.58', '1280': '1.03', '1281': '1.16', '1282': '2.18', '1283': '0.98', '1284': '1.78'}, 'Speaker_ID': {'0': 'FUZ', '1': 'FUZ', '2': 'FAN', '3': 'FUZ', '4': 'FAN', '1280': 'FUZ', '1281': 'SIL', '1282': 'OLN', '1283': 'FAN', '1284': 'FAN'}}
df = pd.DataFrame(d)
    
percent_by_block = (
     df[df[col] == value]
     .groupby(by=category)
     .count()[col]
)/(
      df.groupby(by=category)
    .count()[col])

Here, I take df to be the example dataframe

     Block_Number Block_Type Segment_Duration Speaker_ID
0               1      Pause             1.21        FUZ
1               1      Pause             1.09        FUZ
2               1       AICF             2.06        FAN
3               1       AICF             1.23        FUZ
4               1       AICF             1.58        FAN
1280           32      Pause             1.03        FUZ
1281           32      Pause             1.16        SIL
1282           32      Pause             2.18        OLN
1283           32        AMF             0.98        FAN
1284           32        AMF             1.78        FAN

The resulting series percent_by_block is

Block_Number
1     0.6
32    0.2

The results here reflect that 60% (i.e. .6) of the rows with Block_Number 1 have Speaker_ID FUZ, whereas 20% of the rows with Block_Number 32 have Speaker_ID FUZ.

  • Related