Home > Blockchain >  Python count occurrences of different item types
Python count occurrences of different item types

Time:12-07

I have a csv file containing infection data for hospital patients from 2007-present, imported into a pandas dataframe. Each entry contains standard patient details like age, bmi etc. If the patient has an infection, they will also have data on: the date the infection was found, what microbe it is (number coded), and whether it is gram positive/negative/fungi/unknown. A patient can have at most 3 different microbes. Out of 18k entries, about 400 have infections. You can see a (simplified) snapshot of the dataset below. (Standard patient details have been removed as they're not relevant to my question)

  PatientID DateOp  DateOnset   1_CODE  1_TYPE  2_CODE  2_TYPE  3_CODE  3_TYPE
1   21/08/2011  28/08/2011  276 Gram Neg                
2   31/08/2011  08/10/2011  922 Fungi               
3   01/12/2011  15/12/2011  113 Gram Neg    780 Gram Pos        
4   01/12/2011  08/12/2011  113 Gram Neg                    
5   20/03/2012  24/03/2012  424 Gram Neg    276 Gram Neg        
6   16/12/2011  21/12/2011  221 Gram Neg                
7   23/04/2012  26/04/2012  922 Gram Pos    311 Gram Neg    780 Gram Pos
8   19/05/2012  29/05/2012  772 Gram Pos                
9   08/02/2012  13/02/2012      Unknown

I want to see how many of each microbe type (pos/neg/fungi/other) has occurred in each month since 2007 (using DateOnset as the date), and plot this in a suitable chart - for example a stacked bar-chart coloured by microbe type.

I thought I could do this by creating new columns in ssidf: gram_pos, gram_neg, fungi, other. Then, iterate through 1_TYPE, 2_TYPE, 3_TYPE, and if there is an infection in a row, add 1 to the necessary column(s). This is how I thought I could achieve this:

for row in ssidf.1_TYPE:
   if ssidf.1_TYPE[row].str.contains('Gram Pos') is True:
      ssidf.gram_pos[row] == 1
   elif ssidf.1_TYPE[row].str.contains('Gram Neg') is True:
      ssidf.gram_neg[row] == 1
   elif ssidf.1_TYPE[row].str.contains('Fungi') is True:
      ssidf.fungi[row] == 1
   else:
      ssidf.unknown[row] == 1

This is my expected result:

   DateOnset    gram_pos    gram_neg    fungi   unknown
28/08/2011  0   1   0   0
08/10/2011  0   0   1   0
15/12/2011  1   1   0   0
08/12/2011  0   1   0   0
24/03/2012  0   2   0   0
21/12/2011  1   0   0   0
26/04/2012  2   1   0   0
29/05/2012  1   0   0   0
13/02/2012  0   0   0   1

Then, I would sum the totals of gram_pos, gram_neg, fungi, other per month, and represent them in a stacked bar chart.

However, I'm falling at the first hurdle - the for loop. I keep getting: KeyError: 'Gram Neg'. I haven't even tried to get it to iterate across 2_TYPE and 3_TYPE yet, because the first column isn't working. Can anyone see what the problem is and suggest how to fix?

CodePudding user response:

Note that iterating in pandas is not recommended. It's more idiomatic to reshape and cross-tabulate.

  1. melt the DateOnset/TYPE columns with DateOnset as the id:

    melted = ssidf[['DateOnset', '1_TYPE', '2_TYPE', '3_TYPE']].melt('DateOnset')
    
    #      DateOnset variable     value
    # 0   28/08/2011   1_TYPE  Gram Neg
    # 1   08/10/2011   1_TYPE     Fungi
    # 2   15/12/2011   1_TYPE  Gram Neg
    # ..         ...      ...       ...
    # 24  26/04/2012   3_TYPE  Gram Pos
    # 25  29/05/2012   3_TYPE       NaN
    # 26  13/02/2012   3_TYPE       NaN
    
  2. crosstab the melted DateOnset vs melted value:

    xtab = pd.crosstab(melted['DateOnset'], melted['value'])
    
    # value       Fungi  Gram Neg  Gram Pos  Unknown
    # DateOnset                                     
    # 08/10/2011      1         0         0        0
    # 08/12/2011      0         1         0        0
    # 13/02/2012      0         0         0        1
    # 15/12/2011      0         1         1        0
    # 21/12/2011      0         1         0        0
    # 24/03/2012      0         2         0        0
    # 26/04/2012      0         1         2        0
    # 28/08/2011      0         1         0        0
    # 29/05/2012      0         0         1        0
    
  3. Aggregate the monthly counts using groupby.sum (optionally renaming the index with rename_axis):

    xtab.index = pd.to_datetime(xtab.index)
    (xtab.groupby([xtab.index.year, xtab.index.month]).sum()
         .rename_axis(['YearOnset', 'MonthOnset']))
    
    # value                 Fungi  Gram Neg  Gram Pos  Unknown
    # YearOnset MonthOnset                                    
    # 2011      8               1         2         0        0
    #           12              0         2         1        0
    # 2012      2               0         0         0        1
    #           3               0         2         0        0
    #           4               0         1         2        0
    #           5               0         0         1        0
    
  • Related