Python count occurrences of different item types-CodePudding

I have a csv file containing infection data for hospital patients from 2007-present, imported into a pandas dataframe. Each entry contains standard patient details like age, bmi etc. If the patient has an infection, they will also have data on: the date the infection was found, what microbe it is (number coded), and whether it is gram positive/negative/fungi/unknown. A patient can have at most 3 different microbes. Out of 18k entries, about 400 have infections. You can see a (simplified) snapshot of the dataset below. (Standard patient details have been removed as they're not relevant to my question)

  PatientID DateOp  DateOnset   1_CODE  1_TYPE  2_CODE  2_TYPE  3_CODE  3_TYPE
1   21/08/2011  28/08/2011  276 Gram Neg                
2   31/08/2011  08/10/2011  922 Fungi               
3   01/12/2011  15/12/2011  113 Gram Neg    780 Gram Pos        
4   01/12/2011  08/12/2011  113 Gram Neg                    
5   20/03/2012  24/03/2012  424 Gram Neg    276 Gram Neg        
6   16/12/2011  21/12/2011  221 Gram Neg                
7   23/04/2012  26/04/2012  922 Gram Pos    311 Gram Neg    780 Gram Pos
8   19/05/2012  29/05/2012  772 Gram Pos                
9   08/02/2012  13/02/2012      Unknown

I want to see how many of each microbe type (pos/neg/fungi/other) has occurred in each month since 2007 (using DateOnset as the date), and plot this in a suitable chart - for example a stacked bar-chart coloured by microbe type.

I thought I could do this by creating new columns in ssidf: gram_pos, gram_neg, fungi, other. Then, iterate through 1_TYPE, 2_TYPE, 3_TYPE, and if there is an infection in a row, add 1 to the necessary column(s). This is how I thought I could achieve this:

for row in ssidf.1_TYPE:
   if ssidf.1_TYPE[row].str.contains('Gram Pos') is True:
      ssidf.gram_pos[row] == 1
   elif ssidf.1_TYPE[row].str.contains('Gram Neg') is True:
      ssidf.gram_neg[row] == 1
   elif ssidf.1_TYPE[row].str.contains('Fungi') is True:
      ssidf.fungi[row] == 1
   else:
      ssidf.unknown[row] == 1

This is my expected result:

   DateOnset    gram_pos    gram_neg    fungi   unknown
28/08/2011  0   1   0   0
08/10/2011  0   0   1   0
15/12/2011  1   1   0   0
08/12/2011  0   1   0   0
24/03/2012  0   2   0   0
21/12/2011  1   0   0   0
26/04/2012  2   1   0   0
29/05/2012  1   0   0   0
13/02/2012  0   0   0   1

Then, I would sum the totals of gram_pos, gram_neg, fungi, other per month, and represent them in a stacked bar chart.

However, I'm falling at the first hurdle - the for loop. I keep getting: KeyError: 'Gram Neg'. I haven't even tried to get it to iterate across 2_TYPE and 3_TYPE yet, because the first column isn't working. Can anyone see what the problem is and suggest how to fix?

CodePudding user response：

Note that iterating in pandas is not recommended. It's more idiomatic to reshape and cross-tabulate.

melt the DateOnset/TYPE columns with DateOnset as the id:

melted = ssidf[['DateOnset', '1_TYPE', '2_TYPE', '3_TYPE']].melt('DateOnset')

#      DateOnset variable     value
# 0   28/08/2011   1_TYPE  Gram Neg
# 1   08/10/2011   1_TYPE     Fungi
# 2   15/12/2011   1_TYPE  Gram Neg
# ..         ...      ...       ...
# 24  26/04/2012   3_TYPE  Gram Pos
# 25  29/05/2012   3_TYPE       NaN
# 26  13/02/2012   3_TYPE       NaN

crosstab the melted DateOnset vs melted value:

xtab = pd.crosstab(melted['DateOnset'], melted['value'])

# value       Fungi  Gram Neg  Gram Pos  Unknown
# DateOnset                                     
# 08/10/2011      1         0         0        0
# 08/12/2011      0         1         0        0
# 13/02/2012      0         0         0        1
# 15/12/2011      0         1         1        0
# 21/12/2011      0         1         0        0
# 24/03/2012      0         2         0        0
# 26/04/2012      0         1         2        0
# 28/08/2011      0         1         0        0
# 29/05/2012      0         0         1        0

Aggregate the monthly counts using groupby.sum (optionally renaming the index with rename_axis):

xtab.index = pd.to_datetime(xtab.index)
(xtab.groupby([xtab.index.year, xtab.index.month]).sum()
     .rename_axis(['YearOnset', 'MonthOnset']))

# value                 Fungi  Gram Neg  Gram Pos  Unknown
# YearOnset MonthOnset                                    
# 2011      8               1         2         0        0
#           12              0         2         1        0
# 2012      2               0         0         0        1
#           3               0         2         0        0
#           4               0         1         2        0
#           5               0         0         1        0