I have a csv file containing infection data for hospital patients from 2007-present, imported into a pandas dataframe. Each entry contains standard patient details like age, bmi etc. If the patient has an infection, they will also have data on: the date the infection was found, what microbe it is (number coded), and whether it is gram positive/negative/fungi/unknown. A patient can have at most 3 different microbes. Out of 18k entries, about 400 have infections. You can see a (simplified) snapshot of the dataset below. (Standard patient details have been removed as they're not relevant to my question)
PatientID DateOp DateOnset 1_CODE 1_TYPE 2_CODE 2_TYPE 3_CODE 3_TYPE
1 21/08/2011 28/08/2011 276 Gram Neg
2 31/08/2011 08/10/2011 922 Fungi
3 01/12/2011 15/12/2011 113 Gram Neg 780 Gram Pos
4 01/12/2011 08/12/2011 113 Gram Neg
5 20/03/2012 24/03/2012 424 Gram Neg 276 Gram Neg
6 16/12/2011 21/12/2011 221 Gram Neg
7 23/04/2012 26/04/2012 922 Gram Pos 311 Gram Neg 780 Gram Pos
8 19/05/2012 29/05/2012 772 Gram Pos
9 08/02/2012 13/02/2012 Unknown
I want to see how many of each microbe type (pos/neg/fungi/other) has occurred in each month since 2007 (using DateOnset
as the date), and plot this in a suitable chart - for example a stacked bar-chart coloured by microbe type.
I thought I could do this by creating new columns in ssidf
: gram_pos
, gram_neg
, fungi
, other
. Then, iterate through 1_TYPE
, 2_TYPE
, 3_TYPE
, and if there is an infection in a row, add 1
to the necessary column(s). This is how I thought I could achieve this:
for row in ssidf.1_TYPE:
if ssidf.1_TYPE[row].str.contains('Gram Pos') is True:
ssidf.gram_pos[row] == 1
elif ssidf.1_TYPE[row].str.contains('Gram Neg') is True:
ssidf.gram_neg[row] == 1
elif ssidf.1_TYPE[row].str.contains('Fungi') is True:
ssidf.fungi[row] == 1
else:
ssidf.unknown[row] == 1
This is my expected result:
DateOnset gram_pos gram_neg fungi unknown
28/08/2011 0 1 0 0
08/10/2011 0 0 1 0
15/12/2011 1 1 0 0
08/12/2011 0 1 0 0
24/03/2012 0 2 0 0
21/12/2011 1 0 0 0
26/04/2012 2 1 0 0
29/05/2012 1 0 0 0
13/02/2012 0 0 0 1
Then, I would sum the totals of gram_pos
, gram_neg
, fungi
, other
per month, and represent them in a stacked bar chart.
However, I'm falling at the first hurdle - the for loop. I keep getting: KeyError: 'Gram Neg'
. I haven't even tried to get it to iterate across 2_TYPE
and 3_TYPE
yet, because the first column isn't working. Can anyone see what the problem is and suggest how to fix?
CodePudding user response:
Note that iterating in pandas is not recommended. It's more idiomatic to reshape and cross-tabulate.
melt
theDateOnset
/TYPE
columns withDateOnset
as the id:melted = ssidf[['DateOnset', '1_TYPE', '2_TYPE', '3_TYPE']].melt('DateOnset') # DateOnset variable value # 0 28/08/2011 1_TYPE Gram Neg # 1 08/10/2011 1_TYPE Fungi # 2 15/12/2011 1_TYPE Gram Neg # .. ... ... ... # 24 26/04/2012 3_TYPE Gram Pos # 25 29/05/2012 3_TYPE NaN # 26 13/02/2012 3_TYPE NaN
crosstab
the meltedDateOnset
vs meltedvalue
:xtab = pd.crosstab(melted['DateOnset'], melted['value']) # value Fungi Gram Neg Gram Pos Unknown # DateOnset # 08/10/2011 1 0 0 0 # 08/12/2011 0 1 0 0 # 13/02/2012 0 0 0 1 # 15/12/2011 0 1 1 0 # 21/12/2011 0 1 0 0 # 24/03/2012 0 2 0 0 # 26/04/2012 0 1 2 0 # 28/08/2011 0 1 0 0 # 29/05/2012 0 0 1 0
Aggregate the monthly counts using
groupby.sum
(optionally renaming the index withrename_axis
):xtab.index = pd.to_datetime(xtab.index) (xtab.groupby([xtab.index.year, xtab.index.month]).sum() .rename_axis(['YearOnset', 'MonthOnset'])) # value Fungi Gram Neg Gram Pos Unknown # YearOnset MonthOnset # 2011 8 1 2 0 0 # 12 0 2 1 0 # 2012 2 0 0 0 1 # 3 0 2 0 0 # 4 0 1 2 0 # 5 0 0 1 0