Home > Software design >  Merge together Longitudinal data
Merge together Longitudinal data

Time:10-18

I collected data in longitudinal format(wide) and this data was being captured basing on different events. Here is a sample of how the data is spread out.

import pandas as pd
  
# intialise data of lists.
data = {'id':['45001', '45001', '45001', '45001'],
        'events':['preliminary', "24hour", '48hr', 'day30_review'],
        'date_birth':['26-01-1990', "Nan", 'Nan', 'Nan'],
        'full_name':['John', "Nan", 'Nan', 'Nan'],
        'adm_diagnosis':['Nan', "Severe pneumonia", "mild pneumonia", 'Nan'],
        'sec_diagnosis':['Nan', "Nan", 'Nan', 'Discharged'],
        
        
        }
  
# Create DataFrame
df = pd.DataFrame(data)

Now at Analysis stage, I did like to merge this data into one combined dataframe so that to reduce the number on rows. Notice that Id is duplicated a couple of times so that to cover all events and an event with remain empty until data for it is captured.

Based on shared sample, is it possible to merge the rows possibly on preliminary events so that I reduce duplicate id?

CodePudding user response:

is that something that you're looking for?

df.replace('Nan', '', inplace=True)
df.groupby('id', as_index=False).agg(list)

id  events  date_birth  full_name   adm_diagnosis   sec_diagnosis
0   45001   [preliminary, 24hour, 48hr, day30_review]   [26-01-1990, , , ]  [John, , , ]    [, Severe pneumonia, mild pneumonia, ]  [, , , Discharged]

CodePudding user response:

As I understand it, what makes most sense to me is to have one row in the DF for each person and columns with each item of data so any NaN would be for data not (yet) available. If this is correct the code below would provide it, although it looks rather messy.

df = pd.DataFrame({'id': [data['id'][0]],
                   'full_name':[data['full_name'][0]],
                   'DoB':[data['date_birth'][0]],
                   'prelim':[data['adm_diagnosis'][0]],
                   '24h':[data['adm_diagnosis'][1]],
                   '48h':[data['adm_diagnosis'][2]],
                   'day30':[data['adm_diagnosis'][2]],
                   'sec_diag':[data['sec_diagnosis'][3]],
                   })
  • Related