Home > OS >  Compare two values and add them to list if they match - code improvement
Compare two values and add them to list if they match - code improvement

Time:02-09

I am working with medical history data and need to figure out to find the series of diagnoses a patient has had. I have a large database which includes data on unique ID, diagnosis, time of contact with healthcare, and so on.

I made some dummy data here to illustrate:

import pandas as pd
import numpy as np

columns = ["ID","DIAG","TYPE","IN","OUT","GENDER","DOB"]
diags = pd.DataFrame(np.random.randint(0,100,size=(2000,7)),columns=columns)
diags_counter = diags.groupby("ID")["DIAG"].count().to_frame()
diags_counter.reset_index(level=0,inplace=True)

I reset the index, since the IDs in the database are more complex, and using .loc wouldn't work.

My idea was to make a list or dictionary of a list of dataframes for each patient, i.e. a patient might have only one diagnosis, while another might have 2 or more. Following code is working, but extremely slow, and since I have over half a million patients, this is not going to work:

diags_seq = []
for i in range(0,len(diags_counter)):
  X= []
  for j in range(0,len(diags)):
    if diags_counter.ID.iloc[i] == diags.ID.iloc[j]:
      X.append(diags.iloc[j])
  diags_seq.append(X)
  print(f"\r{i 1} of {len(diags_counter)} found", end="")

Any help to how to approach this otherwise would be greatly appreciated :)

CodePudding user response:

I think this will be fine:

unique_id = diags.ID.unique()
dict_of_specifics_id = {}
for id in unique_id:
   dict_of_specifics_id[id] = {}
   dict_of_specifics_id[id]['id_counter'] = 0
   dict_of_specifics_id[id]['diag_list'] = []
for index, row in diags.iterrows():
   dict_of_specifics_id[row.ID]['id_counter'] =1
   dict_of_specifics_id[row.ID]['diag_list'].append(row.DIAG)
dict_of_specifics_id   

Output:

{21: {'id_counter': 16,
  'diag_list': [45, 41, 92, 91, 62, 54, 16, 18, 23, 18, 0, 47, 9, 45, 2, 61]},

Where 21 is ID and diag_list is list of diagnosis to this ID. id_counter is actualy len of diag_list.

CodePudding user response:

So after some tweaking, I ended up with using the following:

dict_of_specifics_id = {}
for id in unique_id:
   dict_of_specifics_id[id] = {}
   dict_of_specifics_id[id]['diag_list'] = []
for index, row in diags.iterrows():
   dict_of_specifics_id[row.ID]['diag_list'].append(row)

seq_diags_list = list(dict_of_specifics_id.values())

seq_diags = []
for i in range(0,len(seq_diags_list):
   X = list(seq_diags_list[i].values())
   df = pd.concat(X[0],axis=1).T
   seq_diags.append(df)

This ends up giving a list of dataframes, where the length of each dataframe is represented in the number of diagnoses associated with each patient/id.

PS: It takes around 10 seconds to run for around 600,000 data points, so much better than before.

  •  Tags:  
  • Related