How to quickly add a large list of value to the corresponding python pandas dataframe-CodePudding

I have a large csv file with the following format (example), the report_date is currently empty:

| ids | disease_code | report_date |
| --- | ------------ | ----------- |
| 10  |    I202      |             |
| 11  |    I232      |             |
| 11  |    I242      |             |

I generated a list of tuples from a data source like the following:

[(10, ['I202'], 2021-10-22), (11, ['I232', 'I242'], 2021-11-22), (11, ['I232', 'I242'], 2021-11-12),.....]

The above order is patient_id, disease_code and the reported_date (The dates are in order corresponding to the disease), for a patient who has more than one disease, the reported date was unfortunately separated into two tuples. Now I want to fill the report_date column by matching the first two values of the tuple with the current csv, like this:

| ids | disease_code | report_date |
| --- | ------------ | ----------- |
| 10  |    I202      | 2021-10-22  |
| 11  |    I232      | 2021-11-22  |
| 11  |    I242      | 2021-11-12  |

I tried to use a nested loop but it seems like it will take 480 hours to complete. I believe there is a more simple answer but I could not figure it out. Any hint would be appreciated.

CodePudding user response：

First, you can create a dataframe with your data. You'll see that the column "disease_code" contains a list of values, just as you mentioned:

>> df = pd.DataFrame(
    [(10, ['I202'], "2021-10-22"), (11, ['I232', 'I242'], "2021-11-22"), (11, ['I232', 'I242'], "2021-11-12")],
    columns=["ids", "disease_code", "report_date"],
)
>> df["report_date"] = pd.to_datetime(df["report_date"])
>> df
   ids  disease_code report_date
0   10        [I202]  2021-10-22
1   11  [I232, I242]  2021-11-22
2   11  [I232, I242]  2021-11-12

Now you need to separate the values in the "disease_code" column by repeating the values in the other columns... pd.DataFrame.explode does exactly that. This method transforms values in a list-like column to multiple rows:

>> df.explode(["disease_code"])  # Explode the "disease_code" column
   ids disease_code report_date
0   10         I202  2021-10-22
1   11         I232  2021-11-22
1   11         I242  2021-11-22
2   11         I232  2021-11-12
2   11         I242  2021-11-12

CodePudding user response：

For new DataFrame use list comprehension:

L = [(10, ['I202'], '2021-10-22'), 
     (11, ['I232', 'I242'], '2021-11-22'),
     (11, ['I232', 'I242'], '2021-11-12')]

df1 = pd.DataFrame([(a, x, c) for a, b, c in L for x in b], 
                   columns=["ids", "disease_code", "report_date"])
print (df1)
   ids disease_code report_date
0   10         I202  2021-10-22
1   11         I232  2021-11-22
2   11         I242  2021-11-22
3   11         I232  2021-11-12
4   11         I242  2021-11-12

Then DataFrame.merge to original DataFrame df, but because there are duplicates in ids, disease_code columns first remove them:

print (df)
   ids disease_code  report_date
0   10         I202          NaN
1   11         I232          NaN
2   11         I242          NaN

print (df1.drop_duplicates(['ids','disease_code']))
   ids disease_code report_date
0   10         I202  2021-10-22
1   11         I232  2021-11-22
2   11         I242  2021-11-22

df = (df.drop('report_date', axis=1)
        .merge(df1.drop_duplicates(['ids','disease_code']), 
               on=['ids','disease_code']))
print (df)
   ids disease_code report_date
0   10         I202  2021-10-22
1   11         I232  2021-11-22
2   11         I242  2021-11-22