I have a large csv file with the following format (example), the report_date is currently empty:
| ids | disease_code | report_date |
| --- | ------------ | ----------- |
| 10 | I202 | |
| 11 | I232 | |
| 11 | I242 | |
I generated a list of tuples from a data source like the following:
[(10, ['I202'], 2021-10-22), (11, ['I232', 'I242'], 2021-11-22), (11, ['I232', 'I242'], 2021-11-12),.....]
The above order is patient_id, disease_code and the reported_date (The dates are in order corresponding to the disease), for a patient who has more than one disease, the reported date was unfortunately separated into two tuples. Now I want to fill the report_date column by matching the first two values of the tuple with the current csv, like this:
| ids | disease_code | report_date |
| --- | ------------ | ----------- |
| 10 | I202 | 2021-10-22 |
| 11 | I232 | 2021-11-22 |
| 11 | I242 | 2021-11-12 |
I tried to use a nested loop but it seems like it will take 480 hours to complete. I believe there is a more simple answer but I could not figure it out. Any hint would be appreciated.
CodePudding user response:
First, you can create a dataframe with your data. You'll see that the column "disease_code"
contains a list of values, just as you mentioned:
>> df = pd.DataFrame(
[(10, ['I202'], "2021-10-22"), (11, ['I232', 'I242'], "2021-11-22"), (11, ['I232', 'I242'], "2021-11-12")],
columns=["ids", "disease_code", "report_date"],
)
>> df["report_date"] = pd.to_datetime(df["report_date"])
>> df
ids disease_code report_date
0 10 [I202] 2021-10-22
1 11 [I232, I242] 2021-11-22
2 11 [I232, I242] 2021-11-12
Now you need to separate the values in the "disease_code"
column by repeating the values in the other columns... pd.DataFrame.explode
does exactly that. This method transforms values in a list-like column to multiple rows:
>> df.explode(["disease_code"]) # Explode the "disease_code" column
ids disease_code report_date
0 10 I202 2021-10-22
1 11 I232 2021-11-22
1 11 I242 2021-11-22
2 11 I232 2021-11-12
2 11 I242 2021-11-12
CodePudding user response:
For new DataFrame use list comprehension:
L = [(10, ['I202'], '2021-10-22'),
(11, ['I232', 'I242'], '2021-11-22'),
(11, ['I232', 'I242'], '2021-11-12')]
df1 = pd.DataFrame([(a, x, c) for a, b, c in L for x in b],
columns=["ids", "disease_code", "report_date"])
print (df1)
ids disease_code report_date
0 10 I202 2021-10-22
1 11 I232 2021-11-22
2 11 I242 2021-11-22
3 11 I232 2021-11-12
4 11 I242 2021-11-12
Then DataFrame.merge
to original DataFrame df
, but because there are duplicates in ids, disease_code
columns first remove them:
print (df)
ids disease_code report_date
0 10 I202 NaN
1 11 I232 NaN
2 11 I242 NaN
print (df1.drop_duplicates(['ids','disease_code']))
ids disease_code report_date
0 10 I202 2021-10-22
1 11 I232 2021-11-22
2 11 I242 2021-11-22
df = (df.drop('report_date', axis=1)
.merge(df1.drop_duplicates(['ids','disease_code']),
on=['ids','disease_code']))
print (df)
ids disease_code report_date
0 10 I202 2021-10-22
1 11 I232 2021-11-22
2 11 I242 2021-11-22