I have a dataset with null values in the column 'revenues_from_appointment'
Dataset
appointment_date | patient_id | practitioner_id | appointment_duration_min | revenues_from_appointment |
---|---|---|---|---|
2021-06-28 | 42734 | 748 | 30 | 90.0 |
2021-06-29 | 42737 | 747 | 60 | 150.0 |
2021-07-01 | 42737 | 747 | 60 | NaN |
2021-07-03 | 42736 | 748 | 30 | 60.0 |
2021-07-03 | 42735 | 747 | 15 | 42.62 |
2021-07-04 | 42734 | 748 | 30 | NaN |
2021-07-05 | 42734 | 748 | 30 | 100.0 |
2021-07-10 | 42738 | 747 | 15 | 50.72 |
2021-08-12 | 42739 | 748 | 30 | 73.43 |
I wish to replace NULL values by the mean value of rows where "patient_id, practitioner_id, appointment_duration_min" is the same.
I did it using pandas dataframe,
df['revenues_from_appointment'].fillna(df.groupby(['patient_id','practitioner_id','appointment_duration_min'])['revenues_from_appointment'].transform('mean'), inplace = True)
How can we obtain the same result by using SQL?
Final Output
appointment_date | patient_id | practitioner_id | appointment_duration_min | revenues_from_appointment |
---|---|---|---|---|
2021-06-28 | 42734 | 748 | 30 | 90.0 |
2021-06-29 | 42737 | 747 | 60 | 150.0 |
2021-07-01 | 42737 | 747 | 60 | 150.0 |
2021-07-03 | 42736 | 748 | 30 | 60.0 |
2021-07-03 | 42735 | 747 | 15 | 42.62 |
2021-07-04 | 42734 | 748 | 30 | 95.0 |
2021-07-05 | 42734 | 748 | 30 | 100.0 |
2021-07-10 | 42738 | 747 | 15 | 50.72 |
2021-08-12 | 42739 | 748 | 30 | 73.43 |
CodePudding user response:
You can use the AVG
window function, that will partition on the three column of interest and replace null values using the COALESCE
function:
SELECT appointment_date,
patient_id,
practitioner_id,
appointment_duration_min,
COALESCE(revenues_from_appointment,
AVG(revenues_from_appointment) OVER(PARTITION BY patient_id,
practitioner_id,
appointment_duration_min))
FROM tab
Try it here.