I have a dataframe of surgical activity with admission dates (ADMIDATE) and discharge dates (DISDATE). It is 600k rows by 78 columns but I have filtered it for a particular surgery. I want to calculate the length of stay and add it as a further column. Usually I use
df["los"] = (df["DISDATE"] - df["ADMIDATE"]).dt.days
I recently had to clean the data and must have done it in a different way to previously because I am now getting a negative los, eg.
DISDATE. | . ADMIDATE. | . los. |
---|---|---|
2019-12-24 | 2019-12-08 | -43805. |
2019-05-15 | . 2019-03-26 | 50. |
2019-10-11 | . 2019-10-07 | 4. |
2019-06-20 | 2019-06-16 | 4 |
2019-04-11 | 2019-04-08 | 3 |
df.info()
df.info()
<class '`pandas`.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 78 columns):
5 ADMIDATE 5 non-null datetime64[ns] 28 DISDATE 5 non-null datetime64[ns]
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 78 columns):
5 ADMIDATE 5 non-null datetime64[ns]
28 DISDATE 5 non-null datetime64[ns]
I am not sure how to ask the right questions to the problem, - and why its only affecting some rows. In cleansing the data some of the DISDATES had to be populated from another column (also a date column) becuase they were incomplete, and I wonder if it is these which are negative due to some retention of the orig data soemhow even though printing the new DISDATE looks ok.
CodePudding user response:
Your sample works well with the right output (16 days for the first row)
Can you try that and check if the problem persists:
import io
data = df[['DISDATE', 'ADMIDATE']].to_csv()
test = pd.read_csv(io.StringIO(data), index_col=0,
parse_dates=['DISDATE', 'ADMIDATE'])
print(test['DISDATE'].sub(test['ADMIDATE']).dt.days)
Output:
0 16
1 50
2 4
3 4
4 3
dtype: int64
Update
To debug your bad dates, try:
df.loc[pd.to_datetime(df['ADMIDATE'], errors='coerce').isna(), 'ADMIDATE']
You should see rows where values are not a right date.