Pandas subtract dates to get a surgery patient length of stay-CodePudding

I have a dataframe of surgical activity with admission dates (ADMIDATE) and discharge dates (DISDATE). It is 600k rows by 78 columns but I have filtered it for a particular surgery. I want to calculate the length of stay and add it as a further column. Usually I use

df["los"] = (df["DISDATE"] - df["ADMIDATE"]).dt.days

I recently had to clean the data and must have done it in a different way to previously because I am now getting a negative los, eg.

DISDATE.	. ADMIDATE.	. los.
2019-12-24	2019-12-08	-43805.
2019-05-15	. 2019-03-26	50.
2019-10-11	. 2019-10-07	4.
2019-06-20	2019-06-16	4
2019-04-11	2019-04-08	3

df.info()

df.info()

    <class '`pandas`.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 78 columns):

5 ADMIDATE 5 non-null datetime64[ns] 28 DISDATE 5 non-null datetime64[ns]

<class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 78 columns):

5   ADMIDATE          5 non-null      datetime64[ns]
28  DISDATE           5 non-null      datetime64[ns]

I am not sure how to ask the right questions to the problem, - and why its only affecting some rows. In cleansing the data some of the DISDATES had to be populated from another column (also a date column) becuase they were incomplete, and I wonder if it is these which are negative due to some retention of the orig data soemhow even though printing the new DISDATE looks ok.

CodePudding user response：

Your sample works well with the right output (16 days for the first row)

Can you try that and check if the problem persists:

import io

data = df[['DISDATE', 'ADMIDATE']].to_csv()
test = pd.read_csv(io.StringIO(data), index_col=0,
                   parse_dates=['DISDATE', 'ADMIDATE'])

print(test['DISDATE'].sub(test['ADMIDATE']).dt.days)

Output:

0    16
1    50
2     4
3     4
4     3
dtype: int64

Update

To debug your bad dates, try:

df.loc[pd.to_datetime(df['ADMIDATE'], errors='coerce').isna(), 'ADMIDATE']

You should see rows where values are not a right date.