Home > other >  Pandas subtract dates to get a surgery patient length of stay
Pandas subtract dates to get a surgery patient length of stay

Time:09-17

I have a dataframe of surgical activity with admission dates (ADMIDATE) and discharge dates (DISDATE). It is 600k rows by 78 columns but I have filtered it for a particular surgery. I want to calculate the length of stay and add it as a further column. Usually I use

df["los"] = (df["DISDATE"] - df["ADMIDATE"]).dt.days

I recently had to clean the data and must have done it in a different way to previously because I am now getting a negative los, eg.

DISDATE. . ADMIDATE. . los.
2019-12-24 2019-12-08 -43805.
2019-05-15 . 2019-03-26 50.
2019-10-11 . 2019-10-07 4.
2019-06-20 2019-06-16 4
2019-04-11 2019-04-08 3
df.info()

df.info()

    <class '`pandas`.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 78 columns):

5 ADMIDATE 5 non-null datetime64[ns] 28 DISDATE 5 non-null datetime64[ns]

<class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 78 columns):

5   ADMIDATE          5 non-null      datetime64[ns]
28  DISDATE           5 non-null      datetime64[ns]

I am not sure how to ask the right questions to the problem, - and why its only affecting some rows. In cleansing the data some of the DISDATES had to be populated from another column (also a date column) becuase they were incomplete, and I wonder if it is these which are negative due to some retention of the orig data soemhow even though printing the new DISDATE looks ok.

CodePudding user response:

Your sample works well with the right output (16 days for the first row)

Can you try that and check if the problem persists:

import io

data = df[['DISDATE', 'ADMIDATE']].to_csv()
test = pd.read_csv(io.StringIO(data), index_col=0,
                   parse_dates=['DISDATE', 'ADMIDATE'])

print(test['DISDATE'].sub(test['ADMIDATE']).dt.days)

Output:

0    16
1    50
2     4
3     4
4     3
dtype: int64

Update

To debug your bad dates, try:

df.loc[pd.to_datetime(df['ADMIDATE'], errors='coerce').isna(), 'ADMIDATE']

You should see rows where values are not a right date.

  • Related