replacing/re-assign pandas value with new value

Time:11-04

I want to replace values in a pandas column with corrected values. My current data:

20000123
19850123
19880112
19951201
19850123
20190821
20000512
19850111
19670133
19850123

As you can see, one value is 19670133 (YYYYMMDD), which is not a valid date since no month has 33 days. So I want to reassign it to the end of the month. Computing the end-of-month date works, but replacing the old value with the new one is where I run into a problem. This is what I've tried:

for x in df_tmp_customer['date']:
    try:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x), axis=1)
    except Exception:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x[0:6] + "01") + pd.offsets.MonthEnd(n=0), axis=1)

This is the part that produces the end of the month:

pd.to_datetime(x[0:6] + "01") + pd.offsets.MonthEnd(n=0)
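
As a quick check of that expression on a single value, here is a minimal sketch. The helper name `to_month_end` is made up for illustration, and it assumes the input is an 8-character YYYYMMDD string whose year and month are valid.

```python
import pandas as pd


def to_month_end(x: str) -> pd.Timestamp:
    # Hypothetical helper: keep YYYYMM, start from day 01,
    # then roll forward to the last day of that month.
    first_of_month = pd.to_datetime(x[0:6] + "01", format="%Y%m%d")
    return first_of_month + pd.offsets.MonthEnd(n=0)


print(to_month_end("19670133"))
```

With `n=0`, the offset only moves a date that is not already a month end, so the first of the month is rolled forward within the same month.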

CodePudding user response:

This is probably not efficient on a large dataset, but it can be done using pendulum.parse():

import pendulum
from datetime import date


def parse_dates(x: str) -> date:
    # Step the numeric value back by one until it parses as a valid date.
    i = 0
    while True:
        try:
            return pendulum.parse(str(int(x) - i)).date()
        except ValueError:
            i += 1


df["date"] = df["date"].apply(lambda x: parse_dates(x))
print(df)

         date
0  2000-01-23
1  1985-01-23
2  1988-01-12
3  1995-12-01
4  1985-01-23
5  2019-08-21
6  2000-05-12
7  1985-01-11
8  1967-01-31
9  1985-01-23
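
If adding pendulum is not an option, the same clamping can be sketched with the standard library alone. Note this is a different technique from the loop above: instead of counting down until the value parses, it caps the day at the month's length using `calendar.monthrange`. The function name `clamp_to_month_end` is hypothetical, and the sketch assumes 8-digit YYYYMMDD strings with a valid month and a day of at least 01.

```python
import calendar
from datetime import date


def clamp_to_month_end(x: str) -> date:
    # Split the YYYYMMDD string into its numeric parts.
    year, month, day = int(x[:4]), int(x[4:6]), int(x[6:8])
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    # Days past the end of the month are capped at the last valid day.
    return date(year, month, min(day, last_day))


print(clamp_to_month_end("19670133"))
```

It would be applied the same way, for example `df["date"] = df["date"].apply(clamp_to_month_end)`.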

CodePudding user response:

For a vectorized solution, you can use:

# try to convert to YYYYMMDD
date1 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')

# get rows for which conversion failed
m = date1.isna()

# try to get end of month
date2 = pd.to_datetime(df.loc[m, 'date'].str[:6], format='%Y%m', errors='coerce').add(pd.offsets.MonthEnd())

# Combine both
df['date2'] = date1.fillna(date2)

NB. This assumes df['date'] is of string dtype. If it is of integer dtype instead, use df.loc[m, 'date'].floordiv(100) in place of df.loc[m, 'date'].str[:6].
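
If the dtype is uncertain, one possible way to sidestep the question is to cast the column to string first, so that the `.str` slicing applies either way. The following is a sketch under that assumption, using a few of the sample values from the question stored as integers; it is a variation on the approach above, not a replacement for it.

```python
import pandas as pd

# A few sample values from the question, stored as integers here.
df = pd.DataFrame({"date": [20000123, 19670133, 19850111]})

# Cast to string so the .str accessor works for either dtype.
s = df["date"].astype(str)

# First pass: strict YYYYMMDD parsing; invalid dates become NaT.
date1 = pd.to_datetime(s, format="%Y%m%d", errors="coerce")
m = date1.isna()

# Second pass, failed rows only: parse YYYYMM, then move to the month end.
date2 = pd.to_datetime(s[m].str[:6], format="%Y%m", errors="coerce") + pd.offsets.MonthEnd()

# fillna aligns on the index, so only the failed rows are filled.
df["date2"] = date1.fillna(date2)
print(df)
```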

Output:

       date      date2
0  20000123 2000-01-23
1  19850123 1985-01-23
2  19880112 1988-01-12
3  19951201 1995-12-01
4  19850123 1985-01-23
5  20190821 2019-08-21
6  20000512 2000-05-12
7  19850111 1985-01-11
8  19670133 1967-01-31 # invalid replaced by end of month
9  19850123 1985-01-23