Removing invalid values to correctly recreate cumulative data in pandas

Time:10-31

I have a data set with statistics that I collect from text. The processing method sometimes does not work correctly, and I need to correct its output. I know the values are supposed to be cumulative, but some of them come out wrong.

The data is a time series that should accumulate over time. Right now I'm getting the following (sample snippet):

df
date         value
2021-07-20   21347.0
2021-07-24   21739.0
2021-08-02   22.0
2021-08-03   22.0
2021-08-06   22947.0
2021-08-17   4.0
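
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample frame. The values are copied from the snippet above; parsing `date` as a datetime column is an assumption, since the original dtype is not shown.

```python
import pandas as pd

# Sample rows copied from the snippet above.
# Converting `date` to datetime is an assumption about the original dtype.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-07-20", "2021-07-24", "2021-08-02",
        "2021-08-03", "2021-08-06", "2021-08-17",
    ]),
    "value": [21347.0, 21739.0, 22.0, 22.0, 22947.0, 4.0],
})
print(df)
```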

As you can see, the data is cumulative, but some values are clearly wrong. I would like such values to be converted to NaN.

How can I do that? The final result is expected to be as follows:

df
date         value
2021-07-20   21347.0
2021-07-24   21739.0
2021-08-02   nan
2021-08-03   nan
2021-08-06   22947.0
2021-08-17   nan

CodePudding user response:

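You can do that using numpy. This compares every value with the first row, so it assumes the first value is valid and that no invalid value exceeds it:

import numpy as np
df['value'] = np.where(df['value'] < df['value'].iloc[0], np.nan, df['value'])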

Output:

   date         value
0  2021-07-20   21347.0
1  2021-07-24   21739.0
2  2021-08-02   nan
3  2021-08-03   nan
4  2021-08-06   22947.0
5  2021-08-17   nan
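
The comparison above only checks against the first row, so it could miss an invalid value that is larger than the first row but smaller than a later valid one. A possible variant, sketched here under the assumption that invalid values are always too low (never too high), compares each value with the running maximum instead. The name `running_max` is only illustrative.

```python
import pandas as pd

# Sample data from the question (dates kept as plain strings here).
df = pd.DataFrame({
    "date": ["2021-07-20", "2021-07-24", "2021-08-02",
             "2021-08-03", "2021-08-06", "2021-08-17"],
    "value": [21347.0, 21739.0, 22.0, 22.0, 22947.0, 4.0],
})

# Largest value seen so far, including the current row.
running_max = df["value"].cummax()

# Keep a value only if it is not below the running maximum; otherwise set NaN.
# Assumption: bad readings are too low. A single bad reading that is too high
# would raise the running maximum and hide the valid values after it.
df["value"] = df["value"].where(df["value"] >= running_max)
print(df)
```

On the sample data this should blank out the same three rows as the expected result in the question.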

CodePudding user response:

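You can try comparing each value with the previous row. The first row has no previous value, so it is kept explicitly:

import numpy as np
df['check'] = df['value'].shift(1)  # previous row's value (NaN for the first row)
# keep the first row and any value larger than the one before it
df['value'] = np.where(df['check'].isna() | (df['value'] > df['check']), df['value'], np.nan)
df = df.drop(columns='check')  # remove the helper column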
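
Whichever approach is used, it may be worth checking that the cleaned column really is cumulative. A small sketch, assuming the cleaned values look like the expected result in the question (the `cleaned` series below is a hypothetical stand-in for `df['value']` after cleaning):

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned column, matching the expected result in the question.
cleaned = pd.Series([21347.0, 21739.0, np.nan, np.nan, 22947.0, np.nan])

# Ignore the NaN placeholders and check that the remaining values never decrease.
is_cumulative = cleaned.dropna().is_monotonic_increasing
print(is_cumulative)
```

If this prints `False`, some invalid value slipped through, for example a low reading that happened to be larger than the low reading before it.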