Remove days with faulty data, Pandas dataframe

Time:06-08

There are segments of readings that have faulty data, and I want to remove entire days that have at least one faulty reading. I already created a column with True or False indicating whether that segment is wrong.

An example of the dataframe is below, since it has more than 100k rows:

                           power_c  power_g  temperature  to_delete
date_time                                                          
2019-01-01 00:00:00+00:00     2985        0         10.1      False
2019-01-01 00:05:00+00:00     2258        0         10.1       True
2019-01-01 01:00:00+00:00     2266        0         10.1      False
2019-01-02 00:15:00+00:00     3016        0         10.0      False
2019-01-03 01:20:00+00:00     2265        0         10.0       True

For example, the first and second rows belong to the same hour on the same day, and one of them has True, so I want to delete all rows for that day.

The data always comes in 5-minute intervals, so I tried deleting the 288 rows after each True, but since the error is not always at the start of the day, it does not work as intended.

I am very new to programming and have tried a lot of different answers from everywhere; I would very much appreciate any help.

CodePudding user response:

Group by the date, then filter out any group that has at least one row where to_delete is True.

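# Group rows by calendar day and drop every day that contains a flagged row.
(df
 .groupby(df.index.date)
 .apply(lambda sf: None if sf['to_delete'].any() else sf)
 .reset_index(level=0, drop=True))  # drop the date level added by the groupby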
                           power_c  power_g  temperature  to_delete
date_time                                                          
2019-01-02 00:15:00+00:00     3016        0         10.0      False

I'm assuming date_time is a datetime type. If not, convert it first:

df.index = pd.to_datetime(df.index)
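
As an alternative sketch (not the code above, just another way to express the same idea), the filter can be written as a boolean mask with transform('any'), which marks every row of a day as soon as one row of that day is flagged. The sample frame is rebuilt from the rows shown in the question; building the index with utc=True and the names bad_day and clean are assumptions made for illustration.

```python
import pandas as pd

# Sample rows copied from the question; the UTC-aware index is an assumption.
idx = pd.to_datetime(
    [
        "2019-01-01 00:00:00",
        "2019-01-01 00:05:00",
        "2019-01-01 01:00:00",
        "2019-01-02 00:15:00",
        "2019-01-03 01:20:00",
    ],
    utc=True,
)
df = pd.DataFrame(
    {
        "power_c": [2985, 2258, 2266, 3016, 2265],
        "power_g": [0, 0, 0, 0, 0],
        "temperature": [10.1, 10.1, 10.1, 10.0, 10.0],
        "to_delete": [False, True, False, False, True],
    },
    index=idx,
)
df.index.name = "date_time"

# For every row: True if any row on the same calendar day is flagged.
bad_day = df.groupby(df.index.date)["to_delete"].transform("any")

# Keep only the rows that belong to days with no flagged reading.
clean = df[~bad_day]
print(clean)
```

With the sample rows, only the 2019-01-02 reading should remain, because 2019-01-01 and 2019-01-03 each contain a row marked True. On a frame with 100k+ rows, a mask like this may also be faster than apply with a lambda, though that depends on the data.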