Shift specific rows to correct missing values in a Pandas Dataframe

Time:10-19

Python beginner here.

I couldn't find anything similar to this, but I have the feeling it shouldn't be so hard.

I have a large Excel sheet with values from different sensors, but some of the values are missing due to measurement errors. So when I load everything into a pandas dataframe I get something like this:

TimeStamp1  Sensor1  TimeStamp2  Sensor2
08:00       100      08:00       60
08:05       102      08:10       40
08:10       105      08:15       50
08:15       101      08:25       31
08:20       103      NaT         NaN
08:25       104      NaT         NaN

The real dataframe has 7 sensors and more than 100k rows, so the number of NaT and NaN values differs from column to column.

I need the timestamps for each sensor to be aligned in order to avoid inconsistencies. So I want to shift the rows in TimeStamp2 and Sensor2 down from the point where TimeStamp2 differs from TimeStamp1, insert the missing time in TimeStamp2 and a NaN (or empty) value at the same position in Sensor2, and make the NaT and NaN at the end disappear from both columns.

An output like this:

TimeStamp1  Sensor1  TimeStamp2  Sensor2
08:00       100      08:00       60
08:05       102      08:05       Empty (NaN)
08:10       105      08:10       40
08:15       101      08:15       50
08:20       103      08:20       Empty (NaN)
08:25       104      08:25       31

I guess I could simplify the question by asking for a way to insert a specific element in a specific row of a specific column. All the shifting examples I've seen shift the entire column up or down. Is there an easy way to do this?

If it's easier, this output would also work for me:

TimeStamp  Sensor1  Sensor2
08:00      100      60
08:05      102      Empty (NaN)
08:10      105      40
08:15      101      50
08:20      103      Empty (NaN)
08:25      104      31

CodePudding user response:

This will work if your data is set up exactly as in your example; otherwise you'll have to adapt it to your data.

import pandas as pd

# change timestamp columns to datetime. You don't say if there's a date component,
# so you may have to get your timestamps in order before moving on.
timestamps = df.filter(regex='TimeStamp').columns.tolist()
for t in timestamps:
    df[t] = pd.to_datetime(df[t])

# get the max and min of all datetimes in the timestamp columns
end = df.filter(regex='TimeStamp').max().max()
start = df.filter(regex='TimeStamp').min().min()

# create a new date range covering every expected 5-minute step
new_dates = pd.date_range(start=start, end=end, freq='5min')

# get the column count for iteration - it should be even, with the columns
# laid out as (timestamp, sensor) pairs as your example shows
num_columns = df.shape[1]

# reindex each (timestamp, sensor) pair onto the new date range, then concat
dflist = []
for i in range(0, num_columns, 2):
    d = df.iloc[:, i:i + 2].set_index(df.iloc[:, i].name).dropna().reindex(new_dates)
    dflist.append(d)
pd.concat(dflist, axis=1)

                     Sensor1  Sensor2
2021-10-18 08:00:00      100     60.0
2021-10-18 08:05:00      102      NaN
2021-10-18 08:10:00      105     40.0
2021-10-18 08:15:00      101     50.0
2021-10-18 08:20:00      103      NaN
2021-10-18 08:25:00      104     31.0
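
For anyone who wants to try the reindex idea end to end, here is a minimal, self-contained sketch. It rebuilds the sample frame from the question. The helper name `align_sensors` and the explicit `format="%H:%M"` are my own assumptions, not part of the answer above, and with that format the times land on the placeholder date 1900-01-01. It also assumes the columns come in (timestamp, sensor) pairs and that readings fall on a regular 5-minute grid.

```python
import pandas as pd

# Sample data from the question; None marks the rows that are missing.
df = pd.DataFrame({
    "TimeStamp1": ["08:00", "08:05", "08:10", "08:15", "08:20", "08:25"],
    "Sensor1": [100, 102, 105, 101, 103, 104],
    "TimeStamp2": ["08:00", "08:10", "08:15", "08:25", None, None],
    "Sensor2": [60, 40, 50, 31, None, None],
})


def align_sensors(frame, freq="5min"):
    """Reindex every (timestamp, sensor) pair onto one shared time grid.

    Hypothetical helper: assumes paired columns and a regular sampling step.
    """
    out = frame.copy()
    ts_cols = out.filter(regex="TimeStamp").columns
    for col in ts_cols:
        # Times without a date are parsed onto 1900-01-01.
        out[col] = pd.to_datetime(out[col], format="%H:%M")

    # Shared grid from the earliest to the latest timestamp seen anywhere.
    grid = pd.date_range(out[ts_cols].min().min(),
                         out[ts_cols].max().max(), freq=freq)

    pieces = []
    for i in range(0, out.shape[1], 2):
        pair = out.iloc[:, i:i + 2]
        ts_name = pair.columns[0]
        # Drop the padding rows (NaT), then let reindex insert NaN for gaps.
        pieces.append(pair.dropna(subset=[ts_name])
                          .set_index(ts_name)
                          .reindex(grid))
    return pd.concat(pieces, axis=1)


aligned = align_sensors(df)
print(aligned)
```

If your sensors don't report on a fixed step, the grid built by `pd.date_range` won't match the real timestamps and the reindex will discard readings, so check that assumption against your data first.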

CodePudding user response:

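@ti7's suggestion is spot on: split the dataframe into individual frames, merge, and fillna:

# one frame per sensor: the columns whose names contain '1' and '2'
sensor1 = df.filter(like='1')
sensor2 = df.filter(like='2')

# outer merge on the timestamps, fill the gaps in TimeStamp2 from TimeStamp1
# (aligned on the row index), then drop the leftover NaT/NaN rows
(sensor1.merge(sensor2,
               how='outer',
               left_on='TimeStamp1',
               right_on='TimeStamp2',
               sort=True)
        .fillna({"TimeStamp2": df.TimeStamp1})
        .dropna(subset=['TimeStamp1'])
)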
  TimeStamp1  Sensor1 TimeStamp2  Sensor2
0      08:00    100.0      08:00     60.0
1      08:05    102.0      08:05      NaN
2      08:10    105.0      08:10     40.0
3      08:15    101.0      08:15     50.0
4      08:20    103.0      08:20      NaN
5      08:25    104.0      08:25     31.0
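
With 7 sensors, the same merge idea can be looped instead of written out by hand. Below is a rough sketch, not the code from the answer above: the helper name `merge_sensors` and the shared column name `TimeStamp` are assumptions of mine. It produces the single-timestamp layout from the end of the question, and it assumes paired (timestamp, sensor) columns with no duplicate timestamps within one sensor.

```python
from functools import reduce

import pandas as pd

# Sample data from the question; None marks the rows that are missing.
df = pd.DataFrame({
    "TimeStamp1": ["08:00", "08:05", "08:10", "08:15", "08:20", "08:25"],
    "Sensor1": [100, 102, 105, 101, 103, 104],
    "TimeStamp2": ["08:00", "08:10", "08:15", "08:25", None, None],
    "Sensor2": [60, 40, 50, 31, None, None],
})


def merge_sensors(frame):
    """Outer-merge every (timestamp, sensor) pair on one shared key column.

    Hypothetical helper: assumes the columns alternate timestamp, sensor.
    """
    pairs = []
    for i in range(0, frame.shape[1], 2):
        ts_name = frame.columns[i]
        # Remove the padding rows so missing keys never enter the merge.
        pair = frame.iloc[:, i:i + 2].dropna(subset=[ts_name])
        pairs.append(pair.rename(columns={ts_name: "TimeStamp"}))

    # Fold the pairs together; an outer join keeps every timestamp seen.
    return reduce(
        lambda left, right: left.merge(right, on="TimeStamp",
                                       how="outer", sort=True),
        pairs,
    )


merged = merge_sensors(df)
print(merged)
```

Unlike the reindex approach, this doesn't need a fixed sampling interval: the result has one row per timestamp that any sensor reported. The flip side is that a timestamp missing from every sensor won't show up at all.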