Shift specific rows to correct missing values in a pandas DataFrame

Python beginner here.

I couldn't find anything similar to this, but I have the feeling it shouldn't be so hard.

I have a large Excel sheet with values from different sensors, but some of the values are missing due to measurement errors. So when I load everything into a pandas DataFrame, I get something like this:

TimeStamp1	Sensor1	TimeStamp2	Sensor2
08:00	100	08:00	60
08:05	102	08:10	40
08:10	105	08:15	50
08:15	101	08:25	31
08:20	103	NaT	NaN
08:25	104	NaT	NaN

The real dataframe has 7 sensors and more than 100k rows, so there are different numbers of NaT's and NaN's in different columns.

I need the timestamps for each sensor to be aligned, to avoid inconsistencies. So, starting from the row where TimeStamp2 first differs from TimeStamp1, I want to shift the rows of TimeStamp2 and Sensor2 down, insert the missing time in TimeStamp2 and a NaN (or empty) value at the same position in Sensor2, and make the NaT and NaN at the end of both columns disappear.

An output like this:

TimeStamp1	Sensor1	TimeStamp2	Sensor2
08:00	100	08:00	60
08:05	102	08:05	Empty (NaN)
08:10	105	08:10	40
08:15	101	08:15	50
08:20	103	08:20	Empty (NaN)
08:25	104	08:25	31

I guess I could simplify the question by asking a way to insert a specific element in a specific row of a specific column. All shifting examples I've seen will shift the entire column up or down. Is there an easy way to do this?

If it's easier, this solution also works for me:

TimeStamp	Sensor1	Sensor2
08:00	100	60
08:05	102	Empty (NaN)
08:10	105	40
08:15	101	50
08:20	103	Empty (NaN)
08:25	104	31

CodePudding user response：

This will work if your data is set up exactly as in your example; otherwise you'll have to adapt it to your data.

import pandas as pd

# Convert the timestamp columns to datetime. You don't say whether there's a date
# component, so you may have to get your timestamps in order before moving on.
timestamps = df.filter(regex='TimeStamp').columns.tolist()
for t in timestamps:
    df[t] = pd.to_datetime(df[t])

# get the max and min of all datetimes in the timestamp columns
end = df.filter(regex='TimeStamp').max().max()
start = df.filter(regex='TimeStamp').min().min()

# create a new date range
new_dates = pd.date_range(start=start, end=end, freq='5Min')

# number of columns - should be even, with each timestamp column followed by its sensor column, as your example shows
num_columns = df.shape[1]

# take each (timestamp, sensor) pair, index it by its timestamp, reindex to the full range, then concat
dflist = []
for i in range(0, num_columns, 2):
    d = df.iloc[:, i:i + 2].set_index(df.iloc[:, i].name).dropna().reindex(new_dates)
    dflist.append(d)
pd.concat(dflist, axis=1)

                     Sensor1  Sensor2
2021-10-18 08:00:00      100     60.0
2021-10-18 08:05:00      102      NaN
2021-10-18 08:10:00      105     40.0
2021-10-18 08:15:00      101     50.0
2021-10-18 08:20:00      103      NaN
2021-10-18 08:25:00      104     31.0
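
If the sampling interval is not a fixed 5 minutes, a variation on the same idea is to skip date_range and let pd.concat align the sensors on the union of their timestamps. This is only a sketch under assumptions: the sample frame is rebuilt from the question, the date 2021-10-18 is borrowed from the output above (the question gives times only), and the names parts and aligned are made up for illustration. It also assumes each sensor's timestamps are unique, since concat cannot align on duplicated index values.

```python
import pandas as pd

day = "2021-10-18 "  # assumed date; the question only shows times
ts1 = [day + t for t in ["08:00", "08:05", "08:10", "08:15", "08:20", "08:25"]]
ts2 = [day + t for t in ["08:00", "08:10", "08:15", "08:25"]] + [None, None]
nan = float("nan")

# Sample frame rebuilt from the question's example
df = pd.DataFrame({
    "TimeStamp1": pd.to_datetime(ts1),
    "Sensor1": [100, 102, 105, 101, 103, 104],
    "TimeStamp2": pd.to_datetime(ts2),
    "Sensor2": [60, 40, 50, 31, nan, nan],
})

# Pair each timestamp column with the sensor column right after it,
# drop the NaT padding rows, and index the sensor by its own timestamps.
parts = []
for i in range(0, df.shape[1], 2):
    ts_col, sensor_col = df.columns[i], df.columns[i + 1]
    pair = df[[ts_col, sensor_col]].dropna(subset=[ts_col])
    parts.append(pair.set_index(ts_col)[sensor_col])

# concat with axis=1 does an outer join on the index, so every timestamp
# seen by any sensor gets a row and the gaps become NaN.
aligned = pd.concat(parts, axis=1).sort_index()
aligned.index.name = "TimeStamp"
print(aligned)
```

Unlike reindex(new_dates), this never invents timestamps that no sensor recorded, so a time that is missing from every sensor simply does not appear in the result.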

CodePudding user response：

@ti7's suggestion is spot on: split the dataframe into individual frames, merge, and fillna:

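# split into one frame per sensor (columns whose names contain '1' or '2')
sensor1 = df.filter(like='1')
sensor2 = df.filter(like='2')

# outer-merge on the timestamps, fill the gaps in TimeStamp2 from TimeStamp1,
# then drop rows that have no TimeStamp1 (such as the NaT padding)
(sensor1.merge(sensor2,
               how='outer',
               left_on='TimeStamp1',
               right_on='TimeStamp2',
               sort=True)
        .fillna({"TimeStamp2": df.TimeStamp1})
        .dropna(subset=['TimeStamp1'])
)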
  TimeStamp1  Sensor1 TimeStamp2  Sensor2
0      08:00    100.0      08:00     60.0
1      08:05    102.0      08:05      NaN
2      08:10    105.0      08:10     40.0
3      08:15    101.0      08:15     50.0
4      08:20    103.0      08:20      NaN
5      08:25    104.0      08:25     31.0
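
With 7 sensors, one way to extend the merge idea might be to rename every timestamp column to a shared name and fold the per-sensor frames together with functools.reduce, which gives the single-TimeStamp layout from the end of the question. This is a rough sketch, not the method from the answer above: it assumes the columns are named TimeStampN / SensorN as in the example, and the names frames and merged are made up for illustration. The timestamps are kept as zero-padded HH:MM strings here, matching the output above; real data would more likely be datetimes.

```python
from functools import reduce

import pandas as pd

nan = float("nan")

# Sample frame rebuilt from the question's example (times as strings)
df = pd.DataFrame({
    "TimeStamp1": ["08:00", "08:05", "08:10", "08:15", "08:20", "08:25"],
    "Sensor1": [100, 102, 105, 101, 103, 104],
    "TimeStamp2": ["08:00", "08:10", "08:15", "08:25", None, None],
    "Sensor2": [60, 40, 50, 31, nan, nan],
})

# One small frame per sensor: (TimeStampN, SensorN) -> (TimeStamp, SensorN),
# with the padding rows (missing timestamp) removed.
frames = []
for ts_col in df.filter(like="TimeStamp").columns:
    n = ts_col[len("TimeStamp"):]  # assumed naming scheme: "1", "2", ...
    frame = (df[[ts_col, f"Sensor{n}"]]
             .dropna(subset=[ts_col])
             .rename(columns={ts_col: "TimeStamp"}))
    frames.append(frame)

# Outer-merge all sensors on the shared TimeStamp column, one after another.
merged = reduce(
    lambda left, right: left.merge(right, on="TimeStamp", how="outer"),
    frames,
)
merged = merged.sort_values("TimeStamp").reset_index(drop=True)
print(merged)
```

Because every frame is merged on the same TimeStamp key, there is no need for the fillna step, and a timestamp that only one sensor recorded still gets its own row.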