Python beginner here.
I couldn't find anything similar to this, but I have the feeling it shouldn't be so hard.
I have a large Excel sheet with values from different sensors, but some of the values are missing due to errors in the measurements. So when I put everything into a pandas DataFrame I get something like this:
TimeStamp1 | Sensor1 | TimeStamp2 | Sensor2 |
---|---|---|---|
08:00 | 100 | 08:00 | 60 |
08:05 | 102 | 08:10 | 40 |
08:10 | 105 | 08:15 | 50 |
08:15 | 101 | 08:25 | 31 |
08:20 | 103 | NaT | NaN |
08:25 | 104 | NaT | NaN |
The real dataframe has 7 sensors and more than 100k rows, so there are different numbers of NaT's and NaN's in different columns.
I need timestamps for each sensor to be aligned in order to avoid some inconsistencies. So I want to shift the lines in TimeStamp2 and Sensor2 from the point where it differs from TimeStamp1, add the missing time and a NaN (or empty) value in the position in Sensor2, and make the NaT and NaN at the end disappear from both columns.
An output like this:
TimeStamp1 | Sensor1 | TimeStamp2 | Sensor2 |
---|---|---|---|
08:00 | 100 | 08:00 | 60 |
08:05 | 102 | 08:05 | Empty (NaN) |
08:10 | 105 | 08:10 | 40 |
08:15 | 101 | 08:15 | 50 |
08:20 | 103 | 08:20 | Empty (NaN) |
08:25 | 104 | 08:25 | 31 |
I guess I could simplify the question by asking for a way to insert a specific element in a specific row of a specific column. All the shifting examples I've seen shift the entire column up or down. Is there an easy way to do this?
If it's easier, this solution also works for me:
TimeStamp | Sensor1 | Sensor2 |
---|---|---|
08:00 | 100 | 60 |
08:05 | 102 | Empty (NaN) |
08:10 | 105 | 40 |
08:15 | 101 | 50 |
08:20 | 103 | Empty (NaN) |
08:25 | 104 | 31 |
CodePudding user response:
This will work if your data is set up exactly as in your example; otherwise you'll have to adapt it to your data.
import pandas as pd

# Change the timestamp columns to datetime. You don't say if there's a date component,
# so you may have to get your timestamps in order before moving on.
timestamps = df.filter(regex='TimeStamp').columns.tolist()
for t in timestamps:
    df[t] = pd.to_datetime(df[t])
# get the max and min of all datetimes in the timestamp columns
end = df.filter(regex='TimeStamp').max().max()
start = df.filter(regex='TimeStamp').min().min()
# create a new date range
new_dates = pd.date_range(start=start, end=end, freq='5Min')
# get the number of columns for iteration - it should be even, with the frame holding only
# (timestamp, sensor) column pairs as your example shows
num_columns = df.shape[1]
# iterate over the pairs and concat
dflist = []
for i in range(0, num_columns, 2):
    # columns i and i+1 form one (timestamp, sensor) pair
    d = df.iloc[:, i:i+2].set_index(df.iloc[:, i].name).dropna().reindex(new_dates)
    dflist.append(d)
pd.concat(dflist, axis=1)
                     Sensor1  Sensor2
2021-10-18 08:00:00      100     60.0
2021-10-18 08:05:00      102      NaN
2021-10-18 08:10:00      105     40.0
2021-10-18 08:15:00      101     50.0
2021-10-18 08:20:00      103      NaN
2021-10-18 08:25:00      104     31.0
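For anyone who wants to try this without the original spreadsheet, here is a minimal self-contained sketch of the same reindex idea on the sample from the question. The date `2021-10-18`, the helper name `align_sensors`, and the assumption that every `TimeStampN` column pairs with a `SensorN` column of the same number are illustrative choices, not part of the answer above; the 5-minute spacing is taken from the sample and may not hold for the real data.

```python
import pandas as pd

# Sample data from the question; the date is an assumption so the times parse reproducibly.
day = "2021-10-18 "
df = pd.DataFrame({
    "TimeStamp1": pd.to_datetime([day + t for t in
                                  ["08:00", "08:05", "08:10", "08:15", "08:20", "08:25"]]),
    "Sensor1": [100, 102, 105, 101, 103, 104],
    "TimeStamp2": pd.to_datetime([day + t for t in
                                  ["08:00", "08:10", "08:15", "08:25"]] + [None, None]),
    "Sensor2": [60, 40, 50, 31, None, None],
})


def align_sensors(df, freq="5min"):
    """Hypothetical helper: put every sensor on one regular time axis."""
    ts_cols = [c for c in df.columns if c.startswith("TimeStamp")]
    # Earliest and latest timestamp over all timestamp columns (NaT is skipped).
    start = df[ts_cols].min().min()
    end = df[ts_cols].max().max()
    common_index = pd.date_range(start, end, freq=freq)

    parts = []
    for ts in ts_cols:
        # Assumes the naming scheme TimeStampN <-> SensorN.
        sensor = ts.replace("TimeStamp", "Sensor")
        # Drop the NaT padding rows, index by time, then reindex onto the common axis;
        # timestamps with no reading for this sensor become NaN.
        s = df[[ts, sensor]].dropna(subset=[ts]).set_index(ts)[sensor]
        parts.append(s.reindex(common_index))

    out = pd.concat(parts, axis=1)
    out.index.name = "TimeStamp"
    return out


aligned = align_sensors(df)
print(aligned)
```

Pairing the columns by name rather than by position means the column order in the sheet does not matter, but it only works if the naming really follows that pattern. Note that `reindex` raises an error if a sensor has duplicate timestamps, so those would need to be resolved first.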
CodePudding user response:
@ti7's suggestion is spot on: split the dataframe into individual frames, merge, and fillna:
sensor1 = df.filter(like='1')
sensor2 = df.filter(like='2')
(sensor1.merge(sensor2,
               how='outer',
               left_on='TimeStamp1',
               right_on='TimeStamp2',
               sort=True)
        # fillna aligns df.TimeStamp1 by index label, so this relies on the
        # sorted merge result lining up row for row with df
        .fillna({"TimeStamp2": df.TimeStamp1})
        .dropna(subset=['TimeStamp1'])
)
  TimeStamp1  Sensor1 TimeStamp2  Sensor2
0      08:00    100.0      08:00     60.0
1      08:05    102.0      08:05      NaN
2      08:10    105.0      08:10     40.0
3      08:15    101.0      08:15     50.0
4      08:20    103.0      08:20      NaN
5      08:25    104.0      08:25     31.0
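Since the real data has 7 sensors, the merge idea could be extended to any number of pairs by chaining outer merges on a shared key. The following is only a sketch under assumptions: the helper name `merge_sensors` is made up, the columns are assumed to be named `TimeStampN` / `SensorN`, and the timestamps are kept as the `HH:MM` strings from the sample (which happen to sort correctly as text). It produces the single-`TimeStamp` layout the question lists as an acceptable alternative.

```python
from functools import reduce

import pandas as pd

# Sample data from the question, with times kept as plain strings.
df = pd.DataFrame({
    "TimeStamp1": ["08:00", "08:05", "08:10", "08:15", "08:20", "08:25"],
    "Sensor1": [100, 102, 105, 101, 103, 104],
    "TimeStamp2": ["08:00", "08:10", "08:15", "08:25", None, None],
    "Sensor2": [60, 40, 50, 31, None, None],
})


def merge_sensors(df):
    """Hypothetical helper: outer-merge every (TimeStampN, SensorN) pair on one key."""
    frames = []
    for ts in [c for c in df.columns if c.startswith("TimeStamp")]:
        # Assumes the naming scheme TimeStampN <-> SensorN.
        sensor = ts.replace("TimeStamp", "Sensor")
        # Drop the padding rows with no timestamp, then give the key a common name.
        pair = df[[ts, sensor]].dropna(subset=[ts]).rename(columns={ts: "TimeStamp"})
        frames.append(pair)

    # Chain outer merges so a timestamp seen by any sensor gets a row.
    merged = reduce(lambda left, right: left.merge(right, on="TimeStamp", how="outer"),
                    frames)
    return merged.sort_values("TimeStamp").reset_index(drop=True)


merged = merge_sensors(df)
print(merged)
```

Unlike reindexing onto a fixed-frequency range, this does not assume regular 5-minute spacing: the rows are simply the union of all timestamps that appear for any sensor. If a sensor has duplicate timestamps, the merge will multiply rows, so duplicates are worth checking for beforehand.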