I have a large time-indexed Pandas DataFrame
with time-series data of a couple of devices. The structure of this DataFrame
(in code below self._combined_data_frame
) looks like this:
DateTimeIndex|device_name|col1|...|colN
The DateTimeIndex
and device_name
are filled for every row, the other columns contain nan
values. Sample data is available on Google Drive:
CodePudding user response:
IIUC use merge_asof
:
reference_timestamps = ['2017-12-31','2019-01-31','2020-12-31']
df1 = pd.DataFrame({'reference_timestamp': pd.to_datetime(reference_timestamps)})
DataFrame = pd.merge_asof(DataFrame.reset_index(),
df1,
left_on='DateTimeIndex',
right_on='reference_timestamp',
direction='forward')
print (DataFrame)
DateTimeIndex device_name col reference_timestamp
0 2016-11-30 A NaN 2017-12-31
1 2017-04-30 A NaN 2017-12-31
2 2018-01-31 A NaN 2019-01-31
3 2019-09-30 A NaN 2020-12-31
4 2020-04-30 A NaN 2020-12-31
5 2020-11-30 A NaN 2020-12-31
CodePudding user response:
Using jezrael's solution I modified my code such that it works with the sample data provided. This is my working solution:
Method integrate_trafo_conacq_rawdata
needed to be adjusted:
def integrate_trafo_contacq_rawdata(self, nomen: str, groupby_column: str, drop_columns: list):
print('opening:', nomen)
device_dataset: DataFrame = DataFrame(pd.read_hdf(self._input_file_name, key=nomen)).groupby(groupby_column).sum()
device_dataset = device_dataset.drop(columns=drop_columns)
for column in device_dataset.columns:
device_dataset = device_dataset.rename(columns={column: '_'.join([column, 'sum'])})
self._lastinjection_items = device_dataset.index.to_frame()
device_dataset['nomen'] = '_'.join([nomen.split('_')[0], 'integral'])
device_dataset.index = pd.to_datetime(device_dataset.index)
self._combined_data_frame = pd.concat([self._combined_data_frame, device_dataset])
return self
Updated method add_lastinjection_markers
def add_lastinjection_markers(self):
reference_timestamps = self._lastinjection_items.rename(columns={'lastinjection': 'reference_timestamp'})
frame = self._combined_data_frame.reset_index().rename(columns={'index': 'time'})
frame['time'] = frame['time'].apply(lambda item: item.value)
self._combined_data_frame = pd.merge_asof(frame.sort_values(by='time'),
reference_timestamps,
left_on='time',
right_on='reference_timestamp',
direction='forward')