I have a dataframe like this:
Payload Timestamp
3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e 2022-03-16 18:00:08
3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e 2022-03-16 18:10:07
3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e3434323835 2022-03-16 18:20:08
3e 2022-03-16 18:20:11
3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e 2022-03-16 18:30:09
3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830 2022-03-16 18:40:08
393e 2022-03-16 18:40:14
Theoretically I receive the data (Payload column) periodically, every X minutes (10 in this example). But if a message is too long, it is split and sent in parts.
When a message is split, the parts arrive practically one after the other, a few seconds apart. I would therefore like to build a new dataframe in which, whenever the difference between one row's timestamp and the next is less than X seconds (60, for example), the payload values are joined. As for the time column, the new dataframe could show the date and time the rows share, without the seconds.
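For reference, a minimal frame with the same shape can be built like this (the hex payloads are shortened with "..." placeholders here, not my real values):

```python
import pandas as pd

# Minimal reproducible frame: the third payload is split across two rows
# that arrive three seconds apart.
df = pd.DataFrame(
    {
        "Payload": ["3c3130...323e", "3c3130...313e", "3c3130...3835", "3e"],
        "Timestamp": [
            "2022-03-16 18:00:08",
            "2022-03-16 18:10:07",
            "2022-03-16 18:20:08",
            "2022-03-16 18:20:11",
        ],
    }
)
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
```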
I would want something like this:
Payload Timestamp
3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e 2022-03-16 18:00
3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e 2022-03-16 18:10
3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e34343238353e 2022-03-16 18:20
3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e 2022-03-16 18:30
3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830393e 2022-03-16 18:40
The first thing, I think, is to obtain the difference between consecutive times:
df['time_dif'] = df['Timestamp'].diff()
But I'm not sure what to do next. How can I group it? Reading other posts, I understand I have to use groupby, but I can't achieve my goal.
I made some attempts like this:
fdf = df.groupby(['Timestamp',pd.Grouper(key = 'Timestamp', freq='1min')])
Can somebody help me?
Thank you!
CodePudding user response:
Try with resample:
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
# "sum" on strings concatenates, so split payloads are joined within each 10-minute bin
output = df.resample("10min", on="Timestamp")["Payload"].sum()
>>> output
Timestamp Payload
2022-03-16 18:00:00 3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e
2022-03-16 18:10:00 3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e
2022-03-16 18:20:00 3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e34343238353e
2022-03-16 18:30:00 3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e
2022-03-16 18:40:00 3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830393e
If you don't want to snap to fixed 10-minute bins, but instead want to group rows by the time difference between consecutive messages, you can do:
output = (
    df.groupby(df["Timestamp"].diff().dt.total_seconds().gt(60).cumsum())
      .agg({"Timestamp": "first", "Payload": "sum"})
)
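For instance, on a toy frame where one payload is split across two rows arriving seconds apart (shortened, made-up payload strings):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Payload": ["aaa", "bbb", "b2", "ccc"],
        "Timestamp": pd.to_datetime(
            [
                "2022-03-16 18:00:08",
                "2022-03-16 18:20:08",
                "2022-03-16 18:20:11",  # arrives 3 s after the previous row
                "2022-03-16 18:30:09",
            ]
        ),
    }
)

# A new group starts whenever the gap to the previous row exceeds 60 seconds;
# cumsum() turns those break points into group labels 0, 1, 1, 2.
groups = df["Timestamp"].diff().dt.total_seconds().gt(60).cumsum()
output = df.groupby(groups).agg({"Timestamp": "first", "Payload": "sum"})
print(output)
```

The two rows at 18:20 collapse into one row whose Payload is "bbbb2" and whose Timestamp is the first of the pair.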
CodePudding user response:
You can use dt.round. With dt.round, a message that arrives a little early is still not grouped with the previous message (unlike resample, which bins by flooring).
# Round each Timestamp to the nearest 10 minutes and concatenate the payloads
df.groupby(df['Timestamp'].dt.round('10T'))['Payload'].sum().reset_index()[df.columns]
# Output
Payload Timestamp
0 3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e 2022-03-16 18:00:00
1 3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e 2022-03-16 18:10:00
2 3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e34343238353e 2022-03-16 18:20:00
3 3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e 2022-03-16 18:30:00
4 3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830393e 2022-03-16 18:40:00
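To see the difference, consider a hypothetical message that arrives two seconds early: resample("10min") would floor 18:19:58 into the 18:10 bin together with the previous message, whereas rounding keeps them apart:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Payload": ["aaa", "bbb"],  # made-up payload strings
        "Timestamp": pd.to_datetime(["2022-03-16 18:10:07", "2022-03-16 18:19:58"]),
    }
)

# Rounding to the nearest 10 minutes maps 18:19:58 to 18:20, not 18:10,
# so the early message stays in its own group.
out = df.groupby(df["Timestamp"].dt.round("10min"))["Payload"].sum().reset_index()
print(out)
```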
CodePudding user response:
I suggest you use something like:
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df.set_index("Timestamp", inplace=True)
# "sum" concatenates the string payloads per minute; empty minutes sum to 0
newDf = df.groupby(pd.Grouper(freq="1Min")).sum()
newDf = newDf[newDf["Payload"] != 0]  # drop the empty minutes
Output
| Timestamp | Payload |
| --- | --- |
| 2022-03-16 18:00:00 | 3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e |
| 2022-03-16 18:10:00 | 3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e |
| 2022-03-16 18:20:00 | 3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e34343238353e |
| 2022-03-16 18:30:00 | 3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e |
| 2022-03-16 18:40:00 | 3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830393e |
Note that in this solution, Timestamp is the index of newDf. So, to get rid of the index and turn it back into a column, as it was before, simply use newDf.reset_index(inplace=True).
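A quick sketch of that last step, with a hypothetical newDf:

```python
import pandas as pd

# Hypothetical result frame indexed by Timestamp, as newDf is after the groupby.
newDf = pd.DataFrame(
    {"Payload": ["aaa", "bbb"]},
    index=pd.to_datetime(["2022-03-16 18:00", "2022-03-16 18:10"]),
)
newDf.index.name = "Timestamp"

newDf.reset_index(inplace=True)  # Timestamp moves from the index to a column
print(newDf.columns.tolist())  # → ['Timestamp', 'Payload']
```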