I have a dataframe like this:
Payload Timestamp
3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e 2022-03-16 18:00:08
3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e 2022-03-16 18:10:07
3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e3434323835 2022-03-16 18:20:08
3e 2022-03-16 18:20:11
3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e 2022-03-16 18:30:09
3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830 2022-03-16 18:40:08
393e 2022-03-16 18:40:14
Theoretically I receive the data (Payload column) periodically, every X minutes (10 in this example). But if a message is too long, it is split and sent in parts.
When a message is split, the parts arrive practically one after the other, a few seconds apart. I would therefore like to build a new dataframe in which, whenever the difference between one row's timestamp and the next is less than X seconds (60, for example), the payload values are joined. As for the time column, the new dataframe could show the date and time the rows share, without the seconds.
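For reference, a minimal frame with the same shape can be built like this (the hex payloads are shortened with "..." placeholders here, not my real values):

```python
import pandas as pd

# Minimal reproducible frame: the third payload is split across two rows
# that arrive three seconds apart.
df = pd.DataFrame(
    {
        "Payload": ["3c3130...323e", "3c3130...313e", "3c3130...3835", "3e"],
        "Timestamp": [
            "2022-03-16 18:00:08",
            "2022-03-16 18:10:07",
            "2022-03-16 18:20:08",
            "2022-03-16 18:20:11",
        ],
    }
)
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
```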
I would want something like this:
Payload Timestamp
3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e 2022-03-16 18:00
3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e 2022-03-16 18:10
3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e34343238353e 2022-03-16 18:20
3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e 2022-03-16 18:30
3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830393e 2022-03-16 18:40
The first thing, I think, is to obtain the difference between consecutive times:
df['time_dif'] = df['Timestamp'].diff()
But I'm not sure what to do next. How can I group it? Reading other posts, I understand I have to use groupby, but I can't achieve my goal.
I made some attempts like this:
fdf = df.groupby(['Timestamp',pd.Grouper(key = 'Timestamp', freq='1min')])
Can somebody help me?
Thank you!
CodePudding user response:
Try with resample:
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
# "sum" on strings concatenates, so split payloads are joined within each 10-minute bin
output = df.resample("10min", on="Timestamp")["Payload"].sum()
>>> output
Timestamp Payload
2022-03-16 18:00:00 3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e
2022-03-16 18:10:00 3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e
2022-03-16 18:20:00 3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e34343238353e
2022-03-16 18:30:00 3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e
2022-03-16 18:40:00 3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830393e
If you don't want to snap to fixed 10-minute bins, but instead want to group rows by the time difference between consecutive messages, you can do:
output = (
    df.groupby(df["Timestamp"].diff().dt.total_seconds().gt(60).cumsum())
      .agg({"Timestamp": "first", "Payload": "sum"})
)
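For instance, on a toy frame where one payload is split across two rows arriving seconds apart (shortened, made-up payload strings):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Payload": ["aaa", "bbb", "b2", "ccc"],
        "Timestamp": pd.to_datetime(
            [
                "2022-03-16 18:00:08",
                "2022-03-16 18:20:08",
                "2022-03-16 18:20:11",  # arrives 3 s after the previous row
                "2022-03-16 18:30:09",
            ]
        ),
    }
)

# A new group starts whenever the gap to the previous row exceeds 60 seconds;
# cumsum() turns those break points into group labels 0, 1, 1, 2.
groups = df["Timestamp"].diff().dt.total_seconds().gt(60).cumsum()
output = df.groupby(groups).agg({"Timestamp": "first", "Payload": "sum"})
print(output)
```

The two rows at 18:20 collapse into one row whose Payload is "bbbb2" and whose Timestamp is the first of the pair.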
CodePudding user response:
You can use dt.round. With dt.round, a message that arrives a little early is still not grouped with the previous message (unlike resample, which bins by flooring).
# Round each Timestamp to the nearest 10 minutes and concatenate the payloads
df.groupby(df['Timestamp'].dt.round('10T'))['Payload'].sum().reset_index()[df.columns]
# Output
Payload Timestamp
0 3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e 2022-03-16 18:00:00
1 3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e 2022-03-16 18:10:00
2 3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e34343238353e 2022-03-16 18:20:00
3 3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e 2022-03-16 18:30:00
4 3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830393e 2022-03-16 18:40:00
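To see the difference, consider a hypothetical message that arrives two seconds early: resample("10min") would floor 18:19:58 into the 18:10 bin together with the previous message, whereas rounding keeps them apart:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Payload": ["aaa", "bbb"],  # made-up payload strings
        "Timestamp": pd.to_datetime(["2022-03-16 18:10:07", "2022-03-16 18:19:58"]),
    }
)

# Rounding to the nearest 10 minutes maps 18:19:58 to 18:20, not 18:10,
# so the early message stays in its own group.
out = df.groupby(df["Timestamp"].dt.round("10min"))["Payload"].sum().reset_index()
print(out)
```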
CodePudding user response:
I suggest you use something like:
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df.set_index("Timestamp", inplace=True)
# "sum" concatenates the string payloads per minute; empty minutes sum to 0
newDf = df.groupby(pd.Grouper(freq="1Min")).sum()
newDf = newDf[newDf["Payload"] != 0]  # drop the empty minutes
Output
| Timestamp | Payload |
| --- | --- |
| 2022-03-16 18:00:00 | 3c313034393536353534303b31332e32333534313b303b302e303637313b302e303138313b332e33353b36322e31353338323e |
| 2022-03-16 18:10:00 | 3c313034393536363134303b31332e32313134383b303b302e303637323b302e3031383b342e3632353b37332e34353133313e |
| 2022-03-16 18:20:00 | 3c313034393536363734303b31332e31393630313b303b302e303637323b302e3031383b342e303132353b36322e34343238353e |
| 2022-03-16 18:30:00 | 3c313034393537353734303b31332e31303137343b303b302e3036373b302e303137393b342e383132353b37352e343931393e |
| 2022-03-16 18:40:00 | 3c313034393538323934303b31332e30353934353b303b302e303636373b302e303137393b332e313337353b37352e37313830393e |
Note that in this solution, Timestamp is the index of newDf. So, to get rid of the index and turn it back into a column, as it was before, simply use newDf.reset_index(inplace=True).
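A quick sketch of that last step, with a hypothetical newDf:

```python
import pandas as pd

# Hypothetical result frame indexed by Timestamp, as newDf is after the groupby.
newDf = pd.DataFrame(
    {"Payload": ["aaa", "bbb"]},
    index=pd.to_datetime(["2022-03-16 18:00", "2022-03-16 18:10"]),
)
newDf.index.name = "Timestamp"

newDf.reset_index(inplace=True)  # Timestamp moves from the index to a column
print(newDf.columns.tolist())  # → ['Timestamp', 'Payload']
```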