Round time during weekends in dataframe to the next Monday

Time:09-28

I'm trying to round the timestamps in my Twitter data to the nearest 5-minute interval so that they match my stock price data (5-minute bars). Currently this works for tweets posted between 9:30 and 16:00 on trading days, and for tweets posted after 16:00, which map to the next morning (if it's not a Friday). But since 2021/04/09 is a Friday and 2021/04/10 is a Saturday, tweets posted after Friday 16:00 need to be mapped to 9:30 on the next Monday (2021/04/12 in this case).

Input:

tweet   time
tweet1  2021-04-08 18:42:48
tweet2  2021-04-09 00:42:48
tweet3  2021-04-09 09:42:48
tweet4  2021-04-09 17:42:48
tweet5  2021-04-10 11:42:48

Price:

price  time
100    2021-04-08 16:00:00
110    2021-04-09 09:30:00
120    2021-04-09 09:45:00
130    2021-04-12 09:30:00

Expected output:

tweet   time                 price
tweet1  2021-04-09 09:30:00  110
tweet2  2021-04-09 09:30:00  110
tweet3  2021-04-09 09:45:00  120
tweet4  2021-04-12 09:30:00  130
tweet5  2021-04-12 09:30:00  130

This is what I have done so far. It does not handle tweets posted during weekends or on Fridays after 16:00:

def time_map(row):
    new_col_tmp = row['new_col_tmp']  # rounded time as 'HHMM'
    new_col2 = row['new_col2']        # rounded date as 'YYYYMMDD'

    if 930 <= int(new_col_tmp) <= 1600:
        # Inside trading hours: keep the rounded time.
        convert = new_col2 + " " + new_col_tmp
    else:
        # Outside trading hours: use 09:30.
        convert = new_col2 + " " + '0930'

    return convert

tweet['new_col1'] = tweet.created_at.round("5min").dt.strftime('%H:%M')
tweet['new_col_tmp'] = tweet.created_at.round("5min").dt.strftime('%H%M')
tweet['new_col2'] = tweet.created_at.round("5min").dt.strftime('%Y%m%d')
tweet['convert'] = tweet.apply(lambda row: time_map(row), axis=1)
tweet['convert'] = pd.to_datetime(tweet.convert, format='%Y%m%d %H%M').dt.strftime('%d/%m/%Y %H:%M')

How could I solve the case for weekends? Any help is appreciated!

CodePudding user response:

Given df and df2:

    tweet                time
0  tweet1 2021-04-08 18:42:48
1  tweet2 2021-04-09 00:42:48
2  tweet3 2021-04-09 09:42:48
3  tweet4 2021-04-09 17:42:48
4  tweet5 2021-04-10 11:42:48

   price                time
0    100 2021-04-08 16:00:00
1    110 2021-04-09 09:30:00
2    120 2021-04-09 09:45:00
3    130 2021-04-12 09:30:00

Doing:

# Outer merge the two dataframes on time.
# Sort this new dataframe by time.
out = df.merge(df2, on='time', how='outer').sort_values('time')

# Anywhere that price is NaN came from df.
# Now that it's sorted, we want to remove these time values.
out.loc[out.price.isna(), 'time'] = pd.NaT

# Now, we backfill the data, drop rows with nans,
# and keep only the last row of each tweet.
out = out.bfill().dropna().drop_duplicates('tweet', keep='last')
print(out)

Output:

                 time   tweet  price
0 2021-04-09 09:30:00  tweet1  110.0
1 2021-04-09 09:30:00  tweet2  110.0
2 2021-04-09 09:45:00  tweet3  120.0
3 2021-04-12 09:30:00  tweet4  130.0
4 2021-04-12 09:30:00  tweet5  130.0
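
As an alternative to the outer merge and backfill, the same "next available price row" lookup can probably be written with pd.merge_asof and direction='forward'. This is a sketch of a different technique, not the method used above. The frames are rebuilt from the sample data in the question, and the bar_time column is a helper name introduced here only so the matched price time survives the merge.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'tweet': ['tweet1', 'tweet2', 'tweet3', 'tweet4', 'tweet5'],
    'time': pd.to_datetime([
        '2021-04-08 18:42:48', '2021-04-09 00:42:48', '2021-04-09 09:42:48',
        '2021-04-09 17:42:48', '2021-04-10 11:42:48',
    ]),
})
df2 = pd.DataFrame({
    'price': [100, 110, 120, 130],
    'time': pd.to_datetime([
        '2021-04-08 16:00:00', '2021-04-09 09:30:00',
        '2021-04-09 09:45:00', '2021-04-12 09:30:00',
    ]),
})

# merge_asof requires both frames to be sorted by the key column.
left = df.sort_values('time')
# merge_asof keeps only the left 'time', so copy the price time into
# a helper column (bar_time is a name chosen for this sketch).
right = df2.sort_values('time').assign(bar_time=lambda d: d['time'])

# direction='forward' picks the first price row at or after each tweet.
out = pd.merge_asof(left, right, on='time', direction='forward')
out = out[['tweet', 'bar_time', 'price']].rename(columns={'bar_time': 'time'})
print(out)
```

With this approach the weekend case needs no special handling, because the next row in the price data after a Friday-evening or Saturday tweet is already Monday 9:30. A tweet later than the last price row gets no match and ends up with missing values.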

CodePudding user response:

I wrote a function that rounds the time according to your requirements. It only handles weekends; in your real data you will probably also need to handle market holidays, so adapt it accordingly.

import datetime

import pandas as pd

def round_time(timestamp):
    # Round to the nearest 5 minutes.
    timestamp = timestamp.round('5min')
    # Friday after 16:00 or weekend: move to 9:30 on the next Monday.
    if timestamp.weekday() > 4 or (timestamp.weekday() == 4 and timestamp.time() > pd.to_datetime('16:00').time()):
        timestamp = pd.to_datetime((timestamp.date() + datetime.timedelta(7 - timestamp.weekday())).strftime('%Y-%m-%d') + ' 9:30')
    else:
        # Before 9:30: move to 9:30 on the same day.
        if timestamp.time() < pd.to_datetime('9:30').time():
            timestamp = pd.to_datetime(timestamp.date().strftime('%Y-%m-%d') + ' 9:30')
        # After 16:00: move to 9:30 on the next day.
        elif timestamp.time() > pd.to_datetime('16:00').time():
            timestamp = pd.to_datetime((timestamp.date() + datetime.timedelta(1)).strftime('%Y-%m-%d') + ' 9:30')
    return timestamp
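
For the holiday case mentioned above, one possible variant uses pandas' CustomBusinessDay offset to decide what counts as a trading day. This is only a sketch under assumptions: the function name next_bar, the holidays parameter, and the OPEN and CLOSE constants are made up for illustration, the holiday list has to be supplied by you, and the 9:30 to 16:00 session is taken from the question.

```python
import pandas as pd

# Session bounds taken from the question (9:30 to 16:00).
OPEN = pd.Timedelta(hours=9, minutes=30)
CLOSE = pd.Timedelta(hours=16)

def next_bar(ts, holidays=()):
    """Hypothetical helper: map a timestamp to a 5-minute bar in trading hours."""
    # Weekends are skipped by default; holidays are whatever the caller passes in.
    bday = pd.offsets.CustomBusinessDay(holidays=list(holidays) or None)
    # Round to the nearest 5 minutes, as in the function above.
    ts = pd.Timestamp(ts).round('5min')
    day = ts.normalize()
    # Trading day and not past the close: keep the time, but not before the open.
    if bday.is_on_offset(day) and ts <= day + CLOSE:
        return max(ts, day + OPEN)
    # Otherwise jump to the open of the next trading day.
    return day + bday + OPEN

print(next_bar('2021-04-09 17:42:48'))
print(next_bar('2021-04-08 18:42:48', holidays=['2021-04-09']))
```

On a DataFrame this could be applied with something like tweet['created_at'].apply(next_bar), assuming created_at is already a datetime column.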