I'm trying to round the time in my Twitter data to the nearest 5-minute interval to match the stock price data (5-minute windows). Currently, it works fine for tweets between 9:30 and 16:00 on trading days, and for tweets after 16:00 up to the next morning (if it's not a Friday). But since 2021/04/09 is a Friday and 2021/04/10 is a Saturday, tweets posted after Friday 16:00 need to be mapped to the next Monday 9:30 (2021/04/12 in this case).
Input:
tweet | time |
---|---|
tweet1 | 2021-04-08 18:42:48 |
tweet2 | 2021-04-09 00:42:48 |
tweet3 | 2021-04-09 09:42:48 |
tweet4 | 2021-04-09 17:42:48 |
tweet5 | 2021-04-10 11:42:48 |
price:
price | time |
---|---|
100 | 2021-04-08 16:00:00 |
110 | 2021-04-09 09:30:00 |
120 | 2021-04-09 09:45:00 |
130 | 2021-04-12 09:30:00 |
Expected output:
tweet | time | price |
---|---|---|
tweet1 | 2021-04-09 09:30:00 | 110 |
tweet2 | 2021-04-09 09:30:00 | 110 |
tweet3 | 2021-04-09 09:45:00 | 120 |
tweet4 | 2021-04-12 09:30:00 | 130 |
tweet5 | 2021-04-12 09:30:00 | 130 |
This is what I have done so far. It does not cover tweets posted on weekends or on Friday after 16:00:
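import pandas as pd

def time_map(row):
    new_col_tmp = row['new_col_tmp']  # rounded time as 'HHMM'
    new_col2 = row['new_col2']        # rounded date as 'YYYYMMDD'
    if 930 <= int(new_col_tmp) <= 1600:
        # inside trading hours: keep the rounded time
        convert = new_col2 + " " + new_col_tmp
    elif int(new_col_tmp) < 930 or int(new_col_tmp) > 1600:
        # outside trading hours: use 09:30
        convert = new_col2 + " " + '0930'
    return convert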
tweet['new_col1'] = tweet.created_at.round("5min").dt.strftime('%H:%M')
tweet['new_col_tmp'] = tweet.created_at.round("5min").dt.strftime('%H%M')
tweet['new_col2'] = tweet.created_at.round("5min").dt.strftime('%Y%m%d')
tweet['convert'] = tweet.apply(lambda row: time_map(row), axis=1)
tweet['convert'] = pd.to_datetime(tweet.convert, format='%Y%m%d %H%M').dt.strftime('%d/%m/%Y %H:%M')
How could I solve the case for weekends? Any help is appreciated!
CodePudding user response:
Given df and df2:
    tweet                time
0  tweet1 2021-04-08 18:42:48
1  tweet2 2021-04-09 00:42:48
2  tweet3 2021-04-09 09:42:48
3  tweet4 2021-04-09 17:42:48
4  tweet5 2021-04-10 11:42:48

   price                time
0    100 2021-04-08 16:00:00
1    110 2021-04-09 09:30:00
2    120 2021-04-09 09:45:00
3    130 2021-04-12 09:30:00
Doing:
# Outer merge the two dataframes on time.
# Sort this new dataframe by time.
out = df.merge(df2, on='time', how='outer').sort_values('time')
# Anywhere that price is NaN came from df.
# Now that it's sorted, we want to remove these time values.
out.loc[out.price.isna(), 'time'] = pd.NaT
# Now, we backfill the data, drop rows with nans,
# and keep only the last row of each tweet.
out = out.bfill().dropna().drop_duplicates('tweet', keep='last')
print(out)
Output:
                 time   tweet  price
0 2021-04-09 09:30:00  tweet1  110.0
1 2021-04-09 09:30:00  tweet2  110.0
2 2021-04-09 09:45:00  tweet3  120.0
3 2021-04-12 09:30:00  tweet4  130.0
4 2021-04-12 09:30:00  tweet5  130.0
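As a possible alternative to the outer merge and backfill, pandas also has `pd.merge_asof`, which with `direction='forward'` matches each row to the first row on the right whose key is greater than or equal to its own. The sketch below rebuilds the sample data from the question. The names `tweets`, `prices` and `bar_time` are only illustrative, and it assumes both `time` columns are already datetimes. Note that this maps each tweet to the next available price bar rather than rounding to the nearest 5 minutes, so weekends are skipped simply because there are no bars on them.

```python
import pandas as pd

# Sample data from the question (times parsed as datetimes).
tweets = pd.DataFrame({
    "tweet": ["tweet1", "tweet2", "tweet3", "tweet4", "tweet5"],
    "time": pd.to_datetime([
        "2021-04-08 18:42:48",
        "2021-04-09 00:42:48",
        "2021-04-09 09:42:48",
        "2021-04-09 17:42:48",
        "2021-04-10 11:42:48",
    ]),
})
prices = pd.DataFrame({
    "price": [100, 110, 120, 130],
    "time": pd.to_datetime([
        "2021-04-08 16:00:00",
        "2021-04-09 09:30:00",
        "2021-04-09 09:45:00",
        "2021-04-12 09:30:00",
    ]),
})

# merge_asof keeps only the left-hand key, so copy the bar time
# into its own (hypothetical) column before merging.
prices = prices.assign(bar_time=prices["time"])

# Both frames must be sorted by the key for merge_asof.
matched = pd.merge_asof(
    tweets.sort_values("time"),
    prices.sort_values("time"),
    on="time",
    direction="forward",  # first price bar at or after the tweet
)

# Report the matched bar time instead of the raw tweet time.
result = (
    matched[["tweet", "bar_time", "price"]]
    .rename(columns={"bar_time": "time"})
)
print(result)
```

A tweet posted after the last available bar would get NaN for `price` and NaT for `time`, so those rows may need to be dropped or handled separately.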
CodePudding user response:
I created a function to round the time based on your requirement. It only considers weekends here, but I suppose in your real case you may also need to consider holidays, so you may change it accordingly.
import datetime
import pandas as pd

def round_time(timestamp):
    # round to nearest 5 min
    timestamp = timestamp.round('5min')
    # if Friday after 16:00 or weekend: move to next Monday 9:30
    if timestamp.weekday() > 4 or (timestamp.weekday() == 4 and timestamp.time() > pd.to_datetime('16:00').time()):
        timestamp = pd.to_datetime((timestamp.date() + datetime.timedelta(7 - timestamp.weekday())).strftime('%Y-%m-%d') + ' 9:30')
    else:
        # if before 9:30: move to 9:30 on the same day
        if timestamp.time() < pd.to_datetime('9:30').time():
            timestamp = pd.to_datetime(timestamp.date().strftime('%Y-%m-%d') + ' 9:30')
        # if after 16:00: move to 9:30 on the next day
        if timestamp.time() > pd.to_datetime('16:00').time():
            timestamp = pd.to_datetime((timestamp.date() + datetime.timedelta(1)).strftime('%Y-%m-%d') + ' 9:30')
    return timestamp
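If holidays matter, one way to extend the same idea might be `pd.offsets.CustomBusinessDay`, which takes a list of holidays and can roll a date to the next business day. This is only a sketch under assumptions: the function name `next_bar`, the constants `OPEN` and `CLOSE`, and the holiday passed in the usage lines are all made up for illustration, and the holiday list has to be supplied by you.

```python
import pandas as pd

OPEN = pd.Timedelta(hours=9, minutes=30)   # assumed session open
CLOSE = pd.Timedelta(hours=16)             # assumed session close

def next_bar(ts, holidays=()):
    """Map a timestamp to a 5-minute bar inside trading hours (sketch)."""
    # Business-day offset that skips weekends and the given holidays.
    bday = pd.offsets.CustomBusinessDay(holidays=list(holidays))
    ts = pd.Timestamp(ts).round("5min")
    day = ts.normalize()  # midnight of the same date
    # Trading day and not past the close: clamp early times to the open.
    if bday.is_on_offset(day) and ts <= day + CLOSE:
        return max(ts, day + OPEN)
    # Otherwise roll forward to the open of the next trading day.
    return day + bday + OPEN

# Usage with the timestamps from the question.
print(next_bar("2021-04-09 09:42:48"))  # during trading hours on Friday
print(next_bar("2021-04-09 17:42:48"))  # Friday after the close
# Pretend 2021-04-12 is a holiday (made-up date, for illustration only).
print(next_bar("2021-04-09 17:42:48", holidays=["2021-04-12"]))
```

It can be applied to a whole column with something like `tweet['created_at'].apply(next_bar)`, and the result can then be merged with the price data on the time column.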