I have 2 DataFrame.
import pandas as pd
loc_hour = pd.DataFrame({'id': ['a', 'b', 'c',"d"], 'geohash': ["sybewp", "sws101", "sxk9db","sxr4xt"],"log_date":[20210615,20211219,20210108,20210507],"hour":[12,4,5,19]})
loc_grid = pd.DataFrame({'id': ['a', 'b', 'c',"d"], 'geohash': ["sybewp", "sws101", "sxk9db","sxr4xt"], "gridtype":["Home","Other","Work","Home"]})
These tables contain thousands of different devices.
The loc_hour table contains current location information from devices every hour for a whole year.
In the loc_grid table, there are attributes of the locations where the devices are located. (Home: indicates that the device is at home. Work: indicates that the device is at home.)
What I'm trying to do is calculate how long the devices spend in their home, work or other place as a percentage of the long term (3 months) and short term (1 month).
Desired outputs:
id | home_percent | work_percent | other_percent | total_day_count | total_hour_count | period |
---|---|---|---|---|---|---|
a | 40 | 40 | 20 | 5 | 5 | last 1 month |
b | 50 | 30 | 20 | 10 | 8 | last 1 month |
c | 70 | 20 | 10 | 9 | 7 | last 1 month |
... | ... | ... | ... | ... | ... | ... |
id | home_percent | work_percent | other_percent | total_day_count | total_hour_count | period |
---|---|---|---|---|---|---|
a | 40 | 40 | 20 | 5 | 5 | last 3 month |
b | 50 | 30 | 20 | 10 | 8 | last 3 month |
c | 70 | 20 | 10 | 9 | 7 | last 3 month |
... | ... | ... | ... | ... | ... | ... |
CodePudding user response:
It would have been helpful to provide input data that matches the question. For example, you asked for dataframes with period in last 30 and 90 days but none of your sample data had dates in the last 30 or 90 days.
This is a pretty verbose approach but I did this so you could see what is happening at each step. Hopefully the comments explain:-
# Join the frames together based on the id and geohash columns
hourgrid = pd.merge(loc_hour, loc_grid, how='left', on=['id','geohash'])
# Create a datetime column based on the logdate string
hourgrid['date'] = pd.to_datetime(hourgrid['log_date'], format='%Y%m%d')
current_date = datetime.datetime.now()
# Creates columns to flag if the date is in last 30 or 90 days
hourgrid['last_30_days'] = hourgrid['date'].apply(lambda x: True if x >= (current_date - datetime.timedelta(days=30)) else False)
hourgrid['last_90_days'] = hourgrid['date'].apply(lambda x: True if x >= (current_date - datetime.timedelta(days=90)) else False)
# Create two filtered dataframes
hourgrid_last_30 = hourgrid[hourgrid['last_30_days']==True]
hourgrid_last_90 = hourgrid[hourgrid['last_90_days']==True]
# Group by the ID
hourgrid_last_30_id_groups = hourgrid_last_30.groupby(['id'])
hourgrid_last_90_id_groups = hourgrid_last_90.groupby(['id'])
# Only doing this for 30 days, just repeat for 90 days
final_30_frames = []
for group, frame in hourgrid_last_30_id_groups:
frame['total_hours'] = frame['hour'].sum()
frame['home_hours'] = frame[frame['gridtype']=='Home'].hour.sum()
frame['work_hours'] = frame[frame['gridtype']=='Work'].hour.sum()
frame['other_hours'] = frame[frame['gridtype']=='Other'].hour.sum()
frame['home_percent'] = frame['home_hours'] / frame['total_hours']
frame['work_percent'] = frame['work_hours'] / frame['total_hours']
frame['other_percent'] = frame['other_hours'] / frame['total_hours']
frame['total_day_count'] = frame['date'].nunique()
frame['total_hour_count'] = frame['hour'].sum()
final_30_frames.append(frame)
final_30_df = pd.concat(final_30_frames)
final_30_df = final_30_df[['id', 'home_percent', 'work_percent', 'other_percent', 'total_day_count', 'total_hour_count', 'last_30_days']]
CodePudding user response:
I would merge later pivot and work from there as below:
loc_hour = pd.DataFrame({'id': ['a', 'b', 'c',"d"], 'geohash': ["sybewp", "sws101", "sxk9db","sxr4xt"],"log_date":[20210615,20211219,20210108,20210507],"hour":[12,4,5,19]})
loc_grid = pd.DataFrame({'id': ['a', 'b', 'c',"d"], 'geohash': ["sybewp", "sws101", "sxk9db","sxr4xt"], "gridtype":["Home","Other","Work","Home"]})
df = pd.merge(loc_grid, loc_hour, how='left')
df['log_date'] = pd.to_datetime(df['log_date'], format='%Y%m%d')
df['total_hrs']=df.hour.sum()
total_H=df.hour.sum()
fin_df = df.pivot_table(index=['id','log_date'], columns='gridtype', values = ['hour','total_hrs'],aggfunc=np.sum)
fin_df = fin_df.join(pd.concat({'ratio': fin_df['hour'].div(fin_df['total_hrs'])}, axis=1)).reset_index()
fin_df.columns = ["_".join(a) for a in fin_df.columns.to_flat_index()]
fin_df = fin_df.loc[:,~fin_df.columns.str.startswith('total_hrs_')]