I am using the yelp dataset to get the hours multiple businesses are open. The time range is put in a list full of dictionaries for each company as shown below.
{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', 'Wednesday': '8:0-18:30', 'Thursday': '8:0-18:30', 'Friday': '8:0-18:30', 'Saturday': '8:0-14:0'}
{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', 'Wednesday': '8:0-22:0', 'Thursday': '8:0-22:0', 'Friday': '8:0-23:0', 'Saturday': '8:0-23:0', 'Sunday': '8:0-22:0'}
{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', 'Wednesday': '7:0-20:0', 'Thursday': '7:0-20:0', 'Friday': '7:0-21:0', 'Saturday': '7:0-21:0', 'Sunday': '7:0-21:0'}
{'Wednesday': '14:0-22:0', 'Thursday': '16:0-22:0', 'Friday': '12:0-22:0', 'Saturday': '12:0-22:0', 'Sunday': '12:0-18:0'}
{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', 'Wednesday': '6:0-22:0', 'Thursday': '6:0-22:0', 'Friday': '9:0-0:0', 'Saturday': '9:0-22:0', 'Sunday': '8:0-22:0'}
{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', 'Wednesday': '10:0-18:0', 'Thursday': '10:0-18:0', 'Friday': '10:0-18:0', 'Saturday': '10:0-18:0', 'Sunday': '12:0-18:0'}
{'Monday': '9:0-17:0', 'Tuesday': '9:0-17:0', 'Wednesday': '9:0-17:0', 'Thursday': '9:0-17:0', 'Friday': '9:0-17:0'}
None
{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', 'Wednesday': '6:0-21:0', 'Thursday': '6:0-16:0', 'Friday': '6:0-16:0', 'Saturday': '6:0-17:0', 'Sunday': '6:0-21:0'}
This goes on for 150,000 elements.
As you can see, one element does not have any information and says "None" instead. I used the dropna() command to clear those. However, it leaves a gap in the list and disrupts a for loop I have that calculates the hours.
Here is an small example to explain what I mean.
The table starts as this
index,0
0,0.0
1,1.0
2,3.0
3,NaN
4,4.0
5,5.0.
and changes to
index,0
0,0.0
1,1.0
2,3.0
4,4.0
5,5.0
after using dropna()
As you can see, the table skips from 2 to 4.
In my for loop I made it calculate the total time for each week in a range(1-150000), but since there is no row 8 and other rows, it throws an error and stops the loop.
So... My question is, how can I run my for loop such that is skips over these non existent rows.
Also, here is my code:
df_new = df_hours.dropna()
for i in range(1,150000):
dc = df_new[i]
print(dc)
sum_elapsed = 0
for _, v in dc.items():
start, end = v.split('-')
hhs, mms = (int(v) for v in start.split(':'))
hhe, mme = (int(v) for v in end.split(':'))
elapsed = (hhe * 60 mme) - (hhs * 60 mms)
sum_elapsed = elapsed
print(sum_elapsed)
CodePudding user response:
Given a list of dict:
data = [{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', 'Wednesday': '8:0-18:30', 'Thursday': '8:0-18:30', 'Friday': '8:0-18:30', 'Saturday': '8:0-14:0'},
{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', 'Wednesday': '8:0-22:0', 'Thursday': '8:0-22:0', 'Friday': '8:0-23:0', 'Saturday': '8:0-23:0', 'Sunday': '8:0-22:0'},
{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', 'Wednesday': '7:0-20:0', 'Thursday': '7:0-20:0', 'Friday': '7:0-21:0', 'Saturday': '7:0-21:0', 'Sunday': '7:0-21:0'},
{'Wednesday': '14:0-22:0', 'Thursday': '16:0-22:0', 'Friday': '12:0-22:0', 'Saturday': '12:0-22:0', 'Sunday': '12:0-18:0'},
{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', 'Wednesday': '6:0-22:0', 'Thursday': '6:0-22:0', 'Friday': '9:0-0:0', 'Saturday': '9:0-22:0', 'Sunday': '8:0-22:0'},
{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', 'Wednesday': '10:0-18:0', 'Thursday': '10:0-18:0', 'Friday': '10:0-18:0', 'Saturday': '10:0-18:0', 'Sunday': '12:0-18:0'},
{'Monday': '9:0-17:0', 'Tuesday': '9:0-17:0', 'Wednesday': '9:0-17:0', 'Thursday': '9:0-17:0', 'Friday': '9:0-17:0'},
None,
{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', 'Wednesday': '6:0-21:0', 'Thursday': '6:0-16:0', 'Friday': '6:0-16:0', 'Saturday': '6:0-17:0', 'Sunday': '6:0-21:0'}]
Doing:
df = pd.json_normalize(data)
print(df)
Output:
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 0:0-0:0 8:0-18:30 8:0-18:30 8:0-18:30 8:0-18:30 8:0-14:0 NaN
1 8:0-22:0 8:0-22:0 8:0-22:0 8:0-22:0 8:0-23:0 8:0-23:0 8:0-22:0
2 7:0-20:0 7:0-20:0 7:0-20:0 7:0-20:0 7:0-21:0 7:0-21:0 7:0-21:0
3 NaN NaN 14:0-22:0 16:0-22:0 12:0-22:0 12:0-22:0 12:0-18:0
4 0:0-0:0 6:0-22:0 6:0-22:0 6:0-22:0 9:0-0:0 9:0-22:0 8:0-22:0
5 0:0-0:0 10:0-18:0 10:0-18:0 10:0-18:0 10:0-18:0 10:0-18:0 12:0-18:0
6 9:0-17:0 9:0-17:0 9:0-17:0 9:0-17:0 9:0-17:0 NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN
8 0:0-0:0 6:0-21:0 6:0-21:0 6:0-16:0 6:0-16:0 6:0-17:0 6:0-21:0
Removing NaN rows, and resetting the index:
df = df.dropna(how='all').reset_index(drop=True)
print(df)
Output:
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 0:0-0:0 8:0-18:30 8:0-18:30 8:0-18:30 8:0-18:30 8:0-14:0 NaN
1 8:0-22:0 8:0-22:0 8:0-22:0 8:0-22:0 8:0-23:0 8:0-23:0 8:0-22:0
2 7:0-20:0 7:0-20:0 7:0-20:0 7:0-20:0 7:0-21:0 7:0-21:0 7:0-21:0
3 NaN NaN 14:0-22:0 16:0-22:0 12:0-22:0 12:0-22:0 12:0-18:0
4 0:0-0:0 6:0-22:0 6:0-22:0 6:0-22:0 9:0-0:0 9:0-22:0 8:0-22:0
5 0:0-0:0 10:0-18:0 10:0-18:0 10:0-18:0 10:0-18:0 10:0-18:0 12:0-18:0
6 9:0-17:0 9:0-17:0 9:0-17:0 9:0-17:0 9:0-17:0 NaN NaN
7 0:0-0:0 6:0-21:0 6:0-21:0 6:0-16:0 6:0-16:0 6:0-17:0 6:0-21:0
Doing your function correctly:
def myFunc(v):
if isinstance(v, str):
start, end = v.split('-')
hhs, mms = (int(v) for v in start.split(':'))
hhe, mme = (int(v) for v in end.split(':'))
return (hhe * 60 mme) - (hhs * 60 mms)
else:
return v
df = df.applymap(myFunc)
Output:
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 0.0 630.0 630.0 630.0 630.0 360.0 NaN
1 840.0 840.0 840.0 840.0 900.0 900.0 840.0
2 780.0 780.0 780.0 780.0 840.0 840.0 840.0
3 NaN NaN 480.0 360.0 600.0 600.0 360.0
4 0.0 960.0 960.0 960.0 -540.0 780.0 840.0
5 0.0 480.0 480.0 480.0 480.0 480.0 360.0
6 480.0 480.0 480.0 480.0 480.0 NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN
8 0.0 900.0 900.0 600.0 600.0 660.0 900.0
Total sum:
>>> df.sum().sum()
30600.0
Vectorized(ish), and working with timedelta's instead of re-inventing the wheel:
for col in df:
x = df[col].str.extract('(\d*):(\d*)-(\d*):(\d*)').astype(float)
h = x.loc[:, ::2].apply(pd.to_timedelta, unit='h')
m = x.loc[:, 1::2].apply(pd.to_timedelta, unit='m')
x = h m.values
df[col] = x[2] - x[0]
>>> df.sum().sum()
Timedelta('21 days 06:00:00')
>>> df.sum().sum().total_seconds() // 60
30600.0
>>> df
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 0 days 00:00:00 0 days 10:30:00 0 days 10:30:00 0 days 10:30:00 0 days 10:30:00 0 days 06:00:00 NaT
1 0 days 14:00:00 0 days 14:00:00 0 days 14:00:00 0 days 14:00:00 0 days 15:00:00 0 days 15:00:00 0 days 14:00:00
2 0 days 13:00:00 0 days 13:00:00 0 days 13:00:00 0 days 13:00:00 0 days 14:00:00 0 days 14:00:00 0 days 14:00:00
3 NaT NaT 0 days 08:00:00 0 days 06:00:00 0 days 10:00:00 0 days 10:00:00 0 days 06:00:00
4 0 days 00:00:00 0 days 16:00:00 0 days 16:00:00 0 days 16:00:00 -1 days 15:00:00 0 days 13:00:00 0 days 14:00:00
5 0 days 00:00:00 0 days 08:00:00 0 days 08:00:00 0 days 08:00:00 0 days 08:00:00 0 days 08:00:00 0 days 06:00:00
6 0 days 08:00:00 0 days 08:00:00 0 days 08:00:00 0 days 08:00:00 0 days 08:00:00 NaT NaT
7 0 days 00:00:00 0 days 15:00:00 0 days 15:00:00 0 days 10:00:00 0 days 10:00:00 0 days 11:00:00 0 days 15:00:00
CodePudding user response:
First of all, df_new[i]
indexes df_new
along the columns, not the index (as it seems you want). Instead, use df_new.loc[i]
, which does index along the columns. Additionally, you can use a try
/except
statement to catch any raised KeyError
exceptions.
df_new = df_hours.dropna()
for i in range(1,150000):
try:
dc = df_new.loc[i]
except KeyError:
continue
...
However, you should just iterate over the index
of the DataFrame
, which avoids any KeyError
exceptions:
df_new = df_hours.dropna()
for i in df_new.index:
...