How to move all list elements up after removing empty elements-CodePudding

I am using the yelp dataset to get the hours multiple businesses are open. The time range is put in a list full of dictionaries for each company as shown below.

{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', 'Wednesday': '8:0-18:30', 'Thursday': '8:0-18:30', 'Friday': '8:0-18:30', 'Saturday': '8:0-14:0'}
{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', 'Wednesday': '8:0-22:0', 'Thursday': '8:0-22:0', 'Friday': '8:0-23:0', 'Saturday': '8:0-23:0', 'Sunday': '8:0-22:0'}
{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', 'Wednesday': '7:0-20:0', 'Thursday': '7:0-20:0', 'Friday': '7:0-21:0', 'Saturday': '7:0-21:0', 'Sunday': '7:0-21:0'}
{'Wednesday': '14:0-22:0', 'Thursday': '16:0-22:0', 'Friday': '12:0-22:0', 'Saturday': '12:0-22:0', 'Sunday': '12:0-18:0'}
{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', 'Wednesday': '6:0-22:0', 'Thursday': '6:0-22:0', 'Friday': '9:0-0:0', 'Saturday': '9:0-22:0', 'Sunday': '8:0-22:0'}
{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', 'Wednesday': '10:0-18:0', 'Thursday': '10:0-18:0', 'Friday': '10:0-18:0', 'Saturday': '10:0-18:0', 'Sunday': '12:0-18:0'}
{'Monday': '9:0-17:0', 'Tuesday': '9:0-17:0', 'Wednesday': '9:0-17:0', 'Thursday': '9:0-17:0', 'Friday': '9:0-17:0'}
None
{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', 'Wednesday': '6:0-21:0', 'Thursday': '6:0-16:0', 'Friday': '6:0-16:0', 'Saturday': '6:0-17:0', 'Sunday': '6:0-21:0'}

This goes on for 150,000 elements.

As you can see, one element does not have any information and says "None" instead. I used the dropna() command to clear those. However, it leaves a gap in the list and disrupts a for loop I have that calculates the hours.

Here is an small example to explain what I mean.

The table starts as this

index,0
0,0.0
1,1.0
2,3.0
3,NaN
4,4.0
5,5.0.

and changes to

index,0
0,0.0
1,1.0
2,3.0
4,4.0
5,5.0

after using dropna()

As you can see, the table skips from 2 to 4.

In my for loop I made it calculate the total time for each week in a range(1-150000), but since there is no row 8 and other rows, it throws an error and stops the loop.

So... My question is, how can I run my for loop such that is skips over these non existent rows.

Also, here is my code:

df_new = df_hours.dropna()

for i in range(1,150000):

  dc = df_new[i]
  print(dc)
  sum_elapsed = 0
  for _, v in dc.items():
      start, end = v.split('-')
      hhs, mms = (int(v) for v in start.split(':'))
      hhe, mme = (int(v) for v in end.split(':'))
      elapsed = (hhe * 60   mme) - (hhs * 60   mms) 
      sum_elapsed  = elapsed

  print(sum_elapsed)

CodePudding user response：

Given a list of dict:

data = [{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', 'Wednesday': '8:0-18:30', 'Thursday': '8:0-18:30', 'Friday': '8:0-18:30', 'Saturday': '8:0-14:0'},
{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', 'Wednesday': '8:0-22:0', 'Thursday': '8:0-22:0', 'Friday': '8:0-23:0', 'Saturday': '8:0-23:0', 'Sunday': '8:0-22:0'},
{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', 'Wednesday': '7:0-20:0', 'Thursday': '7:0-20:0', 'Friday': '7:0-21:0', 'Saturday': '7:0-21:0', 'Sunday': '7:0-21:0'},
{'Wednesday': '14:0-22:0', 'Thursday': '16:0-22:0', 'Friday': '12:0-22:0', 'Saturday': '12:0-22:0', 'Sunday': '12:0-18:0'},
{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', 'Wednesday': '6:0-22:0', 'Thursday': '6:0-22:0', 'Friday': '9:0-0:0', 'Saturday': '9:0-22:0', 'Sunday': '8:0-22:0'},
{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', 'Wednesday': '10:0-18:0', 'Thursday': '10:0-18:0', 'Friday': '10:0-18:0', 'Saturday': '10:0-18:0', 'Sunday': '12:0-18:0'},
{'Monday': '9:0-17:0', 'Tuesday': '9:0-17:0', 'Wednesday': '9:0-17:0', 'Thursday': '9:0-17:0', 'Friday': '9:0-17:0'},
None,
{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', 'Wednesday': '6:0-21:0', 'Thursday': '6:0-16:0', 'Friday': '6:0-16:0', 'Saturday': '6:0-17:0', 'Sunday': '6:0-21:0'}]

Doing:

df = pd.json_normalize(data)
print(df)

Output:

     Monday    Tuesday  Wednesday   Thursday     Friday   Saturday     Sunday
0   0:0-0:0  8:0-18:30  8:0-18:30  8:0-18:30  8:0-18:30   8:0-14:0        NaN
1  8:0-22:0   8:0-22:0   8:0-22:0   8:0-22:0   8:0-23:0   8:0-23:0   8:0-22:0
2  7:0-20:0   7:0-20:0   7:0-20:0   7:0-20:0   7:0-21:0   7:0-21:0   7:0-21:0
3       NaN        NaN  14:0-22:0  16:0-22:0  12:0-22:0  12:0-22:0  12:0-18:0
4   0:0-0:0   6:0-22:0   6:0-22:0   6:0-22:0    9:0-0:0   9:0-22:0   8:0-22:0
5   0:0-0:0  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0  12:0-18:0
6  9:0-17:0   9:0-17:0   9:0-17:0   9:0-17:0   9:0-17:0        NaN        NaN
7       NaN        NaN        NaN        NaN        NaN        NaN        NaN
8   0:0-0:0   6:0-21:0   6:0-21:0   6:0-16:0   6:0-16:0   6:0-17:0   6:0-21:0

Removing NaN rows, and resetting the index:

df = df.dropna(how='all').reset_index(drop=True)
print(df)

Output:

     Monday    Tuesday  Wednesday   Thursday     Friday   Saturday     Sunday
0   0:0-0:0  8:0-18:30  8:0-18:30  8:0-18:30  8:0-18:30   8:0-14:0        NaN
1  8:0-22:0   8:0-22:0   8:0-22:0   8:0-22:0   8:0-23:0   8:0-23:0   8:0-22:0
2  7:0-20:0   7:0-20:0   7:0-20:0   7:0-20:0   7:0-21:0   7:0-21:0   7:0-21:0
3       NaN        NaN  14:0-22:0  16:0-22:0  12:0-22:0  12:0-22:0  12:0-18:0
4   0:0-0:0   6:0-22:0   6:0-22:0   6:0-22:0    9:0-0:0   9:0-22:0   8:0-22:0
5   0:0-0:0  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0  10:0-18:0  12:0-18:0
6  9:0-17:0   9:0-17:0   9:0-17:0   9:0-17:0   9:0-17:0        NaN        NaN
7   0:0-0:0   6:0-21:0   6:0-21:0   6:0-16:0   6:0-16:0   6:0-17:0   6:0-21:0

Doing your function correctly:

def myFunc(v):
    if isinstance(v, str):
        start, end = v.split('-')
        hhs, mms = (int(v) for v in start.split(':'))
        hhe, mme = (int(v) for v in end.split(':'))
        return (hhe * 60   mme) - (hhs * 60   mms)
    else:
        return v 

df = df.applymap(myFunc)

Output:

   Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
0     0.0    630.0      630.0     630.0   630.0     360.0     NaN
1   840.0    840.0      840.0     840.0   900.0     900.0   840.0
2   780.0    780.0      780.0     780.0   840.0     840.0   840.0
3     NaN      NaN      480.0     360.0   600.0     600.0   360.0
4     0.0    960.0      960.0     960.0  -540.0     780.0   840.0
5     0.0    480.0      480.0     480.0   480.0     480.0   360.0
6   480.0    480.0      480.0     480.0   480.0       NaN     NaN
7     NaN      NaN        NaN       NaN     NaN       NaN     NaN
8     0.0    900.0      900.0     600.0   600.0     660.0   900.0

Total sum:

>>> df.sum().sum()
30600.0

Vectorized(ish), and working with timedelta's instead of re-inventing the wheel:

for col in df:
    x = df[col].str.extract('(\d*):(\d*)-(\d*):(\d*)').astype(float)
    h = x.loc[:, ::2].apply(pd.to_timedelta, unit='h')
    m = x.loc[:, 1::2].apply(pd.to_timedelta, unit='m')
    x = h   m.values
    df[col] = x[2] - x[0]

>>> df.sum().sum()
Timedelta('21 days 06:00:00')

>>> df.sum().sum().total_seconds() // 60
30600.0

>>> df
           Monday         Tuesday       Wednesday        Thursday            Friday        Saturday          Sunday
0 0 days 00:00:00 0 days 10:30:00 0 days 10:30:00 0 days 10:30:00   0 days 10:30:00 0 days 06:00:00             NaT
1 0 days 14:00:00 0 days 14:00:00 0 days 14:00:00 0 days 14:00:00   0 days 15:00:00 0 days 15:00:00 0 days 14:00:00
2 0 days 13:00:00 0 days 13:00:00 0 days 13:00:00 0 days 13:00:00   0 days 14:00:00 0 days 14:00:00 0 days 14:00:00
3             NaT             NaT 0 days 08:00:00 0 days 06:00:00   0 days 10:00:00 0 days 10:00:00 0 days 06:00:00
4 0 days 00:00:00 0 days 16:00:00 0 days 16:00:00 0 days 16:00:00 -1 days  15:00:00 0 days 13:00:00 0 days 14:00:00
5 0 days 00:00:00 0 days 08:00:00 0 days 08:00:00 0 days 08:00:00   0 days 08:00:00 0 days 08:00:00 0 days 06:00:00
6 0 days 08:00:00 0 days 08:00:00 0 days 08:00:00 0 days 08:00:00   0 days 08:00:00             NaT             NaT
7 0 days 00:00:00 0 days 15:00:00 0 days 15:00:00 0 days 10:00:00   0 days 10:00:00 0 days 11:00:00 0 days 15:00:00

CodePudding user response：

First of all, df_new[i] indexes df_new along the columns, not the index (as it seems you want). Instead, use df_new.loc[i], which does index along the columns. Additionally, you can use a try/except statement to catch any raised KeyError exceptions.

df_new = df_hours.dropna()

for i in range(1,150000):
    try:
        dc = df_new.loc[i]
    except KeyError:
        continue
    ...

However, you should just iterate over the index of the DataFrame, which avoids any KeyError exceptions:

df_new = df_hours.dropna()

for i in df_new.index:
    ...