Filling DataFrame with missing ranges present in two columns

Time:03-15

I have a DataFrame whose start and end Timestamp columns define ranges for audio clipping. It can be generated like this:

import pandas as pd

df = pd.DataFrame( 
{'start': 
  {0: pd.Timestamp('1900-01-01 00:00:14.373000'), 1: pd.Timestamp('1900-01-01 00:00:16.342000'),2: pd.Timestamp('1900-01-01 00:00:18.743000'), 3: pd.Timestamp('1900-01-01 00:00:21.383000'), 4: pd.Timestamp('1900-01-01 00:00:22.812000')}, 
'end': 
  {0: pd.Timestamp('1900-01-01 00:00:16.342000'), 1: pd.Timestamp('1900-01-01 00:00:18.543000'), 2: pd.Timestamp('1900-01-01 00:00:20.712000'), 3: pd.Timestamp('1900-01-01 00:00:22.482000'), 4: pd.Timestamp('1900-01-01 00:00:24.653000')}})
                      start                     end
0   1900-01-01 00:00:14.373 1900-01-01 00:00:16.342
1   1900-01-01 00:00:16.342 1900-01-01 00:00:18.543
2   1900-01-01 00:00:18.743 1900-01-01 00:00:20.712
3   1900-01-01 00:00:21.383 1900-01-01 00:00:22.482
4   1900-01-01 00:00:22.812 1900-01-01 00:00:24.653

I want to generate a DataFrame that also contains the start and end timestamps of the ranges missing from it, i.e. the ranges not covered by these entries. So something like this:

pd.DataFrame( 
{'start': 
  {0: pd.Timestamp('1900-01-01 00:00:00.000000'), 1: pd.Timestamp('1900-01-01 00:00:14.373000'), 2: pd.Timestamp('1900-01-01 00:00:16.342000'), 3: pd.Timestamp('1900-01-01 00:00:18.543000'), 4: pd.Timestamp('1900-01-01 00:00:20.712000'), 5: pd.Timestamp('1900-01-01 00:00:21.383000'), 6: pd.Timestamp('1900-01-01 00:00:22.482000'), 7: pd.Timestamp('1900-01-01 00:00:22.812000')}, 
'end': 
  {0: pd.Timestamp('1900-01-01 00:00:14.373000'), 1: pd.Timestamp('1900-01-01 00:00:16.342000'), 2: pd.Timestamp('1900-01-01 00:00:18.543000'), 3: pd.Timestamp('1900-01-01 00:00:20.712000'), 4: pd.Timestamp('1900-01-01 00:00:21.383000'), 5: pd.Timestamp('1900-01-01 00:00:22.482000'), 6: pd.Timestamp('1900-01-01 00:00:22.812000'), 7: pd.Timestamp('1900-01-01 00:00:24.653000')}})
                      start                     end
0   1900-01-01 00:00:00.000 1900-01-01 00:00:14.373
1   1900-01-01 00:00:14.373 1900-01-01 00:00:16.342
2   1900-01-01 00:00:16.342 1900-01-01 00:00:18.543
3   1900-01-01 00:00:18.543 1900-01-01 00:00:20.712
4   1900-01-01 00:00:20.712 1900-01-01 00:00:21.383
5   1900-01-01 00:00:21.383 1900-01-01 00:00:22.482
6   1900-01-01 00:00:22.482 1900-01-01 00:00:22.812
7   1900-01-01 00:00:22.812 1900-01-01 00:00:24.653

I couldn't come up with any feasible solution other than iterating over individual rows. What would be the best way to do this?

CodePudding user response:

IIUC, you could get all unique timestamps and generate a new dataframe from the shifted values:

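import numpy as np
# Unique boundaries in row order (relies on df already being sorted by time)
vals = df[['start', 'end']].stack().unique()
# Prepend integer 0, i.e. the Unix epoch (1970-01-01), as the very first start
vals2 = np.concatenate([np.array([0], dtype=vals.dtype), vals])
df2 = pd.DataFrame(zip(vals2, vals), columns=['start', 'end'])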

output:

                    start                     end
0 1970-01-01 00:00:00.000 1900-01-01 00:00:14.373
1 1900-01-01 00:00:14.373 1900-01-01 00:00:16.342
2 1900-01-01 00:00:16.342 1900-01-01 00:00:18.543
3 1900-01-01 00:00:18.543 1900-01-01 00:00:18.743
4 1900-01-01 00:00:18.743 1900-01-01 00:00:20.712
5 1900-01-01 00:00:20.712 1900-01-01 00:00:21.383
6 1900-01-01 00:00:21.383 1900-01-01 00:00:22.482
7 1900-01-01 00:00:22.482 1900-01-01 00:00:22.812
8 1900-01-01 00:00:22.812 1900-01-01 00:00:24.653
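
Note that the first row above starts at 1970-01-01, because integer 0 is interpreted as the Unix epoch, while the question's data sits on 1900-01-01. Below is a minimal sketch of one possible variant, not part of the original answer. It assumes the first range should begin at midnight of the earliest timestamp's date, and it sorts the boundaries so the input row order does not matter. The names `edges`, `origin`, `filled` and `is_gap` are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "start": pd.to_datetime([
        "1900-01-01 00:00:14.373", "1900-01-01 00:00:16.342",
        "1900-01-01 00:00:18.743", "1900-01-01 00:00:21.383",
        "1900-01-01 00:00:22.812",
    ]),
    "end": pd.to_datetime([
        "1900-01-01 00:00:16.342", "1900-01-01 00:00:18.543",
        "1900-01-01 00:00:20.712", "1900-01-01 00:00:22.482",
        "1900-01-01 00:00:24.653",
    ]),
})

# Every distinct boundary, sorted chronologically
edges = (
    pd.concat([df["start"], df["end"]])
    .drop_duplicates()
    .sort_values(ignore_index=True)
)

# Assumption: the timeline begins at midnight of the earliest timestamp's date
origin = edges.iloc[0].normalize()
if origin < edges.iloc[0]:
    edges = pd.concat([pd.Series([origin]), edges], ignore_index=True)

# Pair each boundary with the next one to get contiguous ranges
filled = pd.DataFrame({
    "start": edges.iloc[:-1].to_numpy(),
    "end": edges.iloc[1:].to_numpy(),
})

# Optional: flag the ranges that were not present in the original frame
merged = filled.merge(df, on=["start", "end"], how="left", indicator=True)
filled["is_gap"] = (merged["_merge"] == "left_only").to_numpy()

print(filled)
```

On the sample data this gives the same nine ranges as the output above, except that the first one starts at 1900-01-01 00:00:00, and four of them are flagged as gaps. If the timeline should start somewhere else, only the `origin` line would need to change.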