Home > database >  How to add new rows to a dataframe based on ranges of two columns in the same dataframe?
How to add new rows to a dataframe based on ranges of two columns in the same dataframe?

Time:11-21

I have a dataframe that summarizes the segments of track within a bigger network. These segments have specific segement_ids and it looks like this:

import pandas as pd
import numpy as np
my_dict = {
    'segment_id':['a', 'b', 'c', 'd', 'e'],
    'km_start':[2,4,9,15,20],
    'km_end':[3,7,11,16,22],
    'min_km_start':[0,0,0,0,0],
    'max_km_end':[25,25,25,25,25]
}
df = pd.DataFrame(my_dict)
  segment_id  km_start  km_end  min_km_start  max_km_end
0          a         2       3             0          25
1          b         4       7             0          25
2          c         9      11             0          25
3          d        15      16             0          25
4          e        20      22             0          25

Graphically, I want to do the following:

yikes, image of code

Inessence, I need to find the gaps between the two pairs of columns (['km_start','km_end'] and ['min_km_start','max_km_end']) and fill in the segement_id in a descending order starting from -1.

Here is the desired output:

      segment_id  km_start  km_end  min_km_start  max_km_end
    0         -1         0       2             0          25
    1          a         2       3             0          25
    2         -2         3       4             0          25
    3          b         4       7             0          25
    4         -3         7       9             0          25
    5          c         9      11             0          25
    6         -4        11      15             0          25
    7          d        15      16             0          25
    8         -5        16      20             0          25
    9          e        20      22             0          25
   10         -6        22      25             0          25

CodePudding user response:

Try this

starts = pd.concat([pd.Series(df['min_km_start'].iloc[0]), df['km_end']]).reset_index(drop=True)
ends = pd.concat([df['km_start'], pd.Series(df['max_km_end'].iloc[0])]).reset_index(drop=True)
mask = ~starts.isin(df['km_start'])
pd.concat([df, pd.DataFrame({'km_start': starts[mask], 'km_end': ends[mask], 'segment_id': np.arange(-1,-mask.sum()-1,-1)})]).fillna(method='ffill').sort_values(by='km_start').reset_index(drop=True)

Output

    segment_id  km_start    km_end  min_km_start    max_km_end
0   -1          0           2       0.0             25.0
1   a           2           3       0.0             25.0
2   -2          3           4       0.0             25.0
3   b           4           7       0.0             25.0
4   -3          7           9       0.0             25.0
5   c           9           11      0.0             25.0
6   -4          11          15      0.0             25.0
7   d           15          16      0.0             25.0
8   -5          16          20      0.0             25.0
9   e           20          22      0.0             25.0
10  -6          22          25      0.0             25.0
  • Related