I'm working on a Pandas df that looks like this:
Start End
0 16360 16362
1 16367 16381
2 16374 16399
3 16401 16413
4 16417 16427
5 16428 16437
6 16435 16441
7 16442 16444
8 16457 16463
In this dataframe, all 'Start'
values of a certain row come after the 'End'
values of the row before (i.e., row 1: 16367
> row 0: 16362
), but this is not true for rows 2
and 6
.
I'd like to make a counter i=0
which length increases for each time this "error" comes (in this case i
becomes i=2
). Something like:
for each Start value of my df:
if the value is < than the End of the row before:
i = i 1
How can I do this with Pandas?
Moreover, I want to make things harder: I'd like to add a 'Length'
column like this:
mydf['Length'] = mydf['End'] - mydf['Start']
To obtain something like this:
Start End Length
0 16360 16362 2
1 16367 16381 4
2 16374 16399 25
3 16401 16413 12
4 16417 16427 10
5 16428 16437 9
6 16435 16441 6
7 16442 16444 2
8 16457 16463 6
Again, for rows 2
and 6
I have the previously described problem. When this problem comes, I'd like to have the 'Length'
column that is not given by 'End'
- 'Start'
anymore but is the result of 'End' (i.e., of row 6)
- 'End' (of row 5)
.
In pseudocode could look like this:
for each Start value of my df:
if the value is < than the End of the row before:
mydf['Length'] = mydf['End'] of the actual row - mydf['End'] of the row before
Thank you!
CodePudding user response:
You can use:
# is the previous End > to the current Start?
m = df['End'].shift().gt(df['Start'])
# propagate error count
df['Error'] = m.cumsum()
# Length = End - Start if no error, else End - previous End
df['Length'] = df['End'].sub(df['Start'].mask(m).fillna(df['End'].shift()))
output:
Start End Error Length
0 16360 16362 0 2.0
1 16367 16381 0 14.0
2 16374 16399 1 18.0
3 16401 16413 1 12.0
4 16417 16427 1 10.0
5 16428 16437 1 9.0
6 16435 16441 2 4.0
7 16442 16444 2 2.0
8 16457 16463 2 6.0
CodePudding user response:
the alternative approach could be:
err = (df.End.shift()-df.Start).mask(lambda x: x<0, pd.NA)
i = err.count()
df['length'] = df.End - df.Start - err.fillna(0)
print(f'{i = }')
print(df)
'''
i = 2
Start End length
0 16360 16362 2.0
1 16367 16381 14.0
2 16374 16399 18.0
3 16401 16413 12.0
4 16417 16427 10.0
5 16428 16437 9.0
6 16435 16441 4.0
7 16442 16444 2.0
8 16457 16463 6.0
for better visualization you can add "err" column:
df['err'] = err
print(df)
'''
Start End length err
0 16360 16362 2.0 NaN
1 16367 16381 14.0 NaN
2 16374 16399 18.0 7.0
3 16401 16413 12.0 NaN
4 16417 16427 10.0 NaN
5 16428 16437 9.0 NaN
6 16435 16441 4.0 2.0
7 16442 16444 2.0 NaN
8 16457 16463 6.0 NaN