Home > database >  Comparison between Pandas dataframe column values
Comparison between Pandas dataframe column values

Time:07-15

I'm working on a Pandas df that looks like this:

   Start    End
0  16360  16362
1  16367  16381
2  16374  16399
3  16401  16413
4  16417  16427
5  16428  16437
6  16435  16441
7  16442  16444
8  16457  16463

In this dataframe, all 'Start' values of a certain row come after the 'End' values of the row before (i.e., row 1: 16367 > row 0: 16362), but this is not true for rows 2 and 6.

I'd like to make a counter i=0 which length increases for each time this "error" comes (in this case i becomes i=2). Something like:

for each Start value of my df:
    if the value is < than the End of the row before:
        i = i 1

How can I do this with Pandas?

Moreover, I want to make things harder: I'd like to add a 'Length' column like this:

mydf['Length'] = mydf['End'] - mydf['Start']

To obtain something like this:

   Start    End    Length
0  16360  16362    2
1  16367  16381    4
2  16374  16399    25
3  16401  16413    12
4  16417  16427    10
5  16428  16437    9
6  16435  16441    6
7  16442  16444    2
8  16457  16463    6

Again, for rows 2 and 6 I have the previously described problem. When this problem comes, I'd like to have the 'Length'  column that is not given by 'End'  - 'Start'  anymore but is the result of 'End' (i.e., of row 6) - 'End' (of row 5).

In pseudocode could look like this:

for each Start value of my df:
    if the value is < than the End of the row before:
        mydf['Length'] = mydf['End'] of the actual row - mydf['End'] of the row before

Thank you!

CodePudding user response:

You can use:

# is the previous End > to the current Start?
m = df['End'].shift().gt(df['Start'])
# propagate error count
df['Error'] = m.cumsum()
# Length = End - Start if no error, else End - previous End
df['Length'] = df['End'].sub(df['Start'].mask(m).fillna(df['End'].shift()))

output:

   Start    End  Error  Length
0  16360  16362      0     2.0
1  16367  16381      0    14.0
2  16374  16399      1    18.0
3  16401  16413      1    12.0
4  16417  16427      1    10.0
5  16428  16437      1     9.0
6  16435  16441      2     4.0
7  16442  16444      2     2.0
8  16457  16463      2     6.0

CodePudding user response:

the alternative approach could be:

err = (df.End.shift()-df.Start).mask(lambda x: x<0, pd.NA)
i = err.count()

df['length'] = df.End - df.Start - err.fillna(0)

print(f'{i = }')
print(df)
'''
i = 2
   Start    End  length
0  16360  16362     2.0
1  16367  16381    14.0
2  16374  16399    18.0
3  16401  16413    12.0
4  16417  16427    10.0
5  16428  16437     9.0
6  16435  16441     4.0
7  16442  16444     2.0
8  16457  16463     6.0

for better visualization you can add "err" column:

df['err'] = err
print(df)
'''
   Start    End  length  err
0  16360  16362     2.0  NaN
1  16367  16381    14.0  NaN
2  16374  16399    18.0  7.0
3  16401  16413    12.0  NaN
4  16417  16427    10.0  NaN
5  16428  16437     9.0  NaN
6  16435  16441     4.0  2.0
7  16442  16444     2.0  NaN
8  16457  16463     6.0  NaN
  • Related