Home > Software engineering >  How to conditionally remove first N rows of a dataframe out of 2 dataframes
How to conditionally remove first N rows of a dataframe out of 2 dataframes

Time:04-28

I have 2 dataframes as following:

d = {'col1': [1, 2, 3, 4], 'col2': ["2010-01-01", "2011-01-01", "2012-01-01", "2013-01-01"]}
df = pd.DataFrame(data=d)
df

    col1    col2
0   1       2010-01-01
1   2       2011-01-01
2   3       2012-01-01
3   4       2013-01-01

The other dataframe may look like the following:

d2 = {'col1': [11, 22, 33, 44], 'col1': ["2011-01-01", "2011-05-01", "2012-02-01", "2012-06-01"]}
df2 = pd.DataFrame(data=d2)
df2

    col1    col2
0   11      2011-01-01
1   22      2011-05-01
2   33      2012-02-01
3   44      2012-06-01

In both dataframes, col2 includes dates (in String format, not as a date object) and these dates are placed in ascending order.

In my use case, both of these dataframes are supposed to start with the same value in col2.

The first col2 value of df is "2010-01-01". The first col2 value of df2 is "2011-01-01". In this particular example, since "2010-01-01" does not exist in df2 and "2011-01-01" is the second row item of col2 in df dataframe, the first row of df needs to be removed so that both dataframes start with the same date String value in col2. So, df is supposed to look as following after the change:

    col1    col2
1   2       2011-01-01
2   3       2012-01-01
3   4       2013-01-01

(Please note that the index is not reset after the change.)

But, this could have been the opposite and we could need to remove row(s) from df2 instead to be able to make these dataframes start with the same value in col2.

And in some cases, we may need to remove multiple rows, not only 1, from the dataframe where we need to remove the rows in case we do not find the match in the second row.

Corner case handling: The logic should also handle such cases (without throwing errors) where it is not possible to make these 2 dataframes start with the same col2 value, in case there is no match. In that case, all rows from both dataframes are supposed to be removed.

Is there an elegant way of developing this logic without writing too many lines of code?

CodePudding user response:

First get minimal datetime in both DataFrames and then filter all rows after this row in both DataFrames (because is possible some rows are removed from df1 or df2) by compare values by both and Series.cummax. If no match both DataFrames are empty.

both = df.merge(df2, on='col2')['col2'].min()
df = df[df['col2'].eq(both).cummax()]
print (df)
   col1        col2
1     2  2011-01-01
2     3  2012-01-01
3     4  2013-01-01

df2 = df2[df2['col2'].eq(both).cummax()]
print (df2)
   col1        col2
0    11  2011-01-01
1    22  2011-05-01
2    33  2012-02-01
3    44  2013-01-01

#no match df2.col2 in df.col2
d2 = {'col1': [11, 22, 33, 44], 'col2': ["2011-01-21", "2011-05-01",
                                         "2012-02-01", "2003-01-01"]}
df2 = pd.DataFrame(data=d2)


both = df.merge(df2, on='col2')['col2'].min()
df = df[df['col2'].eq(both).cummax()]
print (df)
Empty DataFrame
Columns: [col1, col2]
Index: []

df2 = df2[df2['col2'].eq(both).cummax()]
print (df2)
Empty DataFrame
Columns: [col1, col2]
Index: []
  • Related