Home > Back-end >  In pandas, filter for duplicate values appearing in 1 of 2 different columns, for list of certain va
In pandas, filter for duplicate values appearing in 1 of 2 different columns, for list of certain va

Time:01-16

zed = pd.DataFrame(data = { 'date': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05'], 'a': [1, 5, 7, 3, 4], 'b': [3, 4, 9, 12, 5] })

How can the following dataframe be filtered to keep the earliest row (earliest == lowest date) for each of the 3 values 1, 5, 4 appearing in either column a or column b? In this example, the rows with dates '2022-03-01', '2022-03-02' would be kept as they are the lowest dates where each of the 3 values appears?

We have tried zed[zed.isin({'a': [1, 5, 4], 'b': [1, 5, 4]}).any(1)].sort_values(by=['date']) but this returns the incorrect result as it returns 3 rows.

CodePudding user response:

Without reshape your dataframe, you can use:

idx = max([zed[['a', 'b']].eq(i).sum(axis=1).idxmax() for i in [1, 5, 4]])
out = zed.loc[:idx]

Output:

>>> out
         date  a  b
0  2022-03-01  1  3
1  2022-03-02  5  4

CodePudding user response:

You can reshape by DataFrame.stack, so possible filterin gby list with remove duplicates:

s = zed.set_index('date')[['a','b']].stack()
idx = s[s.isin([1, 5, 4])].drop_duplicates().index.remove_unused_levels().levels[0]

print (idx)
Index(['2022-03-01', '2022-03-02'], dtype='object', name='date')

out = zed[zed['date'].isin(idx)]
print (out)
         date  a  b
0  2022-03-01  1  3
1  2022-03-02  5  4

Or filter first index value matching conditions, get unique values and select rows by DataFrame.loc:

L = [1, 5, 4]
idx = pd.unique([y for x in L for y in zed[zed[['a', 'b']].eq(x).any(axis=1)].index[:1]])

df = zed.loc[idx]
print (df)
         date  a  b
0  2022-03-01  1  3
1  2022-03-02  5  4
  • Related