I have a df with 9 rows of data in three currencies. I want to remove only the rows that have the same Description, the same Currency, the same Price, and the same Ticker.
The end result should remove only the CAD rows 6, 7, and 8, because the Description of the EUR row 2 and the Price of the USD row 5 differ from the rest of their groups. However, when I use drop_duplicates, it removes duplicates like rows 0 and 1, rows 3 and 4, and all of the CAD rows.
Here is my dataframe:
Description Currency Price Ticker
0 Trading EUR 100.32 EQUITIES
1 Trading EUR 100.32 EQUITIES
2 Debt EUR 100.32 EQUITIES
3 Trading USD 98.32 BOND
4 Trading USD 98.32 BOND
5 Trading USD 98.22 BOND
6 Debt CAD 99.35 BOND
7 Debt CAD 99.35 BOND
8 Debt CAD 99.35 BOND
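For reference, here is a runnable version of the setup and the over-aggressive call (assuming keep=False is what produces the behavior described):

```python
import pandas as pd

df = pd.DataFrame({
    "Description": ["Trading", "Trading", "Debt", "Trading", "Trading",
                    "Trading", "Debt", "Debt", "Debt"],
    "Currency": ["EUR", "EUR", "EUR", "USD", "USD", "USD", "CAD", "CAD", "CAD"],
    "Price": [100.32, 100.32, 100.32, 98.32, 98.32, 98.22, 99.35, 99.35, 99.35],
    "Ticker": ["EQUITIES", "EQUITIES", "EQUITIES", "BOND", "BOND",
               "BOND", "BOND", "BOND", "BOND"],
})

# keep=False removes every member of a fully duplicated group, so rows
# 0, 1, 3, 4 and all three CAD rows disappear -- more than intended
print(df.drop_duplicates(keep=False))
```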
CodePudding user response:
Create the df:
import pandas as pd

description = "Trading Trading Debt Trading Trading Trading Debt Debt Debt".split()
currency = "EUR EUR EUR USD USD USD CAD CAD CAD".split()
price = "100.32 100.32 100.32 98.32 98.32 98.22 99.35 99.35 99.35".split()
ticker = "EQUITIES EQUITIES EQUITIES BOND BOND BOND BOND BOND BOND".split()
df = pd.DataFrame({"Description": description, "Currency": currency, "Price": price, "Ticker": ticker})
def hamming_distance(a, b):
    # number of positions in which the two group-key tuples differ
    return sum(x != y for x, y in zip(a, b))

grouped = df.groupby(by=['Description', 'Currency', 'Price', 'Ticker']).groups

for k, v in grouped.items():
    # drop a duplicated group only when no other key differs from it in exactly one position
    # (a key's distance to itself is 0, so comparing k against all keys is safe)
    if (len(v) > 1) and (1 not in [hamming_distance(k, j) for j in grouped.keys()]):
        df = df.drop(index=v)

print(df)
Output:
Description Currency Price Ticker
0 Trading EUR 100.32 EQUITIES
1 Trading EUR 100.32 EQUITIES
2 Debt EUR 100.32 EQUITIES
3 Trading USD 98.32 BOND
4 Trading USD 98.32 BOND
5 Trading USD 98.22 BOND
Explanation:
Before dropping, we have:
for k, v in grouped.items():
    print(k, v)
('Debt', 'CAD', '99.35', 'BOND') Int64Index([6, 7, 8], dtype='int64')
('Debt', 'EUR', '100.32', 'EQUITIES') Int64Index([2], dtype='int64')
('Trading', 'EUR', '100.32', 'EQUITIES') Int64Index([0, 1], dtype='int64')
('Trading', 'USD', '98.22', 'BOND') Int64Index([5], dtype='int64')
('Trading', 'USD', '98.32', 'BOND') Int64Index([3, 4], dtype='int64')
We drop all rows for a key tuple if there does not exist another key tuple that differs from it in only one place (this is an assumption on my part; otherwise it's a bit of a free-for-all).
So, ('Trading', 'USD', '98.22', 'BOND')
and ('Trading', 'USD', '98.32', 'BOND')
"save" each other.
('Debt', 'EUR', '100.32', 'EQUITIES')
and ('Trading', 'EUR', '100.32', 'EQUITIES')
"save" each other.
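The "save" relation can be checked directly by computing the Hamming distance between every pair of group keys (a small standalone sketch reusing the hamming_distance helper):

```python
def hamming_distance(a, b):
    # number of positions in which the two key tuples differ
    return sum(x != y for x, y in zip(a, b))

keys = [
    ("Debt", "CAD", "99.35", "BOND"),
    ("Debt", "EUR", "100.32", "EQUITIES"),
    ("Trading", "EUR", "100.32", "EQUITIES"),
    ("Trading", "USD", "98.22", "BOND"),
    ("Trading", "USD", "98.32", "BOND"),
]

# a key is "saved" when some other key differs from it in exactly one position
for k in keys:
    saved = any(hamming_distance(k, j) == 1 for j in keys if j != k)
    print(k, "saved" if saved else "dropped (if duplicated)")
```

Only the CAD key has no neighbour at distance 1, which is why only its rows are dropped.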
CodePudding user response:
From what I read in the comments, do you mean to drop all rows that have the same value in every column? If so, you can use df.drop_duplicates(keep=False)
. By default, subset covers all columns (subset=None
) and the first duplicated row is kept (keep='first'
)
import pandas as pd
data = {
'Description': ['Trading', 'Trading', 'Debt', 'Trading', 'Trading', 'Trading', 'Debt', 'Debt', 'Debt'],
'Currency': ['EUR', 'EUR', 'EUR', 'USD', 'USD', 'USD', 'CAD', 'CAD', 'CAD'],
'Price': [100.32, 100.32, 100.32, 98.32, 98.32, 98.22, 99.35, 99.35, 99.35],
'Ticker': ['EQUITIES', 'EQUITIES', 'EQUITIES', 'BOND', 'BOND', 'BOND', 'BOND', 'BOND', 'BOND']
}
df = pd.DataFrame.from_dict(data)
Output:
Description Currency Price Ticker
0 Trading EUR 100.32 EQUITIES
1 Trading EUR 100.32 EQUITIES
2 Debt EUR 100.32 EQUITIES
3 Trading USD 98.32 BOND
4 Trading USD 98.32 BOND
5 Trading USD 98.22 BOND
6 Debt CAD 99.35 BOND
7 Debt CAD 99.35 BOND
8 Debt CAD 99.35 BOND
We have:
- rows 0 & 1 have the same info -> remove both
- row 2 has a Description different from rows 0 & 1 -> keep
- rows 3 & 4 have the same info -> remove both
- row 5 has a Price different from rows 3 & 4 -> keep
- rows 6, 7 & 8 have the same info -> remove all

This leaves rows 2 & 5.
df = df.drop_duplicates(keep=False)
resulting in:
Description Currency Price Ticker
2 Debt EUR 100.32 EQUITIES
5 Trading USD 98.22 BOND
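To see how the two keep options differ on this data, here is a small standalone comparison (the df is restated so the sketch runs on its own):

```python
import pandas as pd

data = {
    'Description': ['Trading', 'Trading', 'Debt', 'Trading', 'Trading',
                    'Trading', 'Debt', 'Debt', 'Debt'],
    'Currency': ['EUR', 'EUR', 'EUR', 'USD', 'USD', 'USD', 'CAD', 'CAD', 'CAD'],
    'Price': [100.32, 100.32, 100.32, 98.32, 98.32, 98.22, 99.35, 99.35, 99.35],
    'Ticker': ['EQUITIES', 'EQUITIES', 'EQUITIES', 'BOND', 'BOND',
               'BOND', 'BOND', 'BOND', 'BOND'],
}
df = pd.DataFrame.from_dict(data)

# keep='first' (the default) keeps one representative of each duplicated group
print(list(df.drop_duplicates().index))            # -> [0, 2, 3, 5, 6]

# keep=False drops every member of a duplicated group
print(list(df.drop_duplicates(keep=False).index))  # -> [2, 5]
```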