I have a df with 9 rows of data in three currencies. I want to remove only the rows that have the same Description, the same Currency, the same Price, and the same Ticker.
The end result should remove only the CAD rows 6, 7, and 8, because the Description of the EUR row 2 and the Price of the USD row 5 differ from the rest of their groups. However, when I use drop_duplicates, it removes duplicates like rows 0 and 1, rows 3 and 4, and all of the CAD rows.
Here is my dataframe:
Description Currency Price Ticker
0 Trading EUR 100.32 EQUITIES
1 Trading EUR 100.32 EQUITIES
2 Debt EUR 100.32 EQUITIES
3 Trading USD 98.32 BOND
4 Trading USD 98.32 BOND
5 Trading USD 98.22 BOND
6 Debt CAD 99.35 BOND
7 Debt CAD 99.35 BOND
8 Debt CAD 99.35 BOND
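For reference, here is a runnable version of the setup and the over-aggressive call (assuming keep=False is what produces the behavior described):

```python
import pandas as pd

df = pd.DataFrame({
    "Description": ["Trading", "Trading", "Debt", "Trading", "Trading",
                    "Trading", "Debt", "Debt", "Debt"],
    "Currency": ["EUR", "EUR", "EUR", "USD", "USD", "USD", "CAD", "CAD", "CAD"],
    "Price": [100.32, 100.32, 100.32, 98.32, 98.32, 98.22, 99.35, 99.35, 99.35],
    "Ticker": ["EQUITIES", "EQUITIES", "EQUITIES", "BOND", "BOND",
               "BOND", "BOND", "BOND", "BOND"],
})

# keep=False removes every member of a fully duplicated group, so rows
# 0, 1, 3, 4 and all three CAD rows disappear -- more than intended
print(df.drop_duplicates(keep=False))
```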
CodePudding user response:
Create the df:
import pandas as pd

description = "Trading Trading Debt Trading Trading Trading Debt Debt Debt".split()
currency = "EUR EUR EUR USD USD USD CAD CAD CAD".split()
price = "100.32 100.32 100.32 98.32 98.32 98.22 99.35 99.35 99.35".split()
ticker = "EQUITIES EQUITIES EQUITIES BOND BOND BOND BOND BOND BOND".split()
df = pd.DataFrame({"Description": description, "Currency": currency, "Price": price, "Ticker": ticker})
def hamming_distance(a, b):
    # number of positions in which the two group-key tuples differ
    return sum(x != y for x, y in zip(a, b))

grouped = df.groupby(by=['Description', 'Currency', 'Price', 'Ticker']).groups

for k, v in grouped.items():
    # drop a duplicated group only when no other key differs from it in exactly one position
    # (a key's distance to itself is 0, so comparing k against all keys is safe)
    if (len(v) > 1) and (1 not in [hamming_distance(k, j) for j in grouped.keys()]):
        df = df.drop(index=v)

print(df)
Output:
Description Currency Price Ticker
0 Trading EUR 100.32 EQUITIES
1 Trading EUR 100.32 EQUITIES
2 Debt EUR 100.32 EQUITIES
3 Trading USD 98.32 BOND
4 Trading USD 98.32 BOND
5 Trading USD 98.22 BOND
Explanation:
Before dropping, we have:
for k, v in grouped.items():
    print(k, v)
('Debt', 'CAD', '99.35', 'BOND') Int64Index([6, 7, 8], dtype='int64')
('Debt', 'EUR', '100.32', 'EQUITIES') Int64Index([2], dtype='int64')
('Trading', 'EUR', '100.32', 'EQUITIES') Int64Index([0, 1], dtype='int64')
('Trading', 'USD', '98.22', 'BOND') Int64Index([5], dtype='int64')
('Trading', 'USD', '98.32', 'BOND') Int64Index([3, 4], dtype='int64')
We drop all rows for a key tuple if there does not exist another key tuple that differs from it in only one place (this is an assumption on my part; otherwise it's a bit of a free-for-all).
So, ('Trading', 'USD', '98.22', 'BOND')
and ('Trading', 'USD', '98.32', 'BOND')
"save" each other.
('Debt', 'EUR', '100.32', 'EQUITIES')
and ('Trading', 'EUR', '100.32', 'EQUITIES')
"save" each other.
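The "save" relation can be checked directly by computing the Hamming distance between every pair of group keys (a small standalone sketch reusing the hamming_distance helper):

```python
def hamming_distance(a, b):
    # number of positions in which the two key tuples differ
    return sum(x != y for x, y in zip(a, b))

keys = [
    ("Debt", "CAD", "99.35", "BOND"),
    ("Debt", "EUR", "100.32", "EQUITIES"),
    ("Trading", "EUR", "100.32", "EQUITIES"),
    ("Trading", "USD", "98.22", "BOND"),
    ("Trading", "USD", "98.32", "BOND"),
]

# a key is "saved" when some other key differs from it in exactly one position
for k in keys:
    saved = any(hamming_distance(k, j) == 1 for j in keys if j != k)
    print(k, "saved" if saved else "dropped (if duplicated)")
```

Only the CAD key has no neighbour at distance 1, which is why only its rows are dropped.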
CodePudding user response:
From what I read in the comments, do you mean to drop all rows that have the same value in every column? If so, you can use df.drop_duplicates(keep=False)
. By default, subset covers all columns (subset=None
) and the first duplicated row is kept (keep='first'
)
import pandas as pd
data = {
'Description': ['Trading', 'Trading', 'Debt', 'Trading', 'Trading', 'Trading', 'Debt', 'Debt', 'Debt'],
'Currency': ['EUR', 'EUR', 'EUR', 'USD', 'USD', 'USD', 'CAD', 'CAD', 'CAD'],
'Price': [100.32, 100.32, 100.32, 98.32, 98.32, 98.22, 99.35, 99.35, 99.35],
'Ticker': ['EQUITIES', 'EQUITIES', 'EQUITIES', 'BOND', 'BOND', 'BOND', 'BOND', 'BOND', 'BOND']
}
df = pd.DataFrame.from_dict(data)
Output:
Description Currency Price Ticker
0 Trading EUR 100.32 EQUITIES
1 Trading EUR 100.32 EQUITIES
2 Debt EUR 100.32 EQUITIES
3 Trading USD 98.32 BOND
4 Trading USD 98.32 BOND
5 Trading USD 98.22 BOND
6 Debt CAD 99.35 BOND
7 Debt CAD 99.35 BOND
8 Debt CAD 99.35 BOND
We have:
- rows 0 & 1 have the same info -> remove both
- row 2 has a Description different from rows 0 & 1 -> keep
- rows 3 & 4 have the same info -> remove both
- row 5 has a Price different from rows 3 & 4 -> keep
- rows 6, 7 & 8 have the same info -> remove all

This leaves rows 2 & 5.
df = df.drop_duplicates(keep=False)
resulting in:
Description Currency Price Ticker
2 Debt EUR 100.32 EQUITIES
5 Trading USD 98.22 BOND
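To see how the two keep options differ on this data, here is a small standalone comparison (the df is restated so the sketch runs on its own):

```python
import pandas as pd

data = {
    'Description': ['Trading', 'Trading', 'Debt', 'Trading', 'Trading',
                    'Trading', 'Debt', 'Debt', 'Debt'],
    'Currency': ['EUR', 'EUR', 'EUR', 'USD', 'USD', 'USD', 'CAD', 'CAD', 'CAD'],
    'Price': [100.32, 100.32, 100.32, 98.32, 98.32, 98.22, 99.35, 99.35, 99.35],
    'Ticker': ['EQUITIES', 'EQUITIES', 'EQUITIES', 'BOND', 'BOND',
               'BOND', 'BOND', 'BOND', 'BOND'],
}
df = pd.DataFrame.from_dict(data)

# keep='first' (the default) keeps one representative of each duplicated group
print(list(df.drop_duplicates().index))            # -> [0, 2, 3, 5, 6]

# keep=False drops every member of a duplicated group
print(list(df.drop_duplicates(keep=False).index))  # -> [2, 5]
```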