I have a CSV file from which I need to remove duplicate rows based on 3 columns. I tried the code below, but it does not remove all of the possible duplicates.
ins.csv:
sr,instrument_token,exchange_token,tradingsymbol,name
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
python:
import pandas as pd
import numpy as np
ins = pd.read_csv('ins.csv')
new_ins = ins[pd.DataFrame(np.sort(ins[['instrument_token','exchange_token','tradingsymbol']].values.astype(str),1)).duplicated()]
new_ins.to_csv('ins.csv', mode='w', header=new_ins.columns.tolist(), index=False)
CodePudding user response:
import pandas as pd
import numpy as np
ins = pd.read_csv('ins.csv')
new_ins = ins.drop_duplicates(['instrument_token','exchange_token','tradingsymbol'], keep='first')
keep='first' is actually the default value, so you don't need to pass it explicitly. It means that only the first occurrence of each duplicate group is kept.
If you want to drop all occurrences of a duplicate, including the first, use keep=False.
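For reference, a minimal end-to-end sketch on the sample file showing both keep options and the write-back step from the question (the names first_kept and all_dropped are just illustrative):
import pandas as pd

ins = pd.read_csv('ins.csv')

# keep the first row of each duplicate group on the 3 key columns
first_kept = ins.drop_duplicates(['instrument_token', 'exchange_token', 'tradingsymbol'])

# drop every row that is duplicated on the 3 key columns, first occurrence included
all_dropped = ins.drop_duplicates(['instrument_token', 'exchange_token', 'tradingsymbol'], keep=False)

# write the result back without the row-number index
first_kept.to_csv('ins.csv', index=False)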
CodePudding user response:
Another way, besides drop_duplicates, is to use groupby.nunique:
df.groupby(['sr', 'instrument_token', 'exchange_token', 'tradingsymbol', 'name']).nunique().reset_index()
Out[24]:
   sr  instrument_token  exchange_token tradingsymbol  name
0   4            367376            2112          nf50  nf50
1   9            361216            2127          nfbf  nfbf
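Note that this groups on all five columns, which works here because the duplicates happen to be full-row duplicates. If you only need uniqueness on the 3 key columns, as the question asks, one sketch, assuming it is acceptable to keep the first value of sr and name per group (the key columns move to the front of the result):
import pandas as pd

ins = pd.read_csv('ins.csv')

# group on the 3 key columns and take the first value of the remaining columns
new_ins = ins.groupby(['instrument_token', 'exchange_token', 'tradingsymbol'], as_index=False).first()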