I have a large CSV data set with a currency column. I need to clean the duplicates.
In :
df1=df.currency.unique()
print(df1)
Out :
['INR' 'USD' 'INR\n' 'USD\n']
I want to get rid of the rows containing strictly 'INR' and 'USD' and keep 'INR\n' and 'USD\n'. I tried many things that didn't work, for example:
df1 = df[df.currency.str.contains('USD')]
But it matches both 'USD' and 'USD\n', and that's precisely what I want to avoid. Please advise.
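A minimal reproduction of the issue, using a small made-up frame in place of the real data:

```python
import pandas as pd

# Hypothetical frame standing in for the real currency column
df = pd.DataFrame({"currency": ["USD", "USD\n", "INR", "INR\n"]})

# str.contains does substring matching, so 'USD' also hits 'USD\n'
mixed = df[df.currency.str.contains("USD")]
print(mixed.currency.tolist())  # ['USD', 'USD\n']
```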
Here is a sample of the data:
product_code customer_code market_code order_date sales_qty \
0 Prod001 Cus001 Mark001 2017-10-10 100
2 Prod002 Cus003 Mark003 2018-04-06 1
3 Prod002 Cus003 Mark003 2018-04-11 1
4 Prod002 Cus004 Mark003 2018-06-18 6
5 Prod003 Cus005 Mark004 2017-11-20 59
... ... ... ... ... ...
150278 Prod339 Cus005 Mark004 2019-04-18 1
150279 Prod339 Cus020 Mark004 2019-04-23 1
150280 Prod339 Cus007 Mark004 2019-04-23 1
150281 Prod339 Cus006 Mark004 2019-04-24 7
150282 Prod339 Cus032 Mark009 2019-04-24 3
sales_amount currency
0 41241.0 INR
2 875.0 INR
3 583.0 INR
4 7176.0 INR
5 500.0 USD
... ... ...
150278 394.0 INR\n
150279 667.0 INR\n
150280 625.0 INR\n
150281 8625.0 INR\n
150282 3792.0 INR\n
[150281 rows x 7 columns]
CodePudding user response:
You could filter the currency column for '\n' like so:
df1 = df[df.currency.str.contains('\n')]
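A self-contained sketch of that filter, with a made-up frame since the full data isn't shown; str.endswith('\n') also works and is a bit more explicit than a substring match:

```python
import pandas as pd

# Hypothetical frame standing in for the real data
df = pd.DataFrame({
    "sales_amount": [41241.0, 500.0, 394.0, 667.0],
    "currency": ["INR", "USD", "INR\n", "USD\n"],
})

# Keep only the rows whose currency ends with a newline
df1 = df[df.currency.str.endswith("\n")]
print(df1.currency.tolist())  # ['INR\n', 'USD\n']
```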
CodePudding user response:
>>> df1 = ['INR', 'USD', 'INR\n', 'USD\n']
>>> uniq_df1 = set([curr.strip() for curr in df1])
>>> uniq_df1
{'USD', 'INR'}
>>> # if you need it in a list:
>>> uniq_df1 = list(set([curr.strip() for curr in df1]))
>>> uniq_df1
['USD', 'INR']
You'll notice that I haven't preserved the ones with '\n', because keeping them is not good practice. A list of currencies is a list like ['USD', 'INR']. This: ['USD\n', 'INR\n'] is a list of currencies with trailing newlines. (Of course, strictly speaking both are lists of strings, but for pedagogical reasons I used the word currencies.) Leave the formatting to printing, file-saving, or anything else that needs a newline, using the proper technique for it.
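Applied to the DataFrame itself rather than a plain list, the same idea is to normalise the column with str.strip() so the stray newlines disappear before any deduplication (a sketch with made-up data):

```python
import pandas as pd

# Hypothetical frame containing both raw and newline-suffixed codes
df = pd.DataFrame({"currency": ["INR", "USD", "INR\n", "USD\n"]})

# Normalise in place: strip surrounding whitespace, including '\n'
df["currency"] = df["currency"].str.strip()

print(df.currency.unique().tolist())  # ['INR', 'USD']
```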