Index of unique values in df.column

Time:04-02

I have a large csv data set with a currency column. I need to clean the duplicates.

In :

df1=df.currency.unique()
print(df1)

Out :

['INR' 'USD' 'INR\n' 'USD\n']

I want to get rid of the rows containing strictly 'INR' and 'USD' and keep 'INR\n' and 'USD\n'. I tried many things that didn't work, for example:

df1 = df[df.index.str.contains('USD')]

But it mixes 'USD' and 'USD\n' and that's precisely what I want to avoid. Please advise.

Here is a sample of the data:

      product_code customer_code market_code  order_date  sales_qty  \
0           Prod001        Cus001     Mark001  2017-10-10        100   
2           Prod002        Cus003     Mark003  2018-04-06          1   
3           Prod002        Cus003     Mark003  2018-04-11          1   
4           Prod002        Cus004     Mark003  2018-06-18          6   
5           Prod003        Cus005     Mark004  2017-11-20         59   
...             ...           ...         ...         ...        ...   
150278      Prod339        Cus005     Mark004  2019-04-18          1   
150279      Prod339        Cus020     Mark004  2019-04-23          1   
150280      Prod339        Cus007     Mark004  2019-04-23          1   
150281      Prod339        Cus006     Mark004  2019-04-24          7   
150282      Prod339        Cus032     Mark009  2019-04-24          3   

        sales_amount currency  
0            41241.0      INR  
2              875.0      INR  
3              583.0      INR  
4             7176.0      INR  
5              500.0      USD  
...              ...      ...  
150278         394.0    INR\n  
150279         667.0    INR\n  
150280         625.0    INR\n  
150281        8625.0    INR\n  
150282        3792.0    INR\n  

[150281 rows x 7 columns]

CodePudding user response:

You could filter for '\n' like so:

df1 = df[df.currency.str.contains('\n')]
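If an exact match is preferred over a substring test, the filter can be sketched as follows. This is a minimal example on a hypothetical four-row frame, assuming the column is named currency as in the question; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the question's data; only the currency column matters here.
df = pd.DataFrame({
    "sales_amount": [41241.0, 500.0, 394.0, 667.0],
    "currency": ["INR", "USD", "INR\n", "USD\n"],
})

# Exact match: keep only rows whose currency is literally 'INR\n' or 'USD\n'.
keep_exact = df[df["currency"].isin(["INR\n", "USD\n"])]

# Equivalent for this data: keep rows whose currency ends with a newline.
keep_newline = df[df["currency"].str.endswith("\n")]

print(keep_exact.index.tolist())
```

If the column may contain missing values, passing na=False to str.endswith keeps the boolean mask usable for indexing.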

CodePudding user response:

>>> df1 = ['INR', 'USD', 'INR\n', 'USD\n']
>>> uniq_df1 = set([curr.strip() for curr in df1])
>>> uniq_df1
{'USD', 'INR'}
>>> # if you need it in a list:
>>> uniq_df1 = list(set([curr.strip() for curr in df1]))
>>> uniq_df1
['USD', 'INR']

Note that I haven't preserved the values with \n, because storing them that way is not good practice. A list of currencies is ['USD', 'INR']. This: ['USD\n', 'INR\n'], is a list of currencies each followed by a newline. (Strictly speaking, both are lists of strings, but I use the word currencies for pedagogical reasons.)

Leave formatting to the step that needs it: when printing, saving to a file, or doing anything else that requires a newline, add it there with the proper technique.
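Applied to a DataFrame, the same idea might look like the sketch below. It assumes a column named currency as in the question and uses a hypothetical three-row sample in place of the real CSV; the newline is stripped from the stored values and only reappears when the writer emits line endings.

```python
import io
import pandas as pd

# Hypothetical sample; the real data comes from the asker's CSV file.
df = pd.DataFrame({
    "sales_amount": [41241.0, 500.0, 394.0],
    "currency": ["INR", "USD", "INR\n"],
})

# Strip surrounding whitespace (including the trailing newline) from every value.
df["currency"] = df["currency"].str.strip()
currencies = sorted(df["currency"].unique())

# Line endings are added by the writer at output time, not stored in the data.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
```

Whether the rows that originally carried the newline are true duplicates of other rows depends on the data; if they are, a call to drop_duplicates() after stripping is one way to remove them.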
