Home > Net >  Deleting duplicates based on condition
Deleting duplicates based on condition

Time:11-08

Here's my code:

import pathlib
from pathlib import Path
import os 
import pandas as pd 

df=pd.DataFrame(
    {
        'col1': ['red','blue','red','red','green','red','yellow', 'red', 'cyan', 'purple', 'pink', 'black', 'orange'],
        'col2': ['red', 'green','orange'],
    }
)

The desired output would be to eliminate every value that appears on col1 and also appears on col2, then drop col2 and then export the document to an excel spread sheet.

The desired df would look like this:

df=pd.DataFrame(
    {
        'col1': ['blue', 'yellow', 'cyan', 'purple', 'pink', 'black',]
    }
)

As of now my code looks this this, but it's not giving me the intended result as it does not drop the duplicate values.

df = df.drop(df[(df['col1'] == df['col2'])].index) 
df = df.drop(['col2'], inplace=True)
os.startfile(File_path, index=False)

Here is the updated code:

import pathlib
from pathlib import Path
import os 
import pandas as pd 

df_output = Path.home().joinpath("Desktop", "Remaining colors.xlsx")

df=pd.DataFrame(
    {
        'col1': ['red','blue','red','red','green','red','yellow', 'red', 'cyan', 'purple', 'pink', 'black', 'orange'],
        'col2': ['red', 'green','orange'],
    }
)

(df.loc[~df['col1'].isin(df["col2"]), ['col1']])

df = df.drop(columns='col2')
df.to_excel(df_output, index=False)
os.startfile(df_output)

CodePudding user response:

You need to use isin and boolean indexing:

(df.loc[~df['col1'].isin(df['col2']), ['col1']]
   .to_excel('filename.xlsx', index=False)
)

Output:

      col1
1     blue
6   yellow
8     cyan
9   purple
10    pink
11   black

Used input:

      col1    col2
0      red     red
1     blue   green
2      red  orange
3      red     NaN
4    green     NaN
5      red     NaN
6   yellow     NaN
7      red     NaN
8     cyan     NaN
9   purple     NaN
10    pink     NaN
11   black     NaN
12  orange     NaN
  • Related