Home > OS >  Highlight duplicate rows in Pandas
Highlight duplicate rows in Pandas

Time:09-22

I'm trying to highlight duplicate rows in several dataframes and export them to excel after. It seems to work when I do it with one dataframe, but as soon as I apply it to another dataframe is highlights the index from the last dataframe for all of them.

Here is my code:

import pandas as pd
import numpy as np

colA = list(range(1,6))
colB = ['aa', 'bb', 'aa', 'cc', 'aa']
colC = [14,3,14,9,12]
colD = [108, 2001, 152, 696, 696]
df = pd.DataFrame(list(zip(colA, colB, colC, colD)), columns =['colA', 'colB', 'colC', 'colD']) 
print(df)

colA = list(range(1,7))
colB = [208, np.nan ,201, 150, 900, 670]
colC = ['aa', np.nan ,'bb', 'aa', 'cc', 'bb']
df2 = pd.DataFrame(list(zip(colA, colB, colC)), columns =['colA', 'colB', 'colC'])
print(df2) 

def highlight_yellow(x):
    return ['background-color: yellow' if x.name in rows else '' for i in x]

row_series = df['colB'].dropna(how = 'any').duplicated(keep = False)
rows = row_series[row_series].index.values
df = df.style.apply(highlight_yellow, axis = 1)

row_series = df2['colC'].dropna(how = 'any').duplicated(keep = False)
rows = row_series[row_series].index.values
df2 = df2.style.apply(highlight_yellow, axis = 1)

When I just run the highlight function on the first dataframe it produces the correct results enter image description here

But when I run it on both dataframes it overrides the correct result in the first dataframe, and highlights the same rows in the first and second dataframe instead.

enter image description here

The end result should look like this:

enter image description here

enter image description here

CodePudding user response:

You can use:

def highlight_yellow(df, cols):
    style = 'background-color: yellow'
    a = np.broadcast_to(np.where(df.duplicated(subset=cols), style, '')[:,None], df.shape)
    return pd.DataFrame(a, index=df.index, columns=df.columns)   

df.style.apply(highlight_yellow, cols=['colB'], axis=None)

output:

enter image description here

df2.style.apply(highlight_yellow, cols=['colC'], axis=None)

enter image description here

CodePudding user response:

The dropna method first removes the rows with nan which affects the rows. You should do it separately something like below:

row_series = df2['colC'].duplicated(keep = False)
row_series[df2['colC'].isna()] = False
  • Related