Home > Blockchain >  How to remark duplicate records in Excel?
How to remark duplicate records in Excel?

Time:06-10

  1. I have the excel file.

    'Born Date   Student Name  Duplicate 
     2001-01-01     Tommy
     2001-03-04     Annie         
     2001-05-13     Billy
     2001-07-31     Bob
     2001-03-04     Annie         
     2001-04-04     Janet
     2001-11-05     Betty
     2001-12-15     Lucy          
     2001-08-25     Nicky
     2001-12-15     Lucy          
     2001-07-31     John
    
  2. I want to use Python update the excel file like as below result

    Born Date   Student Name  Duplicate 
     2001-01-01     Tommy
     2001-03-04     Annie        Yes
     2001-05-13     Billy
     2001-07-31     Bob
     2001-03-04     Annie        Yes
     2001-04-04     Janet
     2001-11-05     Betty
     2001-12-15     Lucy         Yes
     2001-08-25     Nicky
     2001-12-15     Lucy         Yes
     2001-07-31     John
    
  3. Below is my code... I'm a beginner in Python please teach me how to coding, Thanks!

     import pandas as pd
     df = pd.read_excel('C:/Users/Desktop/Studen List.xlsx')
     dc = df.duplicated()
     print(dc)
    

CodePudding user response:

This may help you to understand a bit further:

import pandas as pd


def insert_boolean_for_duplicates(excelfile):
  df = pd.read_excel(excelfile)
  dc = df.duplicated()
  print(dc)
  print('------------')
  print(df)
  print('------------')
  print(dc.drop_duplicates())
  print('------------')
  print(dc.values)
.
if __name__ == '__main__':
  insert_boolean_for_duplicates('StudentList.xlsx')

I recommend you to check out this link, It may help you out.

CodePudding user response:

What you did is not too far from answering your problem. All you need is to assign the pandas serie returned to a new column in your dataframe (How to add a new column to an existing DataFrame?). I would complete it as follows:

import pandas as pd

df = pd.read_excel('C:/Users/Desktop/Studen List.xlsx') #ideally you would prefere using relative paths (i.e. ./relative_path_to_file/Student_list.xlsx) 
df['Duplicate'] = df.duplicated()
df['Duplicate'] = df['Duplicate'].replace({True: 'Yes', False: 'No'}) # eventually if you want Yes / No labels rather than True / False

df.to_excel('./relative_path/Modified_Student_list.xlsx')

Hope this helps

  • Related