Home > Net >  how to find duplicated rows of data in csv via python and output to a new csv
how to find duplicated rows of data in csv via python and output to a new csv

Time:10-21

Currently finding the duplicates but the data is not showing the row number, name and number and isn't outputting correctly (See below for expected output).

Here are the sample files: problem

Current result

current result

Currently Stuck code

https://pastebin.com/ypeGAQLQ

import pandas as pd
import os


df_state=pd.read_csv(r'D:\NP Year 3\CNETF\Labs\Lab 2\A2c Python Exercises\2 Duplicated\names_dup2.csv', quoting=1, header=None)[0] \
            .str.split('\t', expand=True) \
            .duplicated() \
            .to_csv('D:\\NP Year 3\\CNETF\\Labs\\Lab 2\\A2c Python Exercises\\2 Duplicated\\duplicated_data.csv', index=False, header=False)
print(df_state)

CodePudding user response:

This is happening because .duplicated returns a boolean series (True/False), which you are saving directly.

But you should be using this to subset the data, like so:

import pandas as pd
import os


df_state = pd.DataFrame(
                [["3 Liu Yu,876"],
                ["4 Koh chong,123"],
                ["3 Liu Yu,876"]])

df_state = df_state[0].str.split(" ", expand= True)
print(df_state, "\n")

duplicated = df_state.duplicated() # just a boolean series
print(duplicated, "\n")

print(df_state[duplicated], "\n")  ## <- subset and save with .to_csv

# as Anders Källmar points out, you can also do this:

all_duplicated = df_state.duplicated(keep= False)
print(df_state[all_duplicated])


Output:

   0    1          2
0  3  Liu     Yu,876
1  4  Koh  chong,123
2  3  Liu     Yu,876 

0    False
1    False
2     True
dtype: bool 

   0    1       2
2  3  Liu  Yu,876 

   0    1       2
0  3  Liu  Yu,876
2  3  Liu  Yu,876

CodePudding user response:

Use df.duplicated with keep=False to get a boolean mask of your dup rows then extract rows:

df = pd.read_csv('sample_dup.csv', header=None)
print(df)

                 0         1
0         CAIN TAN  86092142
1     YEO KIAT JUN  81901613
2       LIU YU YEE  89267062  # dup group 1 (1/2)
3   KOH CHONG YONG  89087115  # dup group 2 (1/3)
4   ROGER KOH KIAT  85875432
5     ONG JUN KANG  82065756
6       LIU YU YEE  89267062  # dup group 1 (2/2)
7   CHEW CHONG JIE  88326589
8   KOH CHONG YONG  89087115  # dup group 2 (2/3)
9      LEE SENG RI  87050554
10   ZHOU JING SHU  87204345
11      LOY KOH KE  83824725
12  WONG KANG SENG  88391206
13  CHEW KIAT KIAT  89209079
14  KOH CHONG YONG  89087115  # dup group 2 (3/3)
# 0 for your first column
>>> df[df[0].duplicated(keep=False)]
                 0         1
2       LIU YU YEE  89267062
3   KOH CHONG YONG  89087115
6       LIU YU YEE  89267062
8   KOH CHONG YONG  89087115
14  KOH CHONG YONG  89087115
  • Related