Count Duplicate Rows excl. the origin within a Dataframe

Time:10-01

I want to define a Python function that takes a single input, a pandas DataFrame, and returns the number of duplicate rows it contains.

I tried this code:

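import pandas as pd

def pandasDupl(my_df):
    # keep=False marks every row of a duplicate group, including the first occurrence
    duplicates = my_df.duplicated(keep=False).sum()
    return duplicates

df_test = pd.DataFrame({"A":[3,3,3,3],"B":[5,5,5,3], "C":[5, 5, 5,3], "D": [3,3,3,3]}, index=["a","b","c","d"])
print(df_test)

# call the function under the same name it was defined with
print(pandasDupl(df_test))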

Output: 3

But I only want to count the duplicates without the original row, so I expected 2 as output (identical rows minus the original row). I read that I have to remove keep=False, but when I remove that part, an error message appears telling me that the attribute is missing.

CodePudding user response:

Hi and welcome :) Please try to format your code next time.

Have you tried:

df_test = pd.DataFrame({"A":[3,3,3,3],"B":[5,5,5,3], "C":[5, 5, 5,3], "D": [3,3,3,3]}, index=["a","b","c","d"])
df_test.duplicated(keep='first').sum()  # 2
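
To make the difference between the keep options concrete, here is a minimal sketch on the same frame. The function name count_duplicates_excluding_first is just an illustrative name of mine, not something from the question. Since duplicated() defaults to keep='first', leaving the argument out should give the same count.

```python
import pandas as pd

df_test = pd.DataFrame(
    {"A": [3, 3, 3, 3], "B": [5, 5, 5, 3], "C": [5, 5, 5, 3], "D": [3, 3, 3, 3]},
    index=["a", "b", "c", "d"],
)

def count_duplicates_excluding_first(my_df):
    # Hypothetical helper: counts repeated rows, not counting each first occurrence.
    return int(my_df.duplicated(keep="first").sum())

# keep=False flags every row that belongs to a duplicate group (a, b, c).
all_marked = int(df_test.duplicated(keep=False).sum())
# keep='first' leaves the first occurrence unflagged (flags b, c).
first_kept = count_duplicates_excluding_first(df_test)
# keep='last' leaves the last occurrence unflagged (flags a, b).
last_kept = int(df_test.duplicated(keep="last").sum())
# No argument: pandas falls back to keep='first'.
default_kept = int(df_test.duplicated().sum())

print(all_marked, first_kept, last_kept, default_kept)  # 3 2 2 2
```

So for the count you are after, keep='first' and keep='last' give the same number; they only differ in which rows get flagged.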

CodePudding user response:

You can use value_counts as well, which gives the number of distinct rows:

len(df_test.value_counts().index)
# output: 2


Dummy sample:

df_test = pd.DataFrame([[1,2,3], [1,2,3], [4,5,6], [3,4,5],[3,4,5]])

# here there are three unique records:
# [1,2,3], [4,5,6] and [3,4,5]

Running the same code:

len(df_test.value_counts().index)
# output: 3

I have changed the first value of column B in your posted data from 5 to 4:
df_test = pd.DataFrame({"A":[3,3,3,3],"B":[4,5,5,3], "C":[5, 5, 5,3], "D": [3,3,3,3]}, index=["a","b","c","d"])
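
Note that the value_counts length is the number of distinct rows, not the number of repeats. A rough sketch of how to get from one to the other, assuming there are no NaN values in the frame (value_counts drops rows containing NaN by default): subtract the distinct count from the total row count. The variable names below are mine, chosen only for illustration.

```python
import pandas as pd

# Dummy sample: 5 rows, 3 distinct.
df_dummy = pd.DataFrame([[1, 2, 3], [1, 2, 3], [4, 5, 6], [3, 4, 5], [3, 4, 5]])

n_distinct = len(df_dummy.value_counts())      # number of distinct rows
n_extra = len(df_dummy) - n_distinct           # repeats, excluding each first occurrence
n_extra_check = int(df_dummy.duplicated(keep="first").sum())
print(n_distinct, n_extra, n_extra_check)      # 3 2 2

# Modified frame from this answer: first value of column B changed to 4.
df_mod = pd.DataFrame(
    {"A": [3, 3, 3, 3], "B": [4, 5, 5, 3], "C": [5, 5, 5, 3], "D": [3, 3, 3, 3]},
    index=["a", "b", "c", "d"],
)
n_distinct_mod = len(df_mod.value_counts())
n_extra_mod = len(df_mod) - n_distinct_mod
print(n_distinct_mod, n_extra_mod)             # 3 1
```

On the original data from the question both numbers happen to be 2, which is why the one-liner appears to match there.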

 
