I want to define a Python function that takes a single input, a pandas DataFrame, and counts how many duplicate rows it contains.
I tried this code:
import pandas as pd

def pandasDupl(my_df):
    # keep=False marks every row that has an identical copy, including the first occurrence
    duplicates = my_df.duplicated(keep=False).sum()
    return duplicates

df_test = pd.DataFrame({"A":[3,3,3,3],"B":[5,5,5,3], "C":[5, 5, 5,3], "D": [3,3,3,3]}, index=["a","b","c","d"])
print(df_test)
print(pandasDupl(df_test))
Output: 3
But I only want to count the duplicates without the original row, so I expected 2 as output (identical rows minus the original row). I read that I have to remove keep=False, but when I remove it, an error message appears telling me that the attribute is missing.
CodePudding user response:
Hi and welcome :) Please try to format your code next time.
Have you tried:
df_test = pd.DataFrame({"A":[3,3,3,3],"B":[5,5,5,3], "C":[5, 5, 5,3], "D": [3,3,3,3]}, index=["a","b","c","d"])
df_test.duplicated(keep='first').sum() # 2
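To make the difference between the two keep settings concrete, here is a minimal sketch of the function from the question rewritten with keep='first'. The name count_duplicate_rows is only illustrative and not from the original post.

```python
import pandas as pd


def count_duplicate_rows(my_df):
    """Count rows that repeat an earlier row; the first occurrence is not counted."""
    # keep="first" (the pandas default) marks the first occurrence as False
    # and every later identical row as True.
    return int(my_df.duplicated(keep="first").sum())


df_test = pd.DataFrame(
    {"A": [3, 3, 3, 3], "B": [5, 5, 5, 3], "C": [5, 5, 5, 3], "D": [3, 3, 3, 3]},
    index=["a", "b", "c", "d"],
)

dups_excluding_first = count_duplicate_rows(df_test)               # 2
dups_including_first = int(df_test.duplicated(keep=False).sum())   # 3
print(dups_excluding_first, dups_including_first)
```

Rows a, b and c are identical, so keep=False flags all three, while keep='first' flags only b and c.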
CodePudding user response:
You can use value_counts as well. It returns one entry per distinct row, so its length is the number of distinct rows:
len(df_test.value_counts().index)
# output
2
Dummy sample:
df_test = pd.DataFrame([[1,2,3], [1,2,3], [4,5,6], [3,4,5],[3,4,5]])
# here there are three unique records:
# [1,2,3], [4,5,6] and [3,4,5]
Running the same code:
len(df_test.value_counts().index)
# output
3
I have changed the first value of column B in the data you posted from 5 to 4:
df_test = pd.DataFrame({"A":[3,3,3,3],"B":[4,5,5,3], "C":[5, 5, 5,3], "D": [3,3,3,3]}, index=["a","b","c","d"])
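Note that value_counts gives the number of distinct rows, not the number of duplicates. As a rough sketch (the variable names are mine, and it assumes the frame has no missing values, since value_counts drops rows containing NaN by default), the duplicate count can be derived by subtracting the distinct count from the total row count:

```python
import pandas as pd

# The modified data: the first value of column B is 4 instead of 5.
df_changed = pd.DataFrame(
    {"A": [3, 3, 3, 3], "B": [4, 5, 5, 3], "C": [5, 5, 5, 3], "D": [3, 3, 3, 3]},
    index=["a", "b", "c", "d"],
)

# One entry per distinct row.
n_distinct = len(df_changed.value_counts().index)

# Every row beyond the first of its kind is a duplicate.
n_duplicates = len(df_changed) - n_distinct

# Cross-check against duplicated(), which uses keep="first" by default.
n_duplicates_check = int(df_changed.duplicated().sum())
print(n_distinct, n_duplicates, n_duplicates_check)
```

Here only rows b and c are identical, so there are three distinct rows and one duplicate.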