I am trying to write code that checks whether there are any duplicates in the Unix column of a pandas DataFrame. It is supposed to return a bool value of True for DataSet, since 130 appears twice, and False for DataSet2. How can I get the expected output below?
import pandas as pd
DataSet = pd.DataFrame({'Unix':[130, 140, 150, 130],
'Value':[11,2,3,4]})
DataSet2 = pd.DataFrame({'Unix':[130, 140, 150, 130],
'Value':[11,2,3,4]})
print(DataSet.duplicated(subset=['Unix']).bool())
print(DataSet2.duplicated(subset=['Unix']).bool())
Expected Output:
True
False
CodePudding user response:
Use .any() instead of .bool():
print(DataSet.duplicated(subset=['Unix']).any())
print(DataSet2.duplicated(subset=['Unix']).any())
Output:
True
True
(Note that DataSet and DataSet2 as given in the question are identical, hence True for both. I assume this is just a typo and does not reflect your actual data.)
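As a minimal sketch, assuming DataSet2 was meant to hold distinct Unix values (the value 160 below is a made-up placeholder, not taken from the question), the same call would give the expected output:

```python
import pandas as pd

DataSet = pd.DataFrame({'Unix': [130, 140, 150, 130], 'Value': [11, 2, 3, 4]})
# Hypothetical data: the last Unix value is changed to 160 so that no value repeats.
DataSet2 = pd.DataFrame({'Unix': [130, 140, 150, 160], 'Value': [11, 2, 3, 4]})

# duplicated() returns a boolean Series; any() collapses it to a single value.
has_dupes = DataSet.duplicated(subset=['Unix']).any()
has_dupes2 = DataSet2.duplicated(subset=['Unix']).any()
print(has_dupes)   # True
print(has_dupes2)  # False
```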
CodePudding user response:
You can get True for DataSet with the following code:
print(DataSet["Unix"].duplicated().sum() > 0)
I don't know why DataSet2 would be False.
CodePudding user response:
You just need to call .any(), which returns True if any value in the Series is truthy (conversely, .all() returns True only if every value is truthy):
In [5]: DataSet.duplicated(subset=['Unix']).any()
Out[5]: True
In [6]: DataSet2.duplicated(subset=['Unix']).any()
Out[6]: True
However, I think you may have accidentally copied the wrong data for DataSet2, as it is identical to DataSet, so a False answer wouldn't make sense.
Note that you can also use .is_unique, which is more appropriate for your use case (it reports the opposite condition: True when the column has no duplicates):
In [7]: DataSet.Unix.is_unique
Out[7]: False
In [8]: DataSet2.Unix.is_unique
Out[8]: False
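Since .is_unique is True when nothing repeats, it needs to be negated to answer "are there duplicates?". A small sketch using the DataSet from the question (the variable name has_dupes is only illustrative):

```python
import pandas as pd

DataSet = pd.DataFrame({'Unix': [130, 140, 150, 130], 'Value': [11, 2, 3, 4]})

# is_unique is True when no value repeats, so negate it to get "has duplicates".
has_dupes = not DataSet['Unix'].is_unique
print(has_dupes)  # True
```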
For reference, .any() and .all() are convenience methods that are logically equivalent to applying or and and, respectively, across all the items in a collection.
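To illustrate that equivalence, here is a rough sketch that folds the duplicate mask with operator.or_ and operator.and_ and compares the results with .any() and .all(). It only demonstrates the logic; the fold is not how pandas implements these methods.

```python
from functools import reduce
import operator

import pandas as pd

DataSet = pd.DataFrame({'Unix': [130, 140, 150, 130], 'Value': [11, 2, 3, 4]})
mask = DataSet.duplicated(subset=['Unix'])  # boolean Series, one entry per row
flags = mask.tolist()                       # plain Python bools

# Chain "or" / "and" across every item, starting from the neutral element.
any_via_or = reduce(operator.or_, flags, False)
all_via_and = reduce(operator.and_, flags, True)

print(any_via_or == mask.any())   # True
print(all_via_and == mask.all())  # True
```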