I have a dataframe with multiple columns and I want to extract the rows that are unique in the manner of the SQL "select distinct" operation. So far whenever I look up forums on this I find comments about counting distinct (but I want the actual values) or (worse) values that are distinct in two columns just joined together as one set (using ravel). What I want is, for example for two columns, values that are distinct in pairs and the result as a dataframe.
I am considering now that the most effective method might be to write it myself - doing a stable sort on tuples and then scanning for duplicates. Any pandas expression that is no simpler than doing essentially that is not an answer to this question. I am looking for a basic or simple compound operation.
For those who do not know what a "distinct" in a query does ...
Starting with
1 2
2 3
1 2
4 5
2 3
2 1
we get back
1 2
2 3
4 5
2 1
Note - the question was asked should (2,1) and (1,2) be considered the same. No, as tuples are ordered. Again - refer to the behaviour of SQL for the details.
CodePudding user response:
To get the unique values of a given column, try pandas.Series.unique()
:
values = df['column_name'].unique()
To get unique combinations of given columns, try pandas.DataFrame.drop_duplicates()
:
df.drop_duplicates(subset=['colmun_name1', 'column_name2'])