Home > OS >  pandas equivalent of SQL distinct
pandas equivalent of SQL distinct

Time:12-06

I have a dataframe with multiple columns and I want to extract the rows that are unique in the manner of the SQL "select distinct" operation. So far whenever I look up forums on this I find comments about counting distinct (but I want the actual values) or (worse) values that are distinct in two columns just joined together as one set (using ravel). What I want is, for example for two columns, values that are distinct in pairs and the result as a dataframe.

I am considering now that the most effective method might be to write it myself - doing a stable sort on tuples and then scanning for duplicates. Any pandas expression that is no simpler than doing essentially that is not an answer to this question. I am looking for a basic or simple compound operation.


For those who do not know what a "distinct" in a query does ...

Starting with

1 2
2 3
1 2
4 5
2 3
2 1

we get back

1 2
2 3
4 5
2 1

Note - the question was asked should (2,1) and (1,2) be considered the same. No, as tuples are ordered. Again - refer to the behaviour of SQL for the details.

CodePudding user response:

To get the unique values of a given column, try pandas.Series.unique():

values = df['column_name'].unique()

To get unique combinations of given columns, try pandas.DataFrame.drop_duplicates():

df.drop_duplicates(subset=['colmun_name1', 'column_name2'])
  • Related