Optimal way to get all duplicated rows in a polars dataframe

I want to filter all duplicated rows from a polars dataframe. What I've tried:

df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']])
df
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 3        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1        ┆ 7        ┆ 9        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1        ┆ 2        ┆ 3        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘

df.filter(pl.all().is_duplicated())
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 3        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1        ┆ 7        ┆ 9        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘

This also picks up the first row: the expression is evaluated column by column, so a row is returned whenever each of its values has a duplicate somewhere within its own column, rather than only when the entire row is duplicated. That is not the intended outcome.
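
Inspecting the per-column flags on the example frame shows why:

df.select(pl.all().is_duplicated())
# column_0 is "1" in every row, so all four rows are flagged there;
# column_1 ("7", "7", "2", "7") flags rows 0, 1 and 3;
# column_2 ("3", "9", "3", "9") flags all four rows.
# filter() effectively AND-combines these per-column masks, which is how
# row 0 ("1", "7", "3") ends up in the result even though it is unique.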

Boolean indexing works:

df[df.is_duplicated(), :]
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 9        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘

But it leaves me wondering:

  • if this is indeed the only way to do it,
  • if there's a way to use .filter() and expressions to achieve the same result,
  • if this is the most efficient way to achieve the desired result.

CodePudding user response:

I am not sure this is optimal by any means, but you could concatenate the columns of each row into a single string and check that for duplicates, namely:

import polars as pl

df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']], columns=["a", "b", "c"])

df.filter(pl.concat_str(["a", "b", "c"]).is_duplicated())

shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 7   ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 7   ┆ 9   │
└─────┴─────┴─────┘
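
Note that without a separator the concatenated strings can collide: for example, the rows ("1", "77", "9") and ("17", "7", "9") both become "1779". Passing a separator makes such collisions much less likely:

# `sep` is the keyword in older Polars releases; newer releases call it `separator`
df.filter(pl.concat_str(["a", "b", "c"], sep="|").is_duplicated())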

CodePudding user response:

In general, the is_duplicated method will likely perform best. Let's take a look at some alternative ways to accomplish this, and do some (very) non-rigorous benchmarking, just to see which ones perform reasonably well.

Some alternatives

One alternative is a filter statement with an over (windowing) expression on all columns. One caution with windowed expressions - they are convenient, but can be costly performance-wise.

df.filter(pl.count("column_1").over(df.columns) > 1)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 9        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘

Another alternative is a groupby, followed by a join. Basically, we'll count the number of times each combination of column values occurs. I'm using a semi join here simply because I don't want to include the count column in my final results.

df.join(
    df=df.groupby(df.columns)
    .agg(pl.count().alias("count"))
    .filter(pl.col("count") > 1),
    on=df.columns,
    how="semi",
)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 7        ┆ 9        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1        ┆ 7        ┆ 9        │
└──────────┴──────────┴──────────┘

Some (very) non-rigorous benchmarking

One way to see which alternatives perform reasonably well is to time the performance on a test dataset that might resemble the datasets that you will use. For lack of something better, I'll stick to something that looks close to the dataset in your question.

Set nbr_rows to something that will challenge your machine. (My machine is a 32-core system, so I'm going to choose a reasonably high number of rows.)

import numpy as np
import string

nbr_rows = 100_000_000
df = pl.DataFrame(
    {
        "col1": np.random.choice(1_000, nbr_rows,),
        "col2": np.random.choice(1_000, nbr_rows,),
        "col3": np.random.choice(list(string.ascii_letters), nbr_rows,),
        "col4": np.random.choice(1_000, nbr_rows,),
    }
)
print(df)
shape: (100000000, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 955  ┆ 186  ┆ j    ┆ 851  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 530  ┆ 199  ┆ d    ┆ 376  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 109  ┆ 609  ┆ G    ┆ 115  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 886  ┆ 487  ┆ d    ┆ 479  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...  ┆ ...  ┆ ...  ┆ ...  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 837  ┆ 406  ┆ Y    ┆ 60   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 467  ┆ 769  ┆ P    ┆ 344  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 548  ┆ 372  ┆ F    ┆ 410  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 379  ┆ 578  ┆ t    ┆ 287  │
└──────┴──────┴──────┴──────┘

Now let's benchmark some alternatives. Since these may or may not resemble your datasets (or your computing platform), I won't run the benchmarks multiple times. For our purposes, we're just trying to weed out alternatives that might perform very poorly.

Alternative: is_duplicated

import time
start = time.perf_counter()
df[df.is_duplicated(),:]
end = time.perf_counter()
print(end - start)
>>> print(end - start)
7.834882180000932

Since the is_duplicated method is provided by the Polars API, we can be reasonably assured that it will perform very well. Indeed, this should be the standard against which we compare other alternatives.
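
As an aside, the question also asked whether .filter() plus expressions can reproduce this result. Two options worth knowing about (sketches only, not benchmarked here): filter() accepts the boolean Series returned by is_duplicated directly, and on recent Polars versions you can build the mask as an expression by packing the columns into a struct, assuming your version supports is_duplicated on struct columns.

# pass the row-level boolean mask straight to filter()
df.filter(df.is_duplicated())

# or build the mask as an expression: treat each whole row as a struct
# and flag duplicated structs (assumes struct columns support is_duplicated)
df.filter(pl.struct(df.columns).is_duplicated())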

Alternative: filter using an over (windowing) expression

start = time.perf_counter()
df.filter(pl.count("col1").over(df.columns) > 1)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
18.136289041000055

As expected, the over (windowing) expression is rather costly.

Alternative: groupby followed by a join

start = time.perf_counter()
df.join(
    df=df.groupby(df.columns)
    .agg(pl.count().alias("count"))
    .filter(pl.col("count") > 1),
    on=df.columns,
    how="semi",
)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
9.419006452999383

Somewhat better ... but not as good as using the is_duplicated method provided by the Polars API.

Alternative: concat_str

Let's also look at an alternative suggested in another answer. To be fair, @FBruzzesi did say "I am not sure this is optimal by any means". But let's look at how it performs.

start = time.perf_counter()
df.filter(pl.concat_str(df.columns, sep='|').is_duplicated())
end = time.perf_counter()
print(end - start)
>>> print(end - start)
37.238660977998734

Did this help? If nothing else, this shows some different ways to use the Polars API.
