How to fill n random rows after filtering in polars-CodePudding

I'm thinking for over a few hours how to fill n rows after filtering in polars with some value.

To give you an example, I'd like to do the following operation in polars.

Given a dataframe with column a that have 1s and 2s, we want to create column b that:

Has True if 1 in column a.
Has the same number of True as number of 1s in column a for 2s in column a. This is kind of stratification. Rows to receive True should be random.
The rest of the rows in b has value False.

This is how I can do it in pandas:

df = pd.DataFrame({
    'a': [2, 2, 2, 1, 2, 1]
})
df

    a
0   2
1   2
2   2
3   1
4   2
5   1

n = df.shape[0]
n_1 = df['a'].value_counts()[1]
n_2 = n - n_1

df['b'] = False
df.loc[df['a'] == 1, 'b'] = True

idx = df.loc[df['a'] == 2].index[np.random.choice(n_2, n_1, replace=False)]
df.loc[idx, "b"] = True
df

    a   b
0   2   False
1   2   False
2   2   True
3   1   True
4   2   True
5   1   True

Any help appreciated!

CodePudding user response：

In general, I recommend avoiding "index"-type strategies, as they tend to be slow and inefficient. Also, we want to avoid sorting and re-sorting large datasets, particularly if they have lots of columns.

So instead, what we'll do is construct column b separately from the original DataFrame, and then insert the finished column b into the original DataFrame.

Since you are transitioning from Pandas, I'll walk through how we'll do this in Polars, and print the results at each step. (For your final solution, you can combine many of these intermediate steps and eliminate the implicit print statements after each step.)

Data

I'm going to expand your dataset, so that it has more columns than is strictly needed. This will show us how to isolate the columns we need, and how to put the final result back into your DataFrame.

import polars as pl

df = pl.DataFrame({
    "a": [2, 2, 2, 1, 2, 1, 2],
    "c": ['one', 'two', 'three', 'four', 'five', 'six', 'seven'],
    "d": [6.0, 5, 4, 3, 2, 1, 0],
})
df

shape: (7, 3)
┌─────┬───────┬─────┐
│ a   ┆ c     ┆ d   │
│ --- ┆ ---   ┆ --- │
│ i64 ┆ str   ┆ f64 │
╞═════╪═══════╪═════╡
│ 2   ┆ one   ┆ 6.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ two   ┆ 5.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ three ┆ 4.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ four  ┆ 3.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ five  ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ six   ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ seven ┆ 0.0 │
└─────┴───────┴─────┘

Constructing column `b`

First, we'll create a new DataFrame using only column a and add row numbers to track the original position of each element. We'll then sort the 1s to the bottom - we'll see why in a moment.

df_a = df.select('a').with_row_count().sort('a', reverse=True)
df_a

shape: (7, 2)
┌────────┬─────┐
│ row_nr ┆ a   │
│ ---    ┆ --- │
│ u32    ┆ i64 │
╞════════╪═════╡
│ 0      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 6      ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3      ┆ 1   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 5      ┆ 1   │
└────────┴─────┘

Next we'll count the 1s and 2s using the value_counts method, which creates a new DataFrame with the results.

values = df_a.get_column('a').value_counts().sort('a')
values

shape: (2, 2)
┌─────┬────────┐
│ a   ┆ counts │
│ --- ┆ ---    │
│ i64 ┆ u32    │
╞═════╪════════╡
│ 1   ┆ 2      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ 5      │
└─────┴────────┘

So we have two 1s and five 2s. We'll create two variables with this information that we'll use later.

nr_1, nr_2 = values.get_column('counts')
print(f"{nr_1=}", f"{nr_2=}")

>>> print(f"{nr_1=}", f"{nr_2=}")
nr_1=2 nr_2=5

Now we'll construct the top part of b, which corresponds to the five 2s. We'll need three False and two True values. We'll use the shuffle method to randomly shuffle the values. (You can set the seed= value according to your needs.)

b = (
    pl.repeat(True, nr_1, eager=True)
    .extend_constant(False, nr_2 - nr_1)
    .shuffle(seed=37)
)
b

shape: (5,)
Series: '' [bool]
[
        true
        false
        false
        false
        true
]

Now let's extend b with the two True values that correspond to the 1s (that we previously sorted to the bottom of our df_a DataFrame.)

b = b.extend_constant(True, nr_1)
b

shape: (7,)
Series: '' [bool]
[
        true
        false
        false
        false
        true
        true
        true
]

And we'll add this column to our df_a, to see how the values of a and b align.

df_a = df_a.select([
    pl.all(),
    b.alias("b")]
)
df_a

shape: (7, 3)
┌────────┬─────┬───────┐
│ row_nr ┆ a   ┆ b     │
│ ---    ┆ --- ┆ ---   │
│ u32    ┆ i64 ┆ bool  │
╞════════╪═════╪═══════╡
│ 0      ┆ 2   ┆ true  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1      ┆ 2   ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2      ┆ 2   ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4      ┆ 2   ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6      ┆ 2   ┆ true  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3      ┆ 1   ┆ true  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5      ┆ 1   ┆ true  │
└────────┴─────┴───────┘

We see that our two 1s at the bottom of column a both correspond to a True value in b. And we see that two of the 2s in column a correspond to True values, while the remaining values are False.

Adding column `b` back to our original DataFrame

All that's left to do is restore the original sort order, and insert column b into our original DataFrame.

df_a = df_a.sort("row_nr")
df = df.select([
    pl.all(),
    df_a.get_column("b")
])
df

shape: (7, 4)
┌─────┬───────┬─────┬───────┐
│ a   ┆ c     ┆ d   ┆ b     │
│ --- ┆ ---   ┆ --- ┆ ---   │
│ i64 ┆ str   ┆ f64 ┆ bool  │
╞═════╪═══════╪═════╪═══════╡
│ 2   ┆ one   ┆ 6.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ two   ┆ 5.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ three ┆ 4.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆ four  ┆ 3.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ five  ┆ 2.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆ six   ┆ 1.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ seven ┆ 0.0 ┆ true  │
└─────┴───────┴─────┴───────┘

Simplifying

If your original DataFrame is not large, you don't need to create a separate df_a -- you can sort (and re-sort) the original DataFrame. (But once your datasets get large, unnecessarily sorting lots of additional columns at each step can slow your computations.)

Also, you can combine many of the intermediate steps, as you see fit.

Data

Constructing column b

Adding column b back to our original DataFrame

Simplifying

Constructing column `b`

Adding column `b` back to our original DataFrame