I'm thinking for over a few hours how to fill n rows after filtering in polars with some value.
To give you an example, I'd like to do the following operation in polars.
Given a dataframe with column a
that have 1
s and 2
s, we want to create column b
that:
- Has
True
if1
in columna
. - Has the same number of
True
as number of1
s in columna
for2
s in columna
. This is kind of stratification. Rows to receiveTrue
should be random. - The rest of the rows in
b
has valueFalse
.
This is how I can do it in pandas:
df = pd.DataFrame({
'a': [2, 2, 2, 1, 2, 1]
})
df
a
0 2
1 2
2 2
3 1
4 2
5 1
n = df.shape[0]
n_1 = df['a'].value_counts()[1]
n_2 = n - n_1
df['b'] = False
df.loc[df['a'] == 1, 'b'] = True
idx = df.loc[df['a'] == 2].index[np.random.choice(n_2, n_1, replace=False)]
df.loc[idx, "b"] = True
df
a b
0 2 False
1 2 False
2 2 True
3 1 True
4 2 True
5 1 True
Any help appreciated!
CodePudding user response:
In general, I recommend avoiding "index"-type strategies, as they tend to be slow and inefficient. Also, we want to avoid sorting and re-sorting large datasets, particularly if they have lots of columns.
So instead, what we'll do is construct column b
separately from the original DataFrame, and then insert the finished column b
into the original DataFrame.
Since you are transitioning from Pandas, I'll walk through how we'll do this in Polars, and print the results at each step. (For your final solution, you can combine many of these intermediate steps and eliminate the implicit print statements after each step.)
Data
I'm going to expand your dataset, so that it has more columns than is strictly needed. This will show us how to isolate the columns we need, and how to put the final result back into your DataFrame.
import polars as pl
df = pl.DataFrame({
"a": [2, 2, 2, 1, 2, 1, 2],
"c": ['one', 'two', 'three', 'four', 'five', 'six', 'seven'],
"d": [6.0, 5, 4, 3, 2, 1, 0],
})
df
shape: (7, 3)
┌─────┬───────┬─────┐
│ a ┆ c ┆ d │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞═════╪═══════╪═════╡
│ 2 ┆ one ┆ 6.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ two ┆ 5.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ three ┆ 4.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ four ┆ 3.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ five ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ six ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ seven ┆ 0.0 │
└─────┴───────┴─────┘
Constructing column b
First, we'll create a new DataFrame using only column a
and add row numbers to track the original position of each element. We'll then sort the 1s
to the bottom - we'll see why in a moment.
df_a = df.select('a').with_row_count().sort('a', reverse=True)
df_a
shape: (7, 2)
┌────────┬─────┐
│ row_nr ┆ a │
│ --- ┆ --- │
│ u32 ┆ i64 │
╞════════╪═════╡
│ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 6 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 1 │
└────────┴─────┘
Next we'll count the 1s
and 2s
using the value_counts
method, which creates a new DataFrame with the results.
values = df_a.get_column('a').value_counts().sort('a')
values
shape: (2, 2)
┌─────┬────────┐
│ a ┆ counts │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪════════╡
│ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 5 │
└─────┴────────┘
So we have two 1s
and five 2s
. We'll create two variables with this information that we'll use later.
nr_1, nr_2 = values.get_column('counts')
print(f"{nr_1=}", f"{nr_2=}")
>>> print(f"{nr_1=}", f"{nr_2=}")
nr_1=2 nr_2=5
Now we'll construct the top part of b
, which corresponds to the five 2s
. We'll need three False
and two True
values. We'll use the shuffle
method to randomly shuffle the values. (You can set the seed=
value according to your needs.)
b = (
pl.repeat(True, nr_1, eager=True)
.extend_constant(False, nr_2 - nr_1)
.shuffle(seed=37)
)
b
shape: (5,)
Series: '' [bool]
[
true
false
false
false
true
]
Now let's extend b
with the two True
values that correspond to the 1s
(that we previously sorted to the bottom of our df_a
DataFrame.)
b = b.extend_constant(True, nr_1)
b
shape: (7,)
Series: '' [bool]
[
true
false
false
false
true
true
true
]
And we'll add this column to our df_a
, to see how the values of a
and b
align.
df_a = df_a.select([
pl.all(),
b.alias("b")]
)
df_a
shape: (7, 3)
┌────────┬─────┬───────┐
│ row_nr ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ bool │
╞════════╪═════╪═══════╡
│ 0 ┆ 2 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 2 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6 ┆ 2 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ 1 ┆ true │
└────────┴─────┴───────┘
We see that our two 1s
at the bottom of column a
both correspond to a True
value in b
. And we see that two of the 2s
in column a
correspond to True
values, while the remaining values are False
.
Adding column b
back to our original DataFrame
All that's left to do is restore the original sort order, and insert column b
into our original DataFrame.
df_a = df_a.sort("row_nr")
df = df.select([
pl.all(),
df_a.get_column("b")
])
df
shape: (7, 4)
┌─────┬───────┬─────┬───────┐
│ a ┆ c ┆ d ┆ b │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ bool │
╞═════╪═══════╪═════╪═══════╡
│ 2 ┆ one ┆ 6.0 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ two ┆ 5.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ three ┆ 4.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ four ┆ 3.0 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ five ┆ 2.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ six ┆ 1.0 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ seven ┆ 0.0 ┆ true │
└─────┴───────┴─────┴───────┘
Simplifying
If your original DataFrame is not large, you don't need to create a separate df_a
-- you can sort (and re-sort) the original DataFrame. (But once your datasets get large, unnecessarily sorting lots of additional columns at each step can slow your computations.)
Also, you can combine many of the intermediate steps, as you see fit.