Once in a while I get to the point where I need to run the following line:

DF[‘is_flagged’] = DF[‘id’].isin(DF2[DF2[‘flag’]==1][‘id’])

Lately I started using polars, and I wonder how to convert it easily to polars.

Edit: For example:

df1 = polars.DataFrame._from_dict({     
'Animal_id': [1, 2, 3, 4, 5, 6, 7],     
'age': [4, 6, 3, 8, 3, 8, 9] })

df2 = polars.DataFrame._from_dict({     
'Animal_id': [1, 2, 3, 4, 5, 6, 7],     
'Animal_type': ['cat', 'dog', 'cat', 'cat', 'dog', 'dog', 'cat'] })

Output:

polars.DataFrame._from_dict({
'animal_id': [1, 2, 3, 4, 5, 6, 7],
'age': [4, 6, 3, 8, 3, 8, 9],
'is_dog': [0, 1, 0, 0, 1, 1, 0]})

Without using flag and then join

I tried to use the is_in() function but this didn’t worked.

CodePudding user response：

How about

df1.with_column(
    df2.with_column(
        pl.col("Animal_type")
        .is_in(["dog"])
        .cast(int)
        .alias("is_dog")
    )["is_dog"]
)

This gives you:

Out[13]:
shape: (7, 3)
┌───────────┬─────┬────────┐
│ Animal_id ┆ age ┆ is_dog │
│ ---       ┆ --- ┆ ---    │
│ i64       ┆ i64 ┆ i64    │
╞═══════════╪═════╪════════╡
│ 1         ┆ 4   ┆ 0      │
│ 2         ┆ 6   ┆ 1      │
│ 3         ┆ 3   ┆ 0      │
│ 4         ┆ 8   ┆ 0      │
│ 5         ┆ 3   ┆ 1      │
│ 6         ┆ 8   ┆ 1      │
│ 7         ┆ 9   ┆ 0      │
└───────────┴─────┴────────┘

CodePudding user response：

So to be honest I am not quite sure from your question, your pandas snippet and your example what your desired solution is, but here are my three takes.

import polars as pl

df1 = pl.DataFrame(
    {"Animal_id": [1, 2, 3, 4, 5, 6, 7], "age": [4, 6, 3, 8, 3, 8, 9]}
).lazy()

df2 = pl.DataFrame(
    {
        "Animal_id": [1, 2, 3, 4, 5, 6, 7],
        "Animal_type": ["cat", "dog", "cat", "cat", "dog", "dog", "cat"],
    }
).lazy()

1 Solution

So this one is a small adaptation of the solution of @ignoring_gravity. So the assumption in his solution is that the DataFrames have the same length and the Animal_id matches in both tables. If that was your goal I want to give you another solution because by subsetting (["is_dog"]) you lose the possibility to use the lazy api.

df1.with_context(
    df2.select(pl.col("Animal_type").is_in(["dog"]).cast(int).alias("is_dog"))
).select(["Animal_id", "age", "is_dog"]).collect()

2 Solution

So in case you want something more similar to your pandas snippet and because you wrote you don't want to have a join.

df1.with_context(
    df2.filter(pl.col("Animal_type") == "dog").select(
        pl.col("Animal_id").alias("valid_ids")
    )
).with_columns(
    [pl.col("Animal_id").is_in(pl.col("valid_ids")).cast(int).alias("is_dog")]
).collect()

3 Solution

So this would be the solution with a join. In my opinion the best solution regarding your example and example output, but maybe there are other reasons that speak against a join, which aren't apparent from your example.

df1.join(
    df2.select(
        ["Animal_id", pl.col("Animal_type").is_in(["dog"]).cast(int).alias("is_dog")]
    ),
    on=["Animal_id"],
).collect()