Using an if condition with a Polars DataFrame

Time:05-08

This is an example of my code. I'm trying to find a specific row using an if condition. For example, with this data:

┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin       ┆ ymin       ┆ xmax       ┆ ymax       ┆ confidence ┆ class ┆ name │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---   ┆ ---  │
│ f64        ┆ f64        ┆ f64        ┆ f64        ┆ f64        ┆ i64   ┆ str  │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 385.432404 ┆ 265.198486 ┆ 402.597534 ┆ 286.265503 ┆ 0.880611   ┆ 0     ┆ corn │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 357.966461 ┆ 424.923828 ┆ 393.622803 ┆ 473.383209 ┆ 0.8493     ┆ 0     ┆ ice  │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘

if boxes[(boxes['name']=='corn') & (boxes['ymax']>=250) & (boxes['ymin']<=265)]:
    print("found")
if boxes[(boxes['name']=='ice') & (boxes['ymax']>=460) & (boxes['ymin']<=600) & (boxes['xmin']<=600)]:
    print("found")

The code above works, but when there is no data it raises ValueError: could not convert string to float: 'corn'. I assume this is because the dtype of every column (xmin, ymin, xmax, ymax, confidence, class, name) is automatically set to f32 when no data is inserted. How do I change that? And is there a better way to do the same thing? The code I wrote above doesn't look very optimized to me. (Sorry if my English makes this hard to understand.) I'd appreciate any help.

CodePudding user response:

Setting the data type of each column

To standardize the datatypes of each column across a collection of DataFrames, you can pass a list of (column name, datatype) tuples to the columns keyword when you create each DataFrame. (Newer Polars releases call this keyword schema.) Every DataFrame then has columns with the same names and datatypes, even when the DataFrame is empty.

From the documentation for DataFrame:

columns: Sequence of str or (str,DataType) pairs, default None

Let's borrow some code from your previous post to see how this would work:

import polars as pl

ca = [
    ("xmin", pl.Float64),
    ("ymin", pl.Float64),
    ("xmax", pl.Float64),
    ("ymax", pl.Float64),
    ("confidence", pl.Float64),
    ("class", pl.Int32),
    ("name", pl.Utf8),
]  # xyxy columns

Notice that each column name is now part of a tuple, along with the datatype that we want. I chose Float64 for most of your columns, but you can change that to something more appropriate. The Polars documentation lists the available datatypes.

Let's see how this would work (again, borrowing code from your previous post).

a = [
    [],
    [
        [
            370.01605224609375,
            346.4305114746094,
            398.3968811035156,
            384.5684814453125,
            0.9011853933334351,
            0,
            "corn",
        ]
    ],
]

for x in a:
    print(pl.DataFrame(x or None, columns=ca, orient="row"))
shape: (0, 7)
┌──────┬──────┬──────┬──────┬────────────┬───────┬──────┐
│ xmin ┆ ymin ┆ xmax ┆ ymax ┆ confidence ┆ class ┆ name │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---        ┆ ---   ┆ ---  │
│ f64  ┆ f64  ┆ f64  ┆ f64  ┆ f64        ┆ i32   ┆ str  │
╞══════╪══════╪══════╪══════╪════════════╪═══════╪══════╡
└──────┴──────┴──────┴──────┴────────────┴───────┴──────┘
shape: (1, 7)
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin       ┆ ymin       ┆ xmax       ┆ ymax       ┆ confidence ┆ class ┆ name │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---   ┆ ---  │
│ f64        ┆ f64        ┆ f64        ┆ f64        ┆ f64        ┆ i32   ┆ str  │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 370.016052 ┆ 346.430511 ┆ 398.396881 ┆ 384.568481 ┆ 0.901185   ┆ 0     ┆ corn │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘

Now corresponding columns have the same datatype, even for empty DataFrames. This will help us when we query our DataFrames. (We do not need to cast datatypes every time we query a DataFrame).

One further note: notice that I changed the DataFrame constructor from:

pl.DataFrame(x, columns=c, orient="row")

to

pl.DataFrame(x or None, columns=c, orient="row")

This is a workaround for cases when your DataFrame is empty. (This workaround may no longer be needed in future versions of Polars.)

Queries

Now that the datatypes of our columns in every DataFrame are standardized, even for empty DataFrames, we can run queries without concern for converting datatypes.

Let's first create a DataFrame using the data in your example:

_data = [
    [385.432404, 265.198486, 402.597534, 286.265503, 0.880611, 0, "corn"],
    [357.966461, 424.923828, 393.622803, 473.383209, 0.8493, 0, "ice"],
]

ca = [
    ("xmin", pl.Float64),
    ("ymin", pl.Float64),
    ("xmax", pl.Float64),
    ("ymax", pl.Float64),
    ("confidence", pl.Float64),
    ("class", pl.Int32),
    ("name", pl.Utf8),
]  # xyxy columns

boxes = pl.DataFrame(_data or None, columns=ca, orient="row")
print(boxes)
shape: (2, 7)
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin       ┆ ymin       ┆ xmax       ┆ ymax       ┆ confidence ┆ class ┆ name │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---   ┆ ---  │
│ f64        ┆ f64        ┆ f64        ┆ f64        ┆ f64        ┆ i32   ┆ str  │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 385.432404 ┆ 265.198486 ┆ 402.597534 ┆ 286.265503 ┆ 0.880611   ┆ 0     ┆ corn │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 357.966461 ┆ 424.923828 ┆ 393.622803 ┆ 473.383209 ┆ 0.8493     ┆ 0     ┆ ice  │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘

Let's look at the second of the two queries you posted:

if boxes[(boxes['name']=='ice') & (boxes['ymax']>=460) & (boxes['ymin']<=600) & (boxes['xmin']<=600)]:
    print("found")

In Polars, queries like this are written with the filter method:

boxes.filter(
    (pl.col("name") == "ice")
    & (pl.col("ymax") >= 460)
    & (pl.col("ymin") <= 600)
    & (pl.col("xmin") <= 600)
)
shape: (1, 7)
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin       ┆ ymin       ┆ xmax       ┆ ymax       ┆ confidence ┆ class ┆ name │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---   ┆ ---  │
│ f64        ┆ f64        ┆ f64        ┆ f64        ┆ f64        ┆ i32   ┆ str  │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 357.966461 ┆ 424.923828 ┆ 393.622803 ┆ 473.383209 ┆ 0.8493     ┆ 0     ┆ ice  │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘

If you want to know if a query returned any records (so that you can use the result in an if statement), use the is_empty method:

my_query = boxes.filter(
    (pl.col("name") == "ice")
    & (pl.col("ymax") >= 460)
    & (pl.col("ymin") <= 600)
    & (pl.col("xmin") <= 600)
)

if not my_query.is_empty():
    print("I found records")
I found records

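On the Polars version used here, the is_empty method is not strictly necessary, because an empty DataFrame is falsy. Newer Polars releases raise a TypeError when a DataFrame is used directly as a condition, so is_empty is the safer choice. On the older version, this also works:

if my_query:
    print("I found records")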
I found records