This is an example of my code. I'm trying to find a specific row using an if condition. For example:
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin ┆ ymin ┆ xmax ┆ ymax ┆ confidence ┆ class ┆ name │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i64 ┆ str │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 385.432404 ┆ 265.198486 ┆ 402.597534 ┆ 286.265503 ┆ 0.880611 ┆ 0 ┆ corn │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 357.966461 ┆ 424.923828 ┆ 393.622803 ┆ 473.383209 ┆ 0.8493 ┆ 0 ┆ ice │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘
if boxes[(boxes['name']=='corn') & (boxes['ymax']>=250) & (boxes['ymin']<=265)]:
    print("found")
if boxes[(boxes['name']=='ice') & (boxes['ymax']>=460) & (boxes['ymin']<=600) & (boxes['xmin']<=600)]:
    print("found")
The code above works, but when there is no data it raises ValueError: could not convert string to float: 'corn'. I assume this is because the dtype of every column (xmin, ymin, xmax, ymax, confidence, class, name) is automatically set to f32 when no data is inserted. How do I change that, and is there a better way to do the same thing? The code I wrote above doesn't look very optimized to me. (Sorry if my English makes this hard to understand.) I'd appreciate any help.
CodePudding user response:
Setting the data type of each column
To standardize the datatypes of each column in a collection of DataFrames, you can pass a list of (column name, datatype) tuples to the columns keyword when you create the DataFrames.
This ensures that each DataFrame will have columns of the same name and datatype, even if the DataFrame is empty.
From the documentation for DataFrame:
columns: Sequence of str or (str,DataType) pairs, default None
Let's borrow some code from your previous post to see how this would work:
import polars as pl
ca = [
("xmin", pl.Float64),
("ymin", pl.Float64),
("xmax", pl.Float64),
("ymax", pl.Float64),
("confidence", pl.Float64),
("class", pl.Int32),
("name", pl.Utf8),
] # xyxy columns
Notice that each column name is now part of a tuple, along with the data type that we want. I chose Float64 for most of your columns, but you can change that to something more appropriate. Here's a handy list of Polars datatypes.
Let's see how this would work (again, borrowing code from your previous post).
a = [
[],
[
[
370.01605224609375,
346.4305114746094,
398.3968811035156,
384.5684814453125,
0.9011853933334351,
0,
"corn",
]
],
]
for x in a:
    print(pl.DataFrame(x or None, columns=ca, orient="row"))
shape: (0, 7)
┌──────┬──────┬──────┬──────┬────────────┬───────┬──────┐
│ xmin ┆ ymin ┆ xmax ┆ ymax ┆ confidence ┆ class ┆ name │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i32 ┆ str │
╞══════╪══════╪══════╪══════╪════════════╪═══════╪══════╡
└──────┴──────┴──────┴──────┴────────────┴───────┴──────┘
shape: (1, 7)
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin ┆ ymin ┆ xmax ┆ ymax ┆ confidence ┆ class ┆ name │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i32 ┆ str │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 370.016052 ┆ 346.430511 ┆ 398.396881 ┆ 384.568481 ┆ 0.901185 ┆ 0 ┆ corn │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘
Now corresponding columns have the same datatype, even for empty DataFrames. This will help us when we query our DataFrames. (We do not need to cast
datatypes every time we query a DataFrame).
One further note: notice that I changed the DataFrame constructor from:
pl.DataFrame(x, columns=c, orient="row")
to
pl.DataFrame(x or None, columns=c, orient="row")
This is a workaround for cases when your DataFrame is empty. (This workaround may no longer be needed in future versions of Polars.)
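To see why x or None behaves this way: in Python an empty list is falsy, so [] or None evaluates to None, while a non-empty list is passed through unchanged. A minimal sketch in plain Python (no Polars involved; rows_or_none is a hypothetical name used only for illustration):

```python
def rows_or_none(rows):
    """Return None for an empty list of rows, otherwise the list itself."""
    # `or` returns its right operand when the left operand is falsy,
    # and an empty list is falsy.
    return rows or None


empty_result = rows_or_none([])
full_result = rows_or_none([[370.0, 346.4, 398.4, 384.6, 0.9, 0, "corn"]])

print(empty_result)
print(full_result)
```

So the constructor receives None (no data) instead of an empty list, and the column definitions alone decide the datatypes.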
Queries
Now that the datatypes of our columns in every DataFrame are standardized, even for empty DataFrames, we can run queries without concern for converting datatypes.
Let's first create a DataFrame using the data in your example:
_data = [
[385.432404, 265.198486, 402.597534, 286.265503, 0.880611, 0, "corn"],
[357.966461, 424.923828, 393.622803, 473.383209, 0.8493, 0, "ice"],
]
ca = [
("xmin", pl.Float64),
("ymin", pl.Float64),
("xmax", pl.Float64),
("ymax", pl.Float64),
("confidence", pl.Float64),
("class", pl.Int32),
("name", pl.Utf8),
] # xyxy columns
boxes = pl.DataFrame(_data or None, columns=ca, orient="row")
print(boxes)
shape: (2, 7)
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin ┆ ymin ┆ xmax ┆ ymax ┆ confidence ┆ class ┆ name │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i32 ┆ str │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 385.432404 ┆ 265.198486 ┆ 402.597534 ┆ 286.265503 ┆ 0.880611 ┆ 0 ┆ corn │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 357.966461 ┆ 424.923828 ┆ 393.622803 ┆ 473.383209 ┆ 0.8493 ┆ 0 ┆ ice │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘
Let's look at the second of the two queries you posted:
if boxes[(boxes['name']=='ice') & (boxes['ymax']>=460) & (boxes['ymin']<=600) & (boxes['xmin']<=600)]:
    print("found")
In Polars, we run queries using the filter method. We would express this query as:
boxes.filter(
(pl.col("name") == "ice")
& (pl.col("ymax") >= 460)
& (pl.col("ymin") <= 600)
& (pl.col("xmin") <= 600)
)
shape: (1, 7)
┌────────────┬────────────┬────────────┬────────────┬────────────┬───────┬──────┐
│ xmin ┆ ymin ┆ xmax ┆ ymax ┆ confidence ┆ class ┆ name │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i32 ┆ str │
╞════════════╪════════════╪════════════╪════════════╪════════════╪═══════╪══════╡
│ 357.966461 ┆ 424.923828 ┆ 393.622803 ┆ 473.383209 ┆ 0.8493 ┆ 0 ┆ ice │
└────────────┴────────────┴────────────┴────────────┴────────────┴───────┴──────┘
If you want to know if a query returned any records (so that you can use the result in an if statement), use the is_empty method:
my_query = boxes.filter(
(pl.col("name") == "ice")
& (pl.col("ymax") >= 460)
& (pl.col("ymin") <= 600)
& (pl.col("xmin") <= 600)
)
if not my_query.is_empty():
    print("I found records")
>>> my_query = boxes.filter(
... (pl.col("name") == "ice")
... & (pl.col("ymax") >= 460)
... & (pl.col("ymin") <= 600)
... & (pl.col("xmin") <= 600)
...
... )
>>> if not my_query.is_empty():
... print("I found records")
...
I found records
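The is_empty method is not strictly necessary in the Polars version used here, because a non-empty DataFrame is truthy. (Newer Polars releases may raise an error when a DataFrame is used directly in an if statement, so is_empty is the safer choice.) This will also work:
if my_query:
    print("I found records")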
>>> if my_query:
... print("I found records")
...
I found records