python-polars casting string to numeric

Time:09-15

When applying pandas.to_numeric, the dtype pandas returns is float64 or int64, depending on the data supplied: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html

Is there an equivalent way to do this in Polars?

I have seen "How to cast a column with data type List[null] to List[i64] in polars", but I don't want to cast each column individually. I have a couple of string columns that I want to turn numeric; the values could be int or float.

# Code to show casting with pandas.to_numeric
import pandas as pd
df = pd.DataFrame({"col1":["1","2"], "col2":["3.5", "4.6"]})
print("DataFrame:")
print(df)
df[["col1","col2"]]=df[["col1","col2"]].apply(pd.to_numeric)
print(df.dtypes)

CodePudding user response:

Unlike pandas, Polars is quite strict about datatypes and rarely casts automatically. (Performance is one of the reasons.)

You can create a feature request for a to_numeric method, but I'm not sure how enthusiastic the response will be.

That said, here are some easy ways to accomplish this.

Create a method

Perhaps the simplest way is to write a method that attempts the cast to integer and then catches the exception. For convenience, you can even attach this method to the Series class itself.

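import polars as pl


def to_numeric(s: pl.Series) -> pl.Series:
    # Try the stricter integer cast first; fall back to float if it fails.
    # Older Polars releases raise ComputeError on a failed strict cast,
    # newer ones raise InvalidOperationError, so catch both.
    try:
        result = s.cast(pl.Int64)
    except (pl.exceptions.ComputeError, pl.exceptions.InvalidOperationError):
        result = s.cast(pl.Float64)
    return result


# Optional convenience: expose the function as a Series method.
pl.Series.to_numeric = to_numeric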

Then to use it (here df is a Polars DataFrame holding the string columns from the question):

(
    pl.select(
        s.to_numeric()
        for s in df
    )
)
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ f64  │
╞══════╪══════╡
│ 1    ┆ 3.5  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 4.6  │
└──────┴──────┘

Use the automatic casting of csv parsing

Another method is to write your columns to a CSV file (in a string buffer), and then have read_csv infer the types automatically. Inference only looks at the first infer_schema_length rows, so you may have to tweak that parameter when the first rows of a column are not representative of the rest.

from io import StringIO
pl.read_csv(StringIO(df.write_csv()))
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ f64  │
╞══════╪══════╡
│ 1    ┆ 3.5  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 4.6  │
└──────┴──────┘