Based on the polars documentation, one can use json_path_match to extract JSON fields into a string Series. But can we do something like pandas.Series.map(json.loads) to convert a whole column of JSON strings at once? One could then convert the loaded JSON Series into another DataFrame with sane dtypes.
I know I can do this first in pandas, but I'm looking for a way to do it in polars.
CodePudding user response:
I should first point out that there is a polars.read_json method. For example:
import polars as pl
import io
json_file = """[{"a":"1", "b":10, "c":[1,2,3]},
{"a":"2", "b":20, "c":[3,4,5]},
{"a":"3.1", "b":30.2, "c":[8,8,8]},
{"a":"4", "b":40.0, "c":[9,9,90]}]
"""
pl.read_json(io.StringIO(json_file))
shape: (4, 3)
┌─────┬──────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list [i64] │
╞═════╪══════╪════════════╡
│ 1 ┆ 10.0 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ [9, 9, 90] │
└─────┴──────┴────────────┘
But to answer your specific question about JSON data already loaded into a Series, I think what you're looking for is the polars.Series.apply method, which will apply a callable to each cell of a Polars Series.
For example, let's say we have the following JSON strings already loaded into a Series in a Polars DataFrame:
import json
import polars as pl
df = pl.DataFrame(
    {
        "json_val": [
            '{"a":"1", "b":10, "c":[1,2,3]}',
            '{"a":"2", "b":20, "c":[3,4,5]}',
            '{"a":"3.1", "b":30.2, "c":[8,8,8]}',
            '{"a":"4", "b":40.0, "c":[9,9,90]}',
        ]
    }
)
print(df)
shape: (4, 1)
┌─────────────────────────────────────┐
│ json_val │
│ --- │
│ str │
╞═════════════════════════════════════╡
│ {"a":"1", "b":10, "c":[1,2,3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"2", "b":20, "c":[3,4,5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"3.1", "b":30.2, "c":[8,8,8... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"4", "b":40.0, "c":[9,9,90]... │
└─────────────────────────────────────┘
We can use apply with the json.loads function. In this example, that will yield a Series of type struct:
df.select(pl.col("json_val").apply(json.loads))
shape: (4, 1)
┌──────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞══════════════════════════╡
│ {"1",10,[1, 2, 3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2",20,[3, 4, 5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1",30,[8, 8, 8]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4",40,[9, 9, 90]} │
└──────────────────────────┘
(One caution: notice how column b has been truncated to an integer.)
Depending on the structure of your JSON, you may also be able to use the polars.DataFrame.unnest function to split the json_val struct column into separate columns.
df.select(pl.col("json_val").apply(json.loads)).unnest("json_val")
shape: (4, 3)
┌─────┬─────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ list [i64] │
╞═════╪═════╪════════════╡
│ 1 ┆ 10 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40 ┆ [9, 9, 90] │
└─────┴─────┴────────────┘
Does this help get you started?
Edit: handling type-conversion issues
One general strategy that I use with any un-typed input file (especially csv files) is to return all values as a string/polars.Utf8 type. That way, I can explicitly convert types later, after I've had a chance to visually inspect the results. (I've been burned too often by "automatic" type conversions.)
The json.loads method has two helpful keyword options, parse_float and parse_int, that will help in this case. We can use simple lambda functions to tell the JSON parser to leave integer and float values as strings.
# define our own translate function to keep floats/ints as strings
def json_translate(json_str: str):
    return json.loads(json_str, parse_float=lambda x: x, parse_int=lambda x: x)
df.select(pl.col("json_val").apply(f=json_translate))
shape: (4, 1)
┌────────────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞════════════════════════════════╡
│ {"1","10",["1", "2", "3"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2","20",["3", "4", "5"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1","30.2",["8", "8", "8"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4","40.0",["9", "9", "90"]} │
└────────────────────────────────┘
Notice that all the integer and float values are left as strings, and remain so when we use the unnest function (the column headers below show "str").
df.select(pl.col("json_val").apply(f=json_translate)).unnest('json_val')
shape: (4, 3)
┌─────┬──────┬──────────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list [str] │
╞═════╪══════╪══════════════════╡
│ 1 ┆ 10 ┆ ["1", "2", "3"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ ["3", "4", "5"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ ["8", "8", "8"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ ["9", "9", "90"] │
└─────┴──────┴──────────────────┘
From this point, you can use Polars' cast expression to convert the strings to the specific numeric types that you want. There are other Stack Overflow questions that can help with the cast step.
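For completeness, here is a minimal sketch of that last cast step, continuing from the df and json_translate defined above. The target dtypes used here (pl.Float64 for columns a and b) are just an assumption for this example; choose whatever matches your data.
# rough sketch: cast the string columns produced by unnest to numeric dtypes
# (assumes the df and json_translate from the examples above)
typed_df = (
    df.select(pl.col("json_val").apply(f=json_translate))
    .unnest("json_val")
    .with_columns(
        [
            pl.col("a").cast(pl.Float64),
            pl.col("b").cast(pl.Float64),
        ]
    )
)
print(typed_df)
Here a and b come back as f64, while the list column c still holds strings and can be converted separately if needed.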