Based on the polars documentation, one can use json_path_match to extract JSON fields into a string Series. But can we do something like pandas.Series.map(json.loads) to convert a whole column of JSON strings at once? One could then convert the loaded JSON Series into another DataFrame with sane dtypes.
I know I can do this first in pandas, but I'm looking for a way to do it in polars.
CodePudding user response:
I should first point out that there is a polars.read_json method. For example:
import polars as pl
import io
json_file = """[{"a":"1", "b":10, "c":[1,2,3]},
{"a":"2", "b":20, "c":[3,4,5]},
{"a":"3.1", "b":30.2, "c":[8,8,8]},
{"a":"4", "b":40.0, "c":[9,9,90]}]
"""
pl.read_json(io.StringIO(json_file))
shape: (4, 3)
┌─────┬──────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list [i64] │
╞═════╪══════╪════════════╡
│ 1 ┆ 10.0 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ [9, 9, 90] │
└─────┴──────┴────────────┘
But to answer your specific question about JSON data already loaded into a Series, I think what you're looking for is the polars.Series.apply method, which will apply a callable to each cell of a Polars Series.
For example, let's say we have the following JSON strings already loaded into a Series in a Polars DataFrame:
import json
import polars as pl
df = pl.DataFrame(
    {
        "json_val": [
            '{"a":"1", "b":10, "c":[1,2,3]}',
            '{"a":"2", "b":20, "c":[3,4,5]}',
            '{"a":"3.1", "b":30.2, "c":[8,8,8]}',
            '{"a":"4", "b":40.0, "c":[9,9,90]}',
        ]
    }
)
print(df)
shape: (4, 1)
┌─────────────────────────────────────┐
│ json_val │
│ --- │
│ str │
╞═════════════════════════════════════╡
│ {"a":"1", "b":10, "c":[1,2,3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"2", "b":20, "c":[3,4,5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"3.1", "b":30.2, "c":[8,8,8... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"a":"4", "b":40.0, "c":[9,9,90]... │
└─────────────────────────────────────┘
We can use apply with the json.loads function. In this example, that will yield a Series of type struct:
df.select(pl.col("json_val").apply(json.loads))
shape: (4, 1)
┌──────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞══════════════════════════╡
│ {"1",10,[1, 2, 3]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2",20,[3, 4, 5]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1",30,[8, 8, 8]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4",40,[9, 9, 90]} │
└──────────────────────────┘
(One caution: notice how column b has been truncated to an integer.)
Depending on the structure of your JSON, you may also be able to use the polars.DataFrame.unnest function to split the json_val struct column into separate columns.
df.select(pl.col("json_val").apply(json.loads)).unnest("json_val")
shape: (4, 3)
┌─────┬─────┬────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ list [i64] │
╞═════╪═════╪════════════╡
│ 1 ┆ 10 ┆ [1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ [3, 4, 5] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30 ┆ [8, 8, 8] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40 ┆ [9, 9, 90] │
└─────┴─────┴────────────┘
Does this help get you started?
Edit: handling type-conversion issues
One general strategy that I use with any un-typed input file (especially csv files) is to return all values as a string/polars.Utf8 type. That way, I can explicitly convert types later, after I've had a chance to visually inspect the results. (I've been burned too often by "automatic" type conversions.)
The json.loads method has two helpful keyword options, parse_float and parse_int, that will help in this case. We can use simple lambda functions to tell the JSON parser to leave integer and float values as strings.
# define our own translate function to keep floats/ints as strings
def json_translate(json_str: str):
    return json.loads(json_str, parse_float=lambda x: x, parse_int=lambda x: x)
df.select(pl.col("json_val").apply(f=json_translate))
shape: (4, 1)
┌────────────────────────────────┐
│ json_val │
│ --- │
│ struct[3]{'a', 'b', 'c'} │
╞════════════════════════════════╡
│ {"1","10",["1", "2", "3"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"2","20",["3", "4", "5"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"3.1","30.2",["8", "8", "8"]} │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ {"4","40.0",["9", "9", "90"]} │
└────────────────────────────────┘
Notice that all the integer and float values are left as strings, and remain so when we use the unnest function (the column headers below show "str").
df.select(pl.col("json_val").apply(f=json_translate)).unnest('json_val')
shape: (4, 3)
┌─────┬──────┬──────────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list [str] │
╞═════╪══════╪══════════════════╡
│ 1 ┆ 10 ┆ ["1", "2", "3"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ ["3", "4", "5"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.1 ┆ 30.2 ┆ ["8", "8", "8"] │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ ["9", "9", "90"] │
└─────┴──────┴──────────────────┘
From this point, you can use Polars' cast expression to convert the strings to the specific numeric types that you want. There are other Stack Overflow questions that can help with the cast step.
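For completeness, here is a minimal sketch of that last cast step, continuing from the df and json_translate defined above. The target dtypes used here (pl.Float64 for columns a and b) are just an assumption for this example; choose whatever matches your data.
# rough sketch: cast the string columns produced by unnest to numeric dtypes
# (assumes the df and json_translate from the examples above)
typed_df = (
    df.select(pl.col("json_val").apply(f=json_translate))
    .unnest("json_val")
    .with_columns(
        [
            pl.col("a").cast(pl.Float64),
            pl.col("b").cast(pl.Float64),
        ]
    )
)
print(typed_df)
Here a and b come back as f64, while the list column c still holds strings and can be converted separately if needed.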