I'm trying to create a polars dataframe from a dictionary (mainDict) where one of the values of mainDict is a list of dict objects (nestedDicts). When I try to do this I get an error (see below) that I don't know the meaning of. However, pandas does allow me to create a dataframe using mainDict.
I'm not sure whether I'm doing something wrong, if it's a bug, or if this operation simply isn't supported by polars. I'm not too worried about finding a workaround as it should be straightforward (suggestions are welcome), but I'd like to do it this way if possible.
I'm on polars version 0.13.38 on Google Colab (the problem also happens locally in VS Code, with Python 3.9.6 on Windows 10). Below is example code that reproduces the problem, along with its output. Thanks!
INPUT:
import polars as pl
import pandas as pd
template = {'a': ['A', 'AA'],
            'b': ['B', 'BB'],
            'c': ['C', 'CC'],
            'd': [{'D1': 'D2'}, {'DD1': 'DD2'}]}
#create a dataframe using pandas
df_pandas = pd.DataFrame(template)
print(df_pandas)
#create a dataframe using polars
df_polars = pl.DataFrame(template)
print(df_polars)
OUTPUT:
    a   b   c               d
0   A   B   C    {'D1': 'D2'}
1  AA  BB  CC  {'DD1': 'DD2'}
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
<ipython-input-9-2abdc86d91da> in <module>()
12
13 #create a dataframe using polars
---> 14 df_polars = pl.DataFrame(template)
15 print(df_polars)
3 frames
/usr/local/lib/python3.7/dist-packages/polars/internals/frame.py in __init__(self, data, columns, orient)
300
301 elif isinstance(data, dict):
--> 302 self._df = dict_to_pydf(data, columns=columns)
303
304 elif isinstance(data, np.ndarray):
/usr/local/lib/python3.7/dist-packages/polars/internals/construction.py in dict_to_pydf(data, columns)
400 return PyDataFrame(data_series)
401 # fast path
--> 402 return PyDataFrame.read_dict(data)
403
404
/usr/local/lib/python3.7/dist-packages/polars/internals/series.py in __init__(self, name, values, dtype, strict, nan_to_null)
225 self._s = self.cast(dtype, strict=True)._s
226 elif isinstance(values, Sequence):
--> 227 self._s = sequence_to_pyseries(name, values, dtype=dtype, strict=strict)
228 elif _PANDAS_AVAILABLE and isinstance(values, (pd.Series, pd.DatetimeIndex)):
229 self._s = pandas_to_pyseries(name, values)
/usr/local/lib/python3.7/dist-packages/polars/internals/construction.py in sequence_to_pyseries(name, values, dtype, strict)
241 if constructor == PySeries.new_object:
242 try:
--> 243 return PySeries.new_from_anyvalues(name, values)
244 # raised if we cannot convert to Wrap<AnyValue>
245 except RuntimeError:
ComputeError: struct orders must remain the same
Answer:
The error you are receiving is because your list of dictionaries does not conform to the expectations for a Series of struct in Polars. More specifically, your two dictionaries {'D1':'D2'} and {'DD1':'DD2'} are mapped to two different types of structs in Polars and thus are incompatible for inclusion in the same Series.
I'll first need to explain structs ...
Polars: Structs
In Polars, dictionaries are mapped to something called a struct. A struct is an ordered, named collection of typed data. (In this regard, a struct is much like a Polars DataFrame with only one row.)
In a struct:
- each field must have a unique field name
- each field has a datatype
- the order of the fields in a struct matters
Polars: Mapping Dictionaries to Structs
When dictionaries are mapped to structs (e.g., in a DataFrame constructor), each key in the dictionary is mapped to a field name in the struct and the corresponding dictionary value is assigned to the value of that field in the struct.
Also, the order of the keys in the dictionary matters: the fields of the struct are created in the same order as the keys in the dictionary. In Python, it's easy to forget that the keys in a dictionary are ordered.
From the Python documentation for dict: "Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6."
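A quick way to see this in plain Python (no Polars needed): two dictionaries with the same key-value pairs compare equal, yet they iterate in their own insertion order, so they would produce differently ordered fields. The variable names here are only for illustration.

```python
# Two dicts with the same key-value pairs, inserted in a different order.
first = {"x": 1, "y": 2}
second = {"y": 2, "x": 1}

# Dict equality ignores insertion order ...
same_content = first == second

# ... but iteration (and therefore the resulting field order) does not.
first_order = list(first)
second_order = list(second)

print(same_content)   # True
print(first_order)    # ['x', 'y']
print(second_order)   # ['y', 'x']
```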
Polars: Series/Lists of structs
Here's where your input runs into trouble in Polars. A collection of structs can be included in the same Series only if:
- The structs have the same number of fields
- The fields have the same names
- The fields are in the same order
- The datatype of each field is the same for each of the structs.
In your input, {'D1':'D2'} is mapped to a struct with one field having a field name of "D1" and a value of "D2". However, {'DD1':'DD2'} is mapped to a struct with one field having field name "DD1" and value "DD2". As such, the resulting structs are not compatible for inclusion in the same Series. Their field names do not match.
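The four conditions above can be approximated in plain Python to check your data before handing it to Polars. This is only a sketch, not how Polars implements the check: `struct_signature` and `can_share_series` are hypothetical helpers, and using Python's `type` as a stand-in for the Polars datatype is an assumption (it does not account for nulls, for instance).

```python
def struct_signature(record):
    """Return the ordered (field name, Python type) pairs of one dict."""
    return tuple((key, type(value)) for key, value in record.items())


def can_share_series(records):
    """True if every dict has the same fields, in the same order, with the same types."""
    signatures = {struct_signature(record) for record in records}
    return len(signatures) <= 1


original = [{"D1": "D2"}, {"DD1": "DD2"}]
reordered = [{"x": 1, "y": "a"}, {"y": "b", "x": 2}]
matching = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]

print(can_share_series(original))   # False: field names differ
print(can_share_series(reordered))  # False: same fields, different order
print(can_share_series(matching))   # True
```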
In this instance, Polars is far more picky than Pandas, which allows for dictionaries with arbitrary key-value pairs to appear in the same column.
In general, you'll find that Polars is far more opinionated about data structures and data types than Pandas. (And part of the reason is performance-related.)
Workarounds
One workaround for your example is to alter your dictionaries so that they include the same keys, in the same order. For example:
template = {
"a": ["A", "AA"],
"b": ["B", "BB"],
"c": ["C", "CC"],
"d": [{"D1": "D2", "DD1": None}, {"D1": None, "DD1": "DD2"}],
}
pl.DataFrame(template)
shape: (2, 4)
┌─────┬─────┬─────┬──────────────┐
│ a   ┆ b   ┆ c   ┆ d            │
│ --- ┆ --- ┆ --- ┆ ---          │
│ str ┆ str ┆ str ┆ struct[2]    │
╞═════╪═════╪═════╪══════════════╡
│ A   ┆ B   ┆ C   ┆ {"D2",null}  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ AA  ┆ BB  ┆ CC  ┆ {null,"DD2"} │
└─────┴─────┴─────┴──────────────┘
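If the real data has many dictionaries, padding them by hand gets tedious. Here is a minimal sketch of doing it programmatically, assuming that missing keys should become None. `normalize_dicts` is a hypothetical helper written for this answer, not part of Polars.

```python
def normalize_dicts(records):
    """Give every dict the same keys, in the same order, filling gaps with None."""
    # Collect the union of keys in first-seen order (dicts preserve insertion order).
    all_keys = {}
    for record in records:
        for key in record:
            all_keys.setdefault(key, None)
    # Rebuild each dict from the full key list so the field order is identical.
    return [{key: record.get(key) for key in all_keys} for record in records]


nested = [{"D1": "D2"}, {"DD1": "DD2"}]
normalized = normalize_dicts(nested)
print(normalized)
# [{'D1': 'D2', 'DD1': None}, {'D1': None, 'DD1': 'DD2'}]
```

The normalized list can then replace the "d" entry of template before calling pl.DataFrame(template).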
Another easy workaround is to import the data into Pandas first, and then import the Pandas DataFrame into Polars. The import process will do the work for you.
template = {
"a": ["A", "AA"],
"b": ["B", "BB"],
"c": ["C", "CC"],
"d": [{"D1": "D2"}, {"DD1": "DD2"}],
}
pl.DataFrame(pd.DataFrame(template))
shape: (2, 4)
┌─────┬─────┬─────┬──────────────┐
│ a   ┆ b   ┆ c   ┆ d            │
│ --- ┆ --- ┆ --- ┆ ---          │
│ str ┆ str ┆ str ┆ struct[2]    │
╞═════╪═════╪═════╪══════════════╡
│ A   ┆ B   ┆ C   ┆ {"D2",null}  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ AA  ┆ BB  ┆ CC  ┆ {null,"DD2"} │
└─────┴─────┴─────┴──────────────┘
There may be other workarounds, but it will depend on your specific data and needs.