Does Polars support creating a dataframe from a nested dictionary?

Time:05-28

I'm trying to create a polars dataframe from a dictionary (mainDict) where one of the values of mainDict is a list of dict objects (nestedDicts). When I try to do this I get an error (see below) that I don't know the meaning of. However, pandas does allow me to create a dataframe using mainDict.

I'm not sure whether I'm doing something wrong, if it's a bug, or if this operation simply isn't supported by polars. I'm not too worried about finding a workaround as it should be straightforward (suggestions are welcome), but I'd like to do it this way if possible.

I'm on Polars version 0.13.38 on Google Colab (the problem also happens locally in VS Code, with Python 3.9.6 on Windows 10). Below is an example of code that reproduces the problem, along with its output. Thanks!

INPUT:

import polars as pl
import pandas as pd

template = {'a': ['A', 'AA'],
            'b': ['B', 'BB'],
            'c': ['C', 'CC'],
            'd': [{'D1': 'D2'}, {'DD1': 'DD2'}]}

#create a dataframe using pandas
df_pandas = pd.DataFrame(template)
print(df_pandas)

#create a dataframe using polars
df_polars = pl.DataFrame(template)
print(df_polars)

OUTPUT:

    a   b   c               d
0   A   B   C    {'D1': 'D2'}
1  AA  BB  CC  {'DD1': 'DD2'}
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
<ipython-input-9-2abdc86d91da> in <module>()
     12 
     13 #create a dataframe using polars
---> 14 df_polars = pl.DataFrame(template)
     15 print(df_polars)

3 frames
/usr/local/lib/python3.7/dist-packages/polars/internals/frame.py in __init__(self, data, columns, orient)
    300 
    301         elif isinstance(data, dict):
--> 302             self._df = dict_to_pydf(data, columns=columns)
    303 
    304         elif isinstance(data, np.ndarray):

/usr/local/lib/python3.7/dist-packages/polars/internals/construction.py in dict_to_pydf(data, columns)
    400         return PyDataFrame(data_series)
    401     # fast path
--> 402     return PyDataFrame.read_dict(data)
    403 
    404 

/usr/local/lib/python3.7/dist-packages/polars/internals/series.py in __init__(self, name, values, dtype, strict, nan_to_null)
    225                 self._s = self.cast(dtype, strict=True)._s
    226         elif isinstance(values, Sequence):
--> 227             self._s = sequence_to_pyseries(name, values, dtype=dtype, strict=strict)
    228         elif _PANDAS_AVAILABLE and isinstance(values, (pd.Series, pd.DatetimeIndex)):
    229             self._s = pandas_to_pyseries(name, values)

/usr/local/lib/python3.7/dist-packages/polars/internals/construction.py in sequence_to_pyseries(name, values, dtype, strict)
    241             if constructor == PySeries.new_object:
    242                 try:
--> 243                     return PySeries.new_from_anyvalues(name, values)
    244                 # raised if we cannot convert to Wrap<AnyValue>
    245                 except RuntimeError:

ComputeError: struct orders must remain the same

CodePudding user response:

You are getting this error because your list of dictionaries does not meet Polars' requirements for a Series of structs. More specifically, your two dictionaries {'D1':'D2'} and {'DD1':'DD2'} are mapped to two different struct types in Polars, so they cannot be included in the same Series.

I'll first need to explain structs ...

Polars: Structs

In Polars, dictionaries are mapped to something called a struct. A struct is an ordered, named collection of typed data. (In this regard, a struct is much like a Polars DataFrame with only one row.)

In a struct:

  1. each field must have a unique field name
  2. each field has a datatype
  3. the order of the fields in a struct matters

Polars: Mapping Dictionaries to Structs

When dictionaries are mapped to structs (e.g., in a DataFrame constructor), each key in the dictionary is mapped to a field name in the struct and the corresponding dictionary value is assigned to the value of that field in the struct.

Also, the order of the keys in the dictionary matters: the fields of the struct are created in the same order as the keys in the dictionary. In Python, it's easy to forget that the keys in a dictionary are ordered.

As the Python documentation puts it: "Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6."
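To make the ordering point concrete, here is a small plain-Python sketch (no Polars involved; the keys and values are made up for illustration). Two dictionaries can compare equal and still iterate their keys in a different order, and it is the iteration order that would decide the order of the struct fields.

```python
# Two dicts with the same key-value pairs, inserted in a different order.
first = {"D1": "x", "DD1": "y"}
second = {"DD1": "y", "D1": "x"}

# Equality ignores insertion order ...
print(first == second)

# ... but iterating over the keys does not.
print(list(first))
print(list(second))
```

So two dictionaries that look interchangeable in ordinary Python code can still describe two different field orders.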

Polars: Series/Lists of structs

Here's where your input runs into trouble in Polars. A collection of structs can be included in the same Series only if:

  1. The structs have the same number of fields
  2. The fields have the same names
  3. The fields are in the same order
  4. The datatype of each field is the same for each of the structs.

In your input, {'D1':'D2'} is mapped to a struct with one field having a field name of "D1" and a value of "D2". However, {'DD1':'DD2'} is mapped to a struct with one field having field name "DD1" and value "DD2". As such, the resulting structs are not compatible for inclusion in the same Series. Their field names do not match.
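If you want to spot this kind of mismatch before handing the data to Polars, you can compare the key order of the dictionaries yourself. This is only a rough sketch in plain Python: the helper name same_struct_layout is hypothetical (it is not part of Polars), and it checks only field names and their order, not the datatypes of the values.

```python
def same_struct_layout(dicts):
    """Return True if every dict has the same keys in the same order.

    Hypothetical helper for illustration; it does not check value types.
    """
    # tuple(d) yields the keys of d in insertion order.
    layouts = {tuple(d) for d in dicts}
    return len(layouts) <= 1


# The dictionaries from the question: different field names.
print(same_struct_layout([{"D1": "D2"}, {"DD1": "DD2"}]))

# Same keys, same order: these can share one struct layout.
print(same_struct_layout([{"D1": "D2", "DD1": None},
                          {"D1": None, "DD1": "DD2"}]))
```

The first call prints False and the second prints True.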

In this instance, Polars is far more picky than Pandas, which allows for dictionaries with arbitrary key-value pairs to appear in the same column.

In general, you'll find that Polars is far more opinionated about data structures and data types than Pandas. (And part of the reason is performance-related.)

Workarounds

One workaround for your example is to alter your dictionaries so that they include the same keys, in the same order. For example:

template = {
    "a": ["A", "AA"],
    "b": ["B", "BB"],
    "c": ["C", "CC"],
    "d": [{"D1": "D2", "DD1": None}, {"D1": None, "DD1": "DD2"}],
}
pl.DataFrame(template)
shape: (2, 4)
┌─────┬─────┬─────┬──────────────┐
│ a   ┆ b   ┆ c   ┆ d            │
│ --- ┆ --- ┆ --- ┆ ---          │
│ str ┆ str ┆ str ┆ struct[2]    │
╞═════╪═════╪═════╪══════════════╡
│ A   ┆ B   ┆ C   ┆ {"D2",null}  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ AA  ┆ BB  ┆ CC  ┆ {null,"DD2"} │
└─────┴─────┴─────┴──────────────┘

Another easy workaround is to load the data into Pandas first, and then convert the Pandas DataFrame to Polars. As the output below shows, the conversion produces a single struct type containing both keys and fills the missing values with null.

template = {
    "a": ["A", "AA"],
    "b": ["B", "BB"],
    "c": ["C", "CC"],
    "d": [{"D1": "D2"}, {"DD1": "DD2"}],
}
pl.DataFrame(pd.DataFrame(template))
shape: (2, 4)
┌─────┬─────┬─────┬──────────────┐
│ a   ┆ b   ┆ c   ┆ d            │
│ --- ┆ --- ┆ --- ┆ ---          │
│ str ┆ str ┆ str ┆ struct[2]    │
╞═════╪═════╪═════╪══════════════╡
│ A   ┆ B   ┆ C   ┆ {"D2",null}  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ AA  ┆ BB  ┆ CC  ┆ {null,"DD2"} │
└─────┴─────┴─────┴──────────────┘

There may be other workarounds, but it will depend on your specific data and needs.
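For instance, if you would rather not route the data through Pandas, you could pad the dictionaries in plain Python so that they all share the same keys in the same order. The sketch below assumes that filling missing keys with None is acceptable for your data; the helper name normalize_dicts is hypothetical and not part of Polars.

```python
def normalize_dicts(dicts):
    """Give every dict the same keys, in first-seen order, padding with None.

    Hypothetical helper for illustration only.
    """
    # dict.fromkeys keeps the first occurrence of each key, in order.
    keys = list(dict.fromkeys(key for d in dicts for key in d))
    return [{key: d.get(key) for key in keys} for d in dicts]


template = {
    "a": ["A", "AA"],
    "b": ["B", "BB"],
    "c": ["C", "CC"],
    "d": [{"D1": "D2"}, {"DD1": "DD2"}],
}

# Replace the nested dicts with their padded versions.
template["d"] = normalize_dicts(template["d"])
print(template["d"])
```

After this step the "d" column holds the same dictionaries as in the first workaround, so template can be passed to pl.DataFrame in the same way.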
