I'm trying to create a very simple pandas DataFrame from a dictionary. The dictionary has three items, one for each DataFrame column:
- a list with the 'shape' (3,)
- a list or np.array (in different attempts) with the shape (3, 3)
- a constant of 100 (the same value for the whole column)
- Here is the code that succeeds and displays the preferred df
>>> import pandas as pd
>>> # from a dictionary
>>> dict1 = {"x": [1, 2, 3],
...          "y": list(
...              [
...                  [2, 4, 6],
...                  [3, 6, 9],
...                  [4, 8, 12]
...              ]
...          ),
...          "z": 100}
>>> df1 = pd.DataFrame(dict1)
>>> df1
   x           y    z
0  1   [2, 4, 6]  100
1  2   [3, 6, 9]  100
2  3  [4, 8, 12]  100
- But then I assign a NumPy ndarray of shape (3, 3) to the key y and try to create a DataFrame from the dictionary. The line that creates the DataFrame raises an error. Below are the code I run and the error I get (in separate code blocks for ease of reading).
- code
>>> import numpy as np
>>> dict2 = {"x": [1, 2, 3],
...          "y": np.array(
...              [
...                  [2, 4, 6],
...                  [3, 6, 9],
...                  [4, 8, 12]
...              ]
...          ),
...          "z": 100}
>>> df2 = pd.DataFrame(dict2)  # see the block below for the error
- error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
d:\studies\compsci\pyscripts\study\pandas-realpython\data-delightful\01.intro.ipynb Cell 10' in <module>
1 # from a dicitionary
2 dict1 = {"x": [1, 2, 3],
3 "y": np.array(
4 [
(...)
9 ),
10 "z": 100}
---> 12 df1 = pd.DataFrame(dict1)
File ~\anaconda3\envs\dst\lib\site-packages\pandas\core\frame.py:636, in DataFrame.__init__(self, data, index, columns, dtype, copy)
630 mgr = self._init_mgr(
631 data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
632 )
634 elif isinstance(data, dict):
635 # GH#38939 de facto copy defaults to False only in non-dict cases
--> 636 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
637 elif isinstance(data, ma.MaskedArray):
638 import numpy.ma.mrecords as mrecords
File ~\anaconda3\envs\dst\lib\site-packages\pandas\core\internals\construction.py:502, in dict_to_mgr(data, index, columns, dtype, typ, copy)
494 arrays = [
495 x
496 if not hasattr(x, "dtype") or not isinstance(x.dtype, ExtensionDtype)
497 else x.copy()
498 for x in arrays
499 ]
500 # TODO: can we get rid of the dt64tz special case above?
--> 502 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File ~\anaconda3\envs\dst\lib\site-packages\pandas\core\internals\construction.py:120, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
117 if verify_integrity:
118 # figure out the index, if necessary
119 if index is None:
--> 120 index = _extract_index(arrays)
121 else:
122 index = ensure_index(index)
File ~\anaconda3\envs\dst\lib\site-packages\pandas\core\internals\construction.py:661, in _extract_index(data)
659 raw_lengths.append(len(val))
660 elif isinstance(val, np.ndarray) and val.ndim > 1:
--> 661 raise ValueError("Per-column arrays must each be 1-dimensional")
663 if not indexes and not raw_lengths:
664 raise ValueError("If using all scalar values, you must pass an index")
ValueError: Per-column arrays must each be 1-dimensional
Why does the second attempt fail like that, even though both inputs have the same dimensions? What is a workaround for this issue?
CodePudding user response:
If you look more closely at the error message and at the source code it points to:
elif isinstance(val, np.ndarray) and val.ndim > 1:
    raise ValueError("Per-column arrays must each be 1-dimensional")
You will find that if a dictionary value is a NumPy array with more than one dimension, as in your example, pandas raises this error. A list passes because the check only applies to np.ndarray values: a list has a length but no ndim, so even a list of lists is treated as a one-dimensional sequence whose elements happen to be lists.
lst = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
len(lst)  # 3: a list only reports its length, like shape (3,), not (3, 3) as a NumPy array would
You can try np.array([1, 2, 3]) instead; it works because the number of dimensions is 1:
arr = np.array([1,2,3])
print(arr.ndim) # output is 1
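To see the difference between the two inputs side by side, here is a minimal sketch. The names nested, arr and raised are only illustrative, and the exact error text may differ between pandas versions, so the code checks only that a ValueError is raised.

```python
import numpy as np
import pandas as pd

nested = [[2, 4, 6], [3, 6, 9], [4, 8, 12]]
arr = np.array(nested)

# A list only knows its length; it has no ndim attribute.
print(len(nested))          # 3
# The ndarray built from the same data is two-dimensional.
print(arr.ndim, arr.shape)  # 2 (3, 3)

# The nested list is accepted: each inner list becomes one cell of "y".
df_ok = pd.DataFrame({"x": [1, 2, 3], "y": nested, "z": 100})

# The 2-D ndarray is rejected by the ndim check quoted above.
raised = False
try:
    pd.DataFrame({"x": [1, 2, 3], "y": arr, "z": 100})
except ValueError as exc:
    raised = True
    print(exc)
```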
If you need to use a NumPy array inside the dictionary, you can call .tolist() to convert it to a list.
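Applied to the dictionary from the question, the workaround could look like the sketch below. It assumes the array is only needed as a column of per-row lists; the variable arr is introduced here for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[2, 4, 6], [3, 6, 9], [4, 8, 12]])

# .tolist() converts the (3, 3) ndarray into a plain list of three lists,
# so pandas stores each inner list as a single cell of column "y".
dict2 = {"x": [1, 2, 3],
         "y": arr.tolist(),
         "z": 100}

df2 = pd.DataFrame(dict2)
print(df2)
```

The resulting column "y" has object dtype, since each cell holds a Python list rather than a number.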