I am trying to transpose a very large dataframe. I used Dask because of the file's size and searched for how to transpose a Dask dataframe.
import pandas as pd
import numpy as np
import dask.dataframe as dd
genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\GENEMATRIX.csv"
genematrix_df = dd.read_csv(genematrix)
new_df = np.transpose(genematrix_df)
new_df.head()
It returns the following traceback:
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
Input In [39], in <cell line: 6>()
4 genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\TSVSMERGED.csv"
5 genematrix_df = dd.read_csv(genematrix)
----> 6 new_df = np.transpose(genematrix_df)
7 new_df.head()
File <__array_function__ internals>:5, in transpose(*args, **kwargs)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:660, in transpose(a, axes)
601 @array_function_dispatch(_transpose_dispatcher)
602 def transpose(a, axes=None):
603 """
604 Reverse or permute the axes of an array; returns the modified array.
605
(...)
658
659 """
--> 660 return _wrapfunc(a, 'transpose', axes)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:54, in _wrapfunc(obj, method, *args, **kwds)
52 bound = getattr(obj, method, None)
53 if bound is None:
---> 54 return _wrapit(obj, method, *args, **kwds)
56 try:
57 return bound(*args, **kwds)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:47, in _wrapit(obj, method, *args, **kwds)
45 if not isinstance(result, mu.ndarray):
46 result = asarray(result)
---> 47 result = wrap(result)
48 return result
File ~\Anaconda3\lib\site-packages\dask\dataframe\core.py:4213, in DataFrame.__array_wrap__(self, array, context)
4210 else:
4211 index = context[1][0].index
-> 4213 return pd.DataFrame(array, index=index, columns=self.columns)
UnboundLocalError: local variable 'index' referenced before assignment
The problem seems to come from an internal function that I have no control over. Do I need to change the way my file is formatted, or should I try to do this in small chunks at a time instead of one massive dataframe?
CodePudding user response:
Seems to be an indentation problem, since the error says that the variable index has not been assigned before the line
return pd.DataFrame(array, index=index, columns=self.columns)
CodePudding user response:
It seems you've uncovered an unrelated bug in dask. This is a known issue (GH#6954), which so far seems to crop up only in situations like this, where you're using dask in a way that doesn't work anyway :)
This bug is just masking the real issue, which is that you cannot transpose a dask.dataframe. A key design feature of dask.dataframes is that the rows (and the number of rows) may be unknown, while the columns must be known; transposing the dataframe would therefore require computing the entire frame. If this really is a matrix, you should probably be using dask.array, or xarray with a dask backend if you need the dimensions to be labeled.
For example, given a dask.dataframe:
import dask.dataframe as dd, pandas as pd, numpy as np
df = dd.from_pandas(pd.DataFrame({'A': np.arange(100, 200), 'B': np.random.random(size=100)}), npartitions=4)
This can be converted to a dask.Array using dask.dataframe.to_dask_array, specifying lengths=True to define the chunk sizes:
In [13]: arr = df.to_dask_array(lengths=True)
In [14]: arr
Out[14]: dask.array<values, shape=(100, 2), dtype=float64, chunksize=(25, 2), chunktype=numpy.ndarray>
This array can now be transposed without computing the graph, using dask.Array.transpose or the equivalent .T property:
In [15]: arr.T
Out[15]: dask.array<transpose, shape=(2, 100), dtype=float64, chunksize=(2, 25), chunktype=numpy.ndarray>
This could be wrapped in an xarray.DataArray if coordinate labels are desired:
In [22]: import xarray as xr
...: da = xr.DataArray(
...: df.to_dask_array(lengths=True),
...: dims=['index', 'columns'],
...: coords=[df.index.compute(), df.columns],
...: )
In [23]: da
Out[23]:
<xarray.DataArray 'values-8d50dbfa8ed951a8ffb2ae5d5cd554bb' (index: 100,
columns: 2)>
dask.array<values, shape=(100, 2), dtype=float64, chunksize=(25, 2), chunktype=numpy.ndarray>
Coordinates:
* index (index) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
* columns (columns) object 'A' 'B'
In [24]: da.T
Out[24]:
<xarray.DataArray 'values-8d50dbfa8ed951a8ffb2ae5d5cd554bb' (columns: 2,
index: 100)>
dask.array<transpose, shape=(2, 100), dtype=float64, chunksize=(2, 25), chunktype=numpy.ndarray>
Coordinates:
* index (index) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
* columns (columns) object 'A' 'B'