I am trying to transpose a very large dataframe. I used Dask because of the file's size and searched for how to transpose a Dask dataframe.
import pandas as pd
import numpy as np
import dask.dataframe as dd
genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\GENEMATRIX.csv"
genematrix_df = dd.read_csv(genematrix)
new_df = np.transpose(genematrix_df)
new_df.head()
It returns the following traceback:
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
Input In [39], in <cell line: 6>()
4 genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\TSVSMERGED.csv"
5 genematrix_df = dd.read_csv(genematrix)
----> 6 new_df = np.transpose(genematrix_df)
7 new_df.head()
File <__array_function__ internals>:5, in transpose(*args, **kwargs)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:660, in transpose(a, axes)
601 @array_function_dispatch(_transpose_dispatcher)
602 def transpose(a, axes=None):
603 """
604 Reverse or permute the axes of an array; returns the modified array.
605
(...)
658
659 """
--> 660 return _wrapfunc(a, 'transpose', axes)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:54, in _wrapfunc(obj, method, *args, **kwds)
52 bound = getattr(obj, method, None)
53 if bound is None:
---> 54 return _wrapit(obj, method, *args, **kwds)
56 try:
57 return bound(*args, **kwds)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:47, in _wrapit(obj, method, *args, **kwds)
45 if not isinstance(result, mu.ndarray):
46 result = asarray(result)
---> 47 result = wrap(result)
48 return result
File ~\Anaconda3\lib\site-packages\dask\dataframe\core.py:4213, in DataFrame.__array_wrap__(self, array, context)
4210 else:
4211 index = context[1][0].index
-> 4213 return pd.DataFrame(array, index=index, columns=self.columns)
UnboundLocalError: local variable 'index' referenced before assignment
The problem seems to come from an internal function that I have no control over. Do I need to change the way my file is formatted, or should I try to do this in small chunks at a time instead of one massive dataframe?
CodePudding user response:
Seems to be an indentation problem, since the error says that the variable index has not been assigned before the line
return pd.DataFrame(array, index=index, columns=self.columns)
CodePudding user response:
It seems you've uncovered an unrelated bug in dask. This is a known issue (GH#6954), which so far seems to crop up only in situations like this, where you're using dask in a way that doesn't work anyway :)
This bug is just masking the real issue, which is that you cannot transpose a dask.dataframe. A key design feature of dask.dataframes is that the rows (and the number of rows) may be unknown, while the columns must be known; transposing the dataframe would therefore require computing the entire frame. If this really is a matrix, you should probably be using dask.array, or xarray with a dask backend if you need the dimensions to be labeled.
For example, given a dask.dataframe:
import dask.dataframe as dd, pandas as pd, numpy as np
df = dd.from_pandas(pd.DataFrame({'A': np.arange(100, 200), 'B': np.random.random(size=100)}), npartitions=4)
This can be converted to a dask.Array using dask.dataframe.to_dask_array, specifying lengths=True to define the chunk sizes:
In [13]: arr = df.to_dask_array(lengths=True)
In [14]: arr
Out[14]: dask.array<values, shape=(100, 2), dtype=float64, chunksize=(25, 2), chunktype=numpy.ndarray>
This array can now be transposed without computing the graph, using dask.Array.transpose or the equivalent .T property:
In [15]: arr.T
Out[15]: dask.array<transpose, shape=(2, 100), dtype=float64, chunksize=(2, 25), chunktype=numpy.ndarray>
This could be wrapped in an xarray.DataArray if coordinate labels are desired:
In [22]: import xarray as xr
...: da = xr.DataArray(
...: df.to_dask_array(lengths=True),
...: dims=['index', 'columns'],
...: coords=[df.index.compute(), df.columns],
...: )
In [23]: da
Out[23]:
<xarray.DataArray 'values-8d50dbfa8ed951a8ffb2ae5d5cd554bb' (index: 100,
columns: 2)>
dask.array<values, shape=(100, 2), dtype=float64, chunksize=(25, 2), chunktype=numpy.ndarray>
Coordinates:
* index (index) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
* columns (columns) object 'A' 'B'
In [24]: da.T
Out[24]:
<xarray.DataArray 'values-8d50dbfa8ed951a8ffb2ae5d5cd554bb' (columns: 2,
index: 100)>
dask.array<transpose, shape=(2, 100), dtype=float64, chunksize=(2, 25), chunktype=numpy.ndarray>
Coordinates:
* index (index) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
* columns (columns) object 'A' 'B'