Trying to convert pandas df to np array, dtaidistance computes list instead

Time:04-21

I am attempting to compute the distance matrix for an ndarray that I have converted from pandas. I tried to convert the pandas df currently in this format:

move_df = 
        movement
0       [4, 3, 6, 2]
1       [5, 2, 3, 6, 2]
2       [4, 7, 2, 3, 6, 1]
3       [4, 4, 4, 3]
...     ...
33410   [2, 6, 3, 1, 8]
[33410 x 1 columns]

to a numpy ndarray by using the following:

1) m = move_df.to_numpy() 
2) m = pd.DataFrame(move_df.tolist()).values
3) m = [move_df.tolist() for i in move_df.columns]

Each of these conversions resulted in a numpy array in this format:

[[list([4, 3, 6, 2])]
 [list([5, 2, 3, 6, 2])]
 [list([4, 7, 2, 3, 6, 1])]
 [list([4, 4, 4, 3])]
 ...
 [list([2, 6, 3, 1, 8])]]

So when I try to run dtaidistance matrix, I get the following error:

d_m = dtw.distance_matrix(m)

TypeError: unsupported operand type(s) for -: 'list' and 'list'

However, when I build a list of lists by copying and pasting several rows from the arrays produced by any of the methods above, the code works. This is not feasible in the long run, since the data has over 30k rows. Is there something I am doing wrong in the conversion from pandas df to numpy array? I used

print(type(m)) 

and it reports that m is a numpy array. I already know that I cannot subtract a list from a list, hence the error.

EDIT:
For move_df.head(10).to_dict()

{'movement': {0: [4, 3, 6, 2], 
  1: [5, 2, 3, 6, 2], 
  2: [4, 7, 2, 3, 6, 1], 
  3: [4, 4, 4, 3], 
  4: [3, 6, 2, 3, 3], 
  5: [6, 2, 1], 
  6: [1, 1, 1, 1],
  7: [7, 2, 3, 1, 1],
  8: [7, 2, 3, 2, 1],
  9: [6, 2, 3, 1]}}

CodePudding user response:

Assuming you want to form an array with the lists of length 4:

m = df['movement'].str.len().eq(4)
a = np.array(df.loc[m, 'movement'].to_list())

output:

array([[4, 3, 6, 2],
       [4, 4, 4, 3],
       [1, 1, 1, 1],
       [6, 2, 3, 1]])

used input:

df = pd.DataFrame({'movement': [[4, 3, 6, 2],
                                [5, 2, 3, 6, 2],
                                [4, 7, 2, 3, 6, 1],
                                [4, 4, 4, 3], 
                                [3, 6, 2, 3, 3],
                                [6, 2, 1],
                                [1, 1, 1, 1],
                                [7, 2, 3, 1, 1],
                                [7, 2, 3, 2, 1],
                                [6, 2, 3, 1]]})

CodePudding user response:

A dataframe created with:

In [112]: df = pd.DataFrame({'movement': {0: [4, 3, 6, 2],
     ...:   1: [5, 2, 3, 6, 2],
     ...:   2: [4, 7, 2, 3, 6, 1],
     ...:   3: [4, 4, 4, 3],
     ...:   4: [3, 6, 2, 3, 3],
     ...:   5: [6, 2, 1],
     ...:   6: [1, 1, 1, 1],
     ...:   7: [7, 2, 3, 1, 1],
     ...:   8: [7, 2, 3, 2, 1],
     ...:   9: [6, 2, 3, 1]}})

has an object dtype column that contains lists. The array derived from that column is object dtype:

In [121]: arr = df['movement'].to_numpy()
In [122]: arr
Out[122]: 
array([list([4, 3, 6, 2]), list([5, 2, 3, 6, 2]),
       list([4, 7, 2, 3, 6, 1]), list([4, 4, 4, 3]),
       list([3, 6, 2, 3, 3]), list([6, 2, 1]), list([1, 1, 1, 1]),
       list([7, 2, 3, 1, 1]), list([7, 2, 3, 2, 1]), list([6, 2, 3, 1])],
      dtype=object)

Because I selected the column, I get a 1d array rather than the 2d array you get. Otherwise it's the same.

This cannot be converted into a 2d numeric dtype array. For most purposes we can think of this as a list of lists.

In [123]: arr.tolist()
Out[123]: 
[[4, 3, 6, 2],
 [5, 2, 3, 6, 2],
 [4, 7, 2, 3, 6, 1],
 [4, 4, 4, 3],
 [3, 6, 2, 3, 3],
 [6, 2, 1],
 [1, 1, 1, 1],
 [7, 2, 3, 1, 1],
 [7, 2, 3, 2, 1],
 [6, 2, 3, 1]]
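
To make the point concrete, here is a minimal sketch (using a three-row stand-in for the real data, which is my own choice) showing that both the whole-frame and the single-column conversions give object dtype arrays, and that subtracting two of their elements reproduces the error from the question:

```python
import numpy as np
import pandas as pd

# Small stand-in for the real frame; the lists differ in length on purpose.
df = pd.DataFrame({'movement': [[4, 3, 6, 2], [5, 2, 3, 6, 2], [6, 2, 1]]})

# Converting the whole frame keeps the column axis; selecting the column drops it.
arr2d = df.to_numpy()
arr1d = df['movement'].to_numpy()
print(arr2d.shape, arr2d.dtype)  # (3, 1) object
print(arr1d.shape, arr1d.dtype)  # (3,) object

# Each element is still a plain Python list, so arithmetic between elements fails.
try:
    arr1d[0] - arr1d[1]
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

The array wrapper does not change what the elements are, which is why print(type(m)) says numpy array while the subtraction still sees lists.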

If the lists were all the same length, or if we pick a subset, it is possible to construct a 2d array:

In [125]: arr[[0,3,6,9]]
Out[125]: 
array([list([4, 3, 6, 2]), list([4, 4, 4, 3]), list([1, 1, 1, 1]),
       list([6, 2, 3, 1])], dtype=object)
In [126]: np.stack(arr[[0,3,6,9]])
Out[126]: 
array([[4, 3, 6, 2],
       [4, 4, 4, 3],
       [1, 1, 1, 1],
       [6, 2, 3, 1]])

Padding and slicing could also be used to force the lists to matching lengths - but that could mean losing information.
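
As a rough sketch of both options (again on a three-row stand-in; the variable names padded and truncated are mine): building a DataFrame from the list of lists pads the short rows with NaN, while slicing every list to the shortest length drops the trailing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'movement': [[4, 3, 6, 2], [5, 2, 3, 6, 2], [6, 2, 1]]})

# Padding: a frame built from the list of lists fills short rows with NaN,
# which also turns the values into floats.
padded = pd.DataFrame(df['movement'].tolist()).to_numpy()
print(padded.shape, padded.dtype)  # (3, 5) float64

# Slicing: cut every list to the shortest length, discarding the rest.
min_len = df['movement'].str.len().min()
truncated = np.array([row[:min_len] for row in df['movement']])
print(truncated.shape)  # (3, 3)
```

Whether NaN padding or truncation is acceptable depends on what the sequences mean and on how the downstream function treats NaN, which I have not checked for dtaidistance.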

But without knowing what dtw.distance_matrix expects (looks like it wants a 2d numeric array), or what these lists represent, I can't go further.
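
One thing that may be worth trying, since the pasted list of lists worked in the question: keep the data as a list and convert each row to its own 1d float array. This is a sketch under the assumption that dtw.distance_matrix accepts a list of 1d arrays of differing lengths, as the dtaidistance documentation examples appear to show; I have not run it against the full data. The import is guarded so the conversion part runs even without the library installed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'movement': [[4, 3, 6, 2], [5, 2, 3, 6, 2], [6, 2, 1]]})

# One 1d float array per row; the arrays are allowed to differ in length.
series = [np.asarray(row, dtype=np.double) for row in df['movement']]

try:
    from dtaidistance import dtw
except ImportError:
    dtw = None  # library not installed; skip the distance computation

if dtw is not None:
    # Assumption: a list of 1d arrays is a valid input here.
    d_m = dtw.distance_matrix(series)
    print(d_m)
```

Note that with roughly 33k rows a dense pairwise matrix has over a billion entries (about 9 GB as float64), so it may be worth testing on a subset first.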

The fundamental point is that your dataframe contains lists that vary in length.
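
A quick way to check this on your own frame, sketched here with the ten rows from the edit to the question:

```python
import pandas as pd

df = pd.DataFrame({'movement': [[4, 3, 6, 2],
                                [5, 2, 3, 6, 2],
                                [4, 7, 2, 3, 6, 1],
                                [4, 4, 4, 3],
                                [3, 6, 2, 3, 3],
                                [6, 2, 1],
                                [1, 1, 1, 1],
                                [7, 2, 3, 1, 1],
                                [7, 2, 3, 2, 1],
                                [6, 2, 3, 1]]})

# Length of each list, then how many rows share each length.
lengths = df['movement'].str.len()
length_counts = lengths.value_counts().sort_index()
print(length_counts)

# A plain 2d numeric array is only possible if every row has the same length.
same_length = lengths.nunique() == 1
print(same_length)  # False
```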
