Home > database >  Multidimensional array restructuring like in pandas.stack
Multidimensional array restructuring like in pandas.stack

Time:11-25

Consider the following code to create a dummy dataset

import numpy as np
from scipy.stats import norm
import pandas as pd

np.random.seed(10)

n=3

space= norm(20, 5).rvs(n)
time= norm(10,2).rvs(n)

values = np.kron(space, time).reshape(n,n)   norm(1,1).rvs([n,n])

### Output  
array([[267.39784458, 300.81493866, 229.19163206],
       [236.1940266 , 266.49469945, 204.01294305],
       [122.55912977, 140.00957047, 106.28339745]])

I can put these data in a pandas dataframe using

space_names = ['A','B','C']
time_names = [2000,2001,2002]

df = pd.DataFrame(values, index=space_names, columns=time_names)
df  

### Output

    2000        2001        2002
A   267.397845  300.814939  229.191632
B   236.194027  266.494699  204.012943
C   122.559130  140.009570  106.283397

This is considered a wide dataset, where each observation lies in a table with 2 variable that acts as coordinates to identify it.
To make it a long-tidy dataset we can suse the .stack method of pandas dataframe

df.columns.name = 'time'
df.index.name = 'space'

df.stack().rename('value').reset_index()  
### Output  
    space   time    value
0   A      2000    267.397845
1   A      2001    300.814939
2   A      2002    229.191632
3   B      2000    236.194027
4   B      2001    266.494699
5   B      2002    204.012943
6   C      2000    122.559130
7   C      2001    140.009570
8   C      2002    106.283397

My question is: how do I do exactly this thing but for a 3-dimensional dataset?
Let's imagine I have 2 observation for each space-time couple

s = 3
t = 4
r = 2

space_mus = norm(20, 5).rvs(s)
time_mus = norm(10,2).rvs(t)

values = np.kron(space_mus, time_mus)
values = values.repeat(r).reshape(s,t,r)   norm(0,1).rvs([s,t,r])
values

### Output 
array([[[286.50322099, 288.51266345],
        [176.64303485, 175.38175877],
        [136.01675917, 134.44328617]],

       [[187.07608546, 185.4068411 ],
        [112.86398438, 111.983463  ],
        [ 85.99035255,  86.67236986]],

       [[267.66833894, 269.45295404],
        [162.30044715, 162.50564386],
        [124.6374401 , 126.2315447 ]]])

How can I obtain the same structure for the dataframe as above?

Ugly solution

Personally i don't like this solution, and i think one might do it in a more elegant and pythonic way, but still might be useful for someone else so I will post my solution.

labels = ['{}{}{}'.format(i,j,k) for i in range(s) for j in range(t) for k in range(r)] #space, time, repetition

def flatten3d(k):
    return [i for l in k for s in l for i in s]

value_series = pd.Series(flatten3d(values)).rename('y')

split_labels= [[i for i in l] for l in labels]
df = pd.DataFrame(split_labels, columns=['s','t','r'])

pd.concat([df, value_series], axis=1)

### Output  
    s   t   r   y
0   0   0   0   266.2408815208753
1   0   0   1   266.13662442609433
2   0   1   0   299.53178992512954
3   0   1   1   300.13941632567605
4   0   2   0   229.39037800681405
5   0   2   1   227.22227496248507
6   0   3   0   281.76357915411995
7   0   3   1   280.9639352062619
8   1   0   0   235.8137644198259
9   1   0   1   234.23202459516452
10  1   1   0   265.19681013560034
11  1   1   1   266.5462102589883
12  1   2   0   200.730100791878
13  1   2   1   199.83217739700535
14  1   3   0   246.54018839875374
15  1   3   1   248.5496308586532
16  2   0   0   124.90916276929234
17  2   0   1   123.64788669199066
18  2   1   0   139.65391860786775
19  2   1   1   138.08044561039517
20  2   2   0   106.45276370157518
21  2   2   1   104.78351933651582
22  2   3   0   129.86043618610572
23  2   3   1   128.97991481257253

CodePudding user response:

This does not use stack, but maybe it is acceptable for your problem:

import numpy as np
import pandas as pd

values = np.arange(18).reshape(3, 3, 2) # Your values here
index  = pd.MultiIndex.from_product([space_names, space_names, time_names], names=["space1", "space2", "time"])

df = pd.DataFrame({"value": values.ravel()}, index=index).reset_index()

# df:
#    space1 space2  time  value
# 0       A      A  2000      0
# 1       A      A  2001      1
# 2       A      B  2000      2
# 3       A      B  2001      3
# 4       A      C  2000      4
# 5       A      C  2001      5
# 6       B      A  2000      6
# 7       B      A  2001      7
# 8       B      B  2000      8
# 9       B      B  2001      9
# 10      B      C  2000     10
# 11      B      C  2001     11
# 12      C      A  2000     12
# 13      C      A  2001     13
# 14      C      B  2000     14
# 15      C      B  2001     15
# 16      C      C  2000     16
# 17      C      C  2001     17
  • Related