Consider the following code to create a dummy dataset
import numpy as np
from scipy.stats import norm
import pandas as pd
np.random.seed(10)
n=3
space= norm(20, 5).rvs(n)
time= norm(10,2).rvs(n)
values = np.kron(space, time).reshape(n,n) norm(1,1).rvs([n,n])
### Output
array([[267.39784458, 300.81493866, 229.19163206],
[236.1940266 , 266.49469945, 204.01294305],
[122.55912977, 140.00957047, 106.28339745]])
I can put these data in a pandas dataframe using
space_names = ['A','B','C']
time_names = [2000,2001,2002]
df = pd.DataFrame(values, index=space_names, columns=time_names)
df
### Output
2000 2001 2002
A 267.397845 300.814939 229.191632
B 236.194027 266.494699 204.012943
C 122.559130 140.009570 106.283397
This is considered a wide dataset, where each observation lies in a table with 2 variable that acts as coordinates to identify it.
To make it a long-tidy dataset we can suse the .stack method of pandas dataframe
df.columns.name = 'time'
df.index.name = 'space'
df.stack().rename('value').reset_index()
### Output
space time value
0 A 2000 267.397845
1 A 2001 300.814939
2 A 2002 229.191632
3 B 2000 236.194027
4 B 2001 266.494699
5 B 2002 204.012943
6 C 2000 122.559130
7 C 2001 140.009570
8 C 2002 106.283397
My question is: how do I do exactly this thing but for a 3-dimensional dataset?
Let's imagine I have 2 observation for each space-time couple
s = 3
t = 4
r = 2
space_mus = norm(20, 5).rvs(s)
time_mus = norm(10,2).rvs(t)
values = np.kron(space_mus, time_mus)
values = values.repeat(r).reshape(s,t,r) norm(0,1).rvs([s,t,r])
values
### Output
array([[[286.50322099, 288.51266345],
[176.64303485, 175.38175877],
[136.01675917, 134.44328617]],
[[187.07608546, 185.4068411 ],
[112.86398438, 111.983463 ],
[ 85.99035255, 86.67236986]],
[[267.66833894, 269.45295404],
[162.30044715, 162.50564386],
[124.6374401 , 126.2315447 ]]])
How can I obtain the same structure for the dataframe as above?
Ugly solution
Personally i don't like this solution, and i think one might do it in a more elegant and pythonic way, but still might be useful for someone else so I will post my solution.
labels = ['{}{}{}'.format(i,j,k) for i in range(s) for j in range(t) for k in range(r)] #space, time, repetition
def flatten3d(k):
return [i for l in k for s in l for i in s]
value_series = pd.Series(flatten3d(values)).rename('y')
split_labels= [[i for i in l] for l in labels]
df = pd.DataFrame(split_labels, columns=['s','t','r'])
pd.concat([df, value_series], axis=1)
### Output
s t r y
0 0 0 0 266.2408815208753
1 0 0 1 266.13662442609433
2 0 1 0 299.53178992512954
3 0 1 1 300.13941632567605
4 0 2 0 229.39037800681405
5 0 2 1 227.22227496248507
6 0 3 0 281.76357915411995
7 0 3 1 280.9639352062619
8 1 0 0 235.8137644198259
9 1 0 1 234.23202459516452
10 1 1 0 265.19681013560034
11 1 1 1 266.5462102589883
12 1 2 0 200.730100791878
13 1 2 1 199.83217739700535
14 1 3 0 246.54018839875374
15 1 3 1 248.5496308586532
16 2 0 0 124.90916276929234
17 2 0 1 123.64788669199066
18 2 1 0 139.65391860786775
19 2 1 1 138.08044561039517
20 2 2 0 106.45276370157518
21 2 2 1 104.78351933651582
22 2 3 0 129.86043618610572
23 2 3 1 128.97991481257253
CodePudding user response:
This does not use stack
, but maybe it is acceptable for your problem:
import numpy as np
import pandas as pd
values = np.arange(18).reshape(3, 3, 2) # Your values here
index = pd.MultiIndex.from_product([space_names, space_names, time_names], names=["space1", "space2", "time"])
df = pd.DataFrame({"value": values.ravel()}, index=index).reset_index()
# df:
# space1 space2 time value
# 0 A A 2000 0
# 1 A A 2001 1
# 2 A B 2000 2
# 3 A B 2001 3
# 4 A C 2000 4
# 5 A C 2001 5
# 6 B A 2000 6
# 7 B A 2001 7
# 8 B B 2000 8
# 9 B B 2001 9
# 10 B C 2000 10
# 11 B C 2001 11
# 12 C A 2000 12
# 13 C A 2001 13
# 14 C B 2000 14
# 15 C B 2001 15
# 16 C C 2000 16
# 17 C C 2001 17