dataframe = pd.DataFrame(
{
'name': ['Ruben Smith','Hannah Smith','Anton Smith','Momen Smith'],
'born_on': ['2020-03-05','2001-09-05','2000-11-05','1964-01-18'],
'estimated_at': ['2017-01-01', '2019-01-01', ' 2020-01-01', '2022-01-01'],
'weight_before 1': [4994,1111,8788,2020],
'class 1': [1,3,6,4],
'gender 1': ['F','F','M','M'],
'weight_before 2': [4444,None,8777,None],
'class 2': [2,None,11,None],
'gender 2': ['X',None,'B',None],
'weight_before 3': [None,None,None,8488],
'class 3': [None,None,None,22],
'gender 3': [None,None,None,'X']
}
)
using this command I was able to get the data into this format as desired:
pd.wide_to_long(df,['weight_before','class','gender'],['name','born_on', 'estimated_at'],'v',' ').dropna().droplevel(-1).reset_index()
name born_on estimated_at weight_before class gender
0 Ruben Smith 2020-03-05 2017-01-01 4994 1 F
1 Ruben Smith 2020-03-05 2017-01-01 4444 2 X
2 Hannah Smith 2001-09-05 2019-01-01 1111 3 F
3 Anton Smith 2000-11-05 2020-01-01 8788 6 M
4 Anton Smith 2000-11-05 2020-01-01 8777 11 B
5 Momen Smith 1964-01-18 2022-01-01 2020 4 M
6 Momen Smith 1964-01-18 2022-01-01 8488 22 X
but is it possible to use wide_to_long() if there were to be multiple instances of the dataframe above? like so:
pd.DataFrame(
{
'project 1 name': ['Ruben Smith','Hannah Smith','Anton Smith','Momen Smith'],
'project 1 born_on': ['2020-03-05','2001-09-05','2000-11-05','1964-01-18'],
'project 1 estimated_at': ['2017-01-01','2019-01-01','2020-01-01','2022-01-01'],
'project 1 weight_before 1': [4994,1111,8788,2020],
'project 1 class 1': [1,3,6,4],
'project 1 gender 1': ['F','F','M','M'],
'project 1 weight_before 2': [4444,None,8777,None],
'project 1 class 2': [2,None,11,None],
'project 1 gender 2': ['X',None,'B',None],
'project 1 weight_before 3': [None,None,None,8488],
'project 1 class 3': [None,None,None,22],
'project 1 gender 3': [None,None,None,'X'],
'project 2 name': ['Ruben Smith','Hannah Smith','Anton Smith','Momen Smith'],
'project 2 born_on': ['2020-03-05','2001-09-05','2000-11-05','1964-01-18'],
'project 2 estimated_at': ['2017-01-01','2019-01-01','2020-01-01','2022-01-01'],
'project 2 weight_before 1': [4994,1111,8788,2020],
'project 2 class 1': [1,3,6,4],
'project 2 gender 1': ['F','F','M','M'],
'project 2 weight_before 2': [4444,None,8777,None],
'project 2 class 2': [2,None,11,None],
'project 2 gender 2': ['X',None,'B',None],
'project 2 weight_before 3': [None,None,None,8488],
'project 2 class 3': [None,None,None,22],
'project 2 gender 3': [None,None,None,'X']
}
)
How would you proceed here? Should I split the dataset and apply my function to each of them to finally melt them back together?
CodePudding user response:
What do you think about this? split your columns to get a multiindex. make column level0 (=project 1 &2) a column of your df. continue with your line of code (almost the same) to get the same result except twice the result for project 1&2 along axis=0
df.columns = df.columns.str.split(r"(?<=\w \d)\s(?=\w)", expand=True)
#Multiindex as e.g: ('project 1', 'name'), ('project 2', 'class 1') and so on
# unstack your df and make project1&2 a column of the df
tmp = (df.unstack()
.unstack(level=1)
.reset_index(level=1, drop=True)
.rename_axis('project')
.reset_index())
print(tmp)
project born_on class 1 class 2 class 3 estimated_at gender 1 gender 2 gender 3 name weight_before 1 weight_before 2 weight_before 3
0 project 1 2020-03-05 1 2.0 NaN 2017-01-01 F X None Ruben Smith 4994 4444.0 NaN
1 project 1 2001-09-05 3 NaN NaN 2019-01-01 F None None Hannah Smith 1111 NaN NaN
2 project 1 2000-11-05 6 11.0 NaN 2020-01-01 M B None Anton Smith 8788 8777.0 NaN
3 project 1 1964-01-18 4 NaN 22.0 2022-01-01 M None X Momen Smith 2020 NaN 8488.0
4 project 2 2020-03-05 1 2.0 NaN 2017-01-01 F X None Ruben Smith 4994 4444.0 NaN
5 project 2 2001-09-05 3 NaN NaN 2019-01-01 F None None Hannah Smith 1111 NaN NaN
6 project 2 2000-11-05 6 11.0 NaN 2020-01-01 M B None Anton Smith 8788 8777.0 NaN
7 project 2 1964-01-18 4 NaN 22.0 2022-01-01 M None X Momen Smith 2020 NaN 8488.0
then add your code to finish it:
only difference is you have now one more column with 'project' which has to be added to i
in pd.wide_to_long
res = (pd.wide_to_long(res
,stubnames=['weight_before','class','gender']
,i=['name','born_on', 'estimated_at','project']
,j='v'
,sep=' ')
.dropna()
.droplevel(-1)
.reset_index())
print(res)
name born_on estimated_at project weight_before class gender
0 Ruben Smith 2020-03-05 2017-01-01 project 1 4994 1 F
1 Ruben Smith 2020-03-05 2017-01-01 project 1 4444.0 2.0 X
2 Hannah Smith 2001-09-05 2019-01-01 project 1 1111 3 F
3 Anton Smith 2000-11-05 2020-01-01 project 1 8788 6 M
4 Anton Smith 2000-11-05 2020-01-01 project 1 8777.0 11.0 B
5 Momen Smith 1964-01-18 2022-01-01 project 1 2020 4 M
6 Momen Smith 1964-01-18 2022-01-01 project 1 8488.0 22.0 X
7 Ruben Smith 2020-03-05 2017-01-01 project 2 4994 1 F
8 Ruben Smith 2020-03-05 2017-01-01 project 2 4444.0 2.0 X
9 Hannah Smith 2001-09-05 2019-01-01 project 2 1111 3 F
10 Anton Smith 2000-11-05 2020-01-01 project 2 8788 6 M
11 Anton Smith 2000-11-05 2020-01-01 project 2 8777.0 11.0 B
12 Momen Smith 1964-01-18 2022-01-01 project 2 2020 4 M
13 Momen Smith 1964-01-18 2022-01-01 project 2 8488.0 22.0 X