Aggregate multiple columns in a dataframe onto multiple columns using pandas.wide_to

dataframe = pd.DataFrame(
    {
    'name': ['Ruben Smith','Hannah Smith','Anton Smith','Momen Smith'],
    'born_on': ['2020-03-05','2001-09-05','2000-11-05','1964-01-18'],
    'estimated_at': ['2017-01-01', '2019-01-01', '  2020-01-01', '2022-01-01'],
    'weight_before 1': [4994,1111,8788,2020],
    'class 1': [1,3,6,4],
    'gender 1': ['F','F','M','M'],
    'weight_before 2': [4444,None,8777,None],
    'class 2': [2,None,11,None],
    'gender 2': ['X',None,'B',None],
    'weight_before 3': [None,None,None,8488],
    'class 3': [None,None,None,22],
    'gender 3': [None,None,None,'X']
    }
)

using this command I was able to get the data into this format as desired: pd.wide_to_long(df,['weight_before','class','gender'],['name','born_on', 'estimated_at'],'v',' ').dropna().droplevel(-1).reset_index()

           name    born_on estimated_at weight_before class gender   
0   Ruben Smith 2020-03-05   2017-01-01          4994     1     F
1   Ruben Smith 2020-03-05   2017-01-01          4444     2     X
2  Hannah Smith 2001-09-05   2019-01-01          1111     3     F
3   Anton Smith 2000-11-05   2020-01-01          8788     6     M
4   Anton Smith 2000-11-05   2020-01-01          8777    11     B
5   Momen Smith 1964-01-18   2022-01-01          2020     4     M  
6   Momen Smith 1964-01-18   2022-01-01          8488    22     X

but is it possible to use wide_to_long() if there were to be multiple instances of the dataframe above? like so:

pd.DataFrame(
    {
   'project 1 name': ['Ruben Smith','Hannah Smith','Anton Smith','Momen Smith'],
   'project 1 born_on': ['2020-03-05','2001-09-05','2000-11-05','1964-01-18'],
   'project 1 estimated_at': ['2017-01-01','2019-01-01','2020-01-01','2022-01-01'],
   'project 1 weight_before 1': [4994,1111,8788,2020],
   'project 1 class 1': [1,3,6,4],
   'project 1 gender 1': ['F','F','M','M'],
   'project 1 weight_before 2': [4444,None,8777,None],
   'project 1 class 2': [2,None,11,None],
   'project 1 gender 2': ['X',None,'B',None],
   'project 1 weight_before 3': [None,None,None,8488],
   'project 1 class 3': [None,None,None,22],
   'project 1 gender 3': [None,None,None,'X'],
   'project 2 name': ['Ruben Smith','Hannah Smith','Anton Smith','Momen Smith'],
   'project 2 born_on': ['2020-03-05','2001-09-05','2000-11-05','1964-01-18'],
   'project 2 estimated_at': ['2017-01-01','2019-01-01','2020-01-01','2022-01-01'],
   'project 2 weight_before 1': [4994,1111,8788,2020],
   'project 2 class 1': [1,3,6,4],
   'project 2 gender 1': ['F','F','M','M'],
   'project 2 weight_before 2': [4444,None,8777,None],
   'project 2 class 2': [2,None,11,None],
   'project 2 gender 2': ['X',None,'B',None],
   'project 2 weight_before 3': [None,None,None,8488],
   'project 2 class 3': [None,None,None,22],
   'project 2 gender 3': [None,None,None,'X']
    }
)

How would you proceed here? Should I split the dataset and apply my function to each of them to finally melt them back together?

CodePudding user response：

What do you think about this? split your columns to get a multiindex. make column level0 (=project 1 &2) a column of your df. continue with your line of code (almost the same) to get the same result except twice the result for project 1&2 along axis=0

df.columns = df.columns.str.split(r"(?<=\w \d)\s(?=\w)", expand=True)
#Multiindex as e.g: ('project 1', 'name'), ('project 2', 'class 1') and so on

# unstack your df and make project1&2 a column of the df
tmp = (df.unstack()
       .unstack(level=1)
       .reset_index(level=1, drop=True)
       .rename_axis('project')
       .reset_index())
print(tmp)

     project     born_on class 1 class 2 class 3 estimated_at gender 1 gender 2 gender 3          name weight_before 1 weight_before 2 weight_before 3
0  project 1  2020-03-05       1     2.0     NaN   2017-01-01        F        X     None   Ruben Smith            4994          4444.0             NaN
1  project 1  2001-09-05       3     NaN     NaN   2019-01-01        F     None     None  Hannah Smith            1111             NaN             NaN
2  project 1  2000-11-05       6    11.0     NaN   2020-01-01        M        B     None   Anton Smith            8788          8777.0             NaN
3  project 1  1964-01-18       4     NaN    22.0   2022-01-01        M     None        X   Momen Smith            2020             NaN          8488.0
4  project 2  2020-03-05       1     2.0     NaN   2017-01-01        F        X     None   Ruben Smith            4994          4444.0             NaN
5  project 2  2001-09-05       3     NaN     NaN   2019-01-01        F     None     None  Hannah Smith            1111             NaN             NaN
6  project 2  2000-11-05       6    11.0     NaN   2020-01-01        M        B     None   Anton Smith            8788          8777.0             NaN
7  project 2  1964-01-18       4     NaN    22.0   2022-01-01        M     None        X   Momen Smith            2020             NaN          8488.0

then add your code to finish it: only difference is you have now one more column with 'project' which has to be added to i in pd.wide_to_long

res = (pd.wide_to_long(res
                       ,stubnames=['weight_before','class','gender']
                       ,i=['name','born_on', 'estimated_at','project']
                       ,j='v'
                       ,sep=' ')
       .dropna()
       .droplevel(-1)
       .reset_index())

print(res)

            name     born_on estimated_at    project weight_before class gender
0    Ruben Smith  2020-03-05   2017-01-01  project 1          4994     1      F
1    Ruben Smith  2020-03-05   2017-01-01  project 1        4444.0   2.0      X
2   Hannah Smith  2001-09-05   2019-01-01  project 1          1111     3      F
3    Anton Smith  2000-11-05   2020-01-01  project 1          8788     6      M
4    Anton Smith  2000-11-05   2020-01-01  project 1        8777.0  11.0      B
5    Momen Smith  1964-01-18   2022-01-01  project 1          2020     4      M
6    Momen Smith  1964-01-18   2022-01-01  project 1        8488.0  22.0      X
7    Ruben Smith  2020-03-05   2017-01-01  project 2          4994     1      F
8    Ruben Smith  2020-03-05   2017-01-01  project 2        4444.0   2.0      X
9   Hannah Smith  2001-09-05   2019-01-01  project 2          1111     3      F
10   Anton Smith  2000-11-05   2020-01-01  project 2          8788     6      M
11   Anton Smith  2000-11-05   2020-01-01  project 2        8777.0  11.0      B
12   Momen Smith  1964-01-18   2022-01-01  project 2          2020     4      M
13   Momen Smith  1964-01-18   2022-01-01  project 2        8488.0  22.0      X