Home > OS >  Aggregate multiple columns in a dataframe onto multiple columns using pandas.wide_to_long()
Aggregate multiple columns in a dataframe onto multiple columns using pandas.wide_to_long()

Time:05-10

dataframe = pd.DataFrame(
    {
    'name': ['Ruben Smith','Hannah Smith','Anton Smith','Momen Smith'],
    'born_on': ['2020-03-05','2001-09-05','2000-11-05','1964-01-18'],
    'estimated_at': ['2017-01-01', '2019-01-01', '  2020-01-01', '2022-01-01'],
    'weight_before 1': [4994,1111,8788,2020],
    'class 1': [1,3,6,4],
    'gender 1': ['F','F','M','M'],
    'weight_before 2': [4444,None,8777,None],
    'class 2': [2,None,11,None],
    'gender 2': ['X',None,'B',None],
    'weight_before 3': [None,None,None,8488],
    'class 3': [None,None,None,22],
    'gender 3': [None,None,None,'X']
    }
)

using this command I was able to get the data into this format as desired: pd.wide_to_long(df,['weight_before','class','gender'],['name','born_on', 'estimated_at'],'v',' ').dropna().droplevel(-1).reset_index()

           name    born_on estimated_at weight_before class gender   
0   Ruben Smith 2020-03-05   2017-01-01          4994     1     F
1   Ruben Smith 2020-03-05   2017-01-01          4444     2     X
2  Hannah Smith 2001-09-05   2019-01-01          1111     3     F
3   Anton Smith 2000-11-05   2020-01-01          8788     6     M
4   Anton Smith 2000-11-05   2020-01-01          8777    11     B
5   Momen Smith 1964-01-18   2022-01-01          2020     4     M  
6   Momen Smith 1964-01-18   2022-01-01          8488    22     X 

but is it possible to use wide_to_long() if there were to be multiple instances of the dataframe above? like so:

pd.DataFrame(
    {
   'project 1 name': ['Ruben Smith','Hannah Smith','Anton Smith','Momen Smith'],
   'project 1 born_on': ['2020-03-05','2001-09-05','2000-11-05','1964-01-18'],
   'project 1 estimated_at': ['2017-01-01','2019-01-01','2020-01-01','2022-01-01'],
   'project 1 weight_before 1': [4994,1111,8788,2020],
   'project 1 class 1': [1,3,6,4],
   'project 1 gender 1': ['F','F','M','M'],
   'project 1 weight_before 2': [4444,None,8777,None],
   'project 1 class 2': [2,None,11,None],
   'project 1 gender 2': ['X',None,'B',None],
   'project 1 weight_before 3': [None,None,None,8488],
   'project 1 class 3': [None,None,None,22],
   'project 1 gender 3': [None,None,None,'X'],
   'project 2 name': ['Ruben Smith','Hannah Smith','Anton Smith','Momen Smith'],
   'project 2 born_on': ['2020-03-05','2001-09-05','2000-11-05','1964-01-18'],
   'project 2 estimated_at': ['2017-01-01','2019-01-01','2020-01-01','2022-01-01'],
   'project 2 weight_before 1': [4994,1111,8788,2020],
   'project 2 class 1': [1,3,6,4],
   'project 2 gender 1': ['F','F','M','M'],
   'project 2 weight_before 2': [4444,None,8777,None],
   'project 2 class 2': [2,None,11,None],
   'project 2 gender 2': ['X',None,'B',None],
   'project 2 weight_before 3': [None,None,None,8488],
   'project 2 class 3': [None,None,None,22],
   'project 2 gender 3': [None,None,None,'X']
    }
)

How would you proceed here? Should I split the dataset and apply my function to each of them to finally melt them back together?

CodePudding user response:

What do you think about this? split your columns to get a multiindex. make column level0 (=project 1 &2) a column of your df. continue with your line of code (almost the same) to get the same result except twice the result for project 1&2 along axis=0

df.columns = df.columns.str.split(r"(?<=\w \d)\s(?=\w)", expand=True)
#Multiindex as e.g: ('project 1', 'name'), ('project 2', 'class 1') and so on

# unstack your df and make project1&2 a column of the df
tmp = (df.unstack()
       .unstack(level=1)
       .reset_index(level=1, drop=True)
       .rename_axis('project')
       .reset_index())
print(tmp)

     project     born_on class 1 class 2 class 3 estimated_at gender 1 gender 2 gender 3          name weight_before 1 weight_before 2 weight_before 3
0  project 1  2020-03-05       1     2.0     NaN   2017-01-01        F        X     None   Ruben Smith            4994          4444.0             NaN
1  project 1  2001-09-05       3     NaN     NaN   2019-01-01        F     None     None  Hannah Smith            1111             NaN             NaN
2  project 1  2000-11-05       6    11.0     NaN   2020-01-01        M        B     None   Anton Smith            8788          8777.0             NaN
3  project 1  1964-01-18       4     NaN    22.0   2022-01-01        M     None        X   Momen Smith            2020             NaN          8488.0
4  project 2  2020-03-05       1     2.0     NaN   2017-01-01        F        X     None   Ruben Smith            4994          4444.0             NaN
5  project 2  2001-09-05       3     NaN     NaN   2019-01-01        F     None     None  Hannah Smith            1111             NaN             NaN
6  project 2  2000-11-05       6    11.0     NaN   2020-01-01        M        B     None   Anton Smith            8788          8777.0             NaN
7  project 2  1964-01-18       4     NaN    22.0   2022-01-01        M     None        X   Momen Smith            2020             NaN          8488.0

then add your code to finish it: only difference is you have now one more column with 'project' which has to be added to i in pd.wide_to_long

res = (pd.wide_to_long(res
                       ,stubnames=['weight_before','class','gender']
                       ,i=['name','born_on', 'estimated_at','project']
                       ,j='v'
                       ,sep=' ')
       .dropna()
       .droplevel(-1)
       .reset_index())

print(res)

            name     born_on estimated_at    project weight_before class gender
0    Ruben Smith  2020-03-05   2017-01-01  project 1          4994     1      F
1    Ruben Smith  2020-03-05   2017-01-01  project 1        4444.0   2.0      X
2   Hannah Smith  2001-09-05   2019-01-01  project 1          1111     3      F
3    Anton Smith  2000-11-05   2020-01-01  project 1          8788     6      M
4    Anton Smith  2000-11-05   2020-01-01  project 1        8777.0  11.0      B
5    Momen Smith  1964-01-18   2022-01-01  project 1          2020     4      M
6    Momen Smith  1964-01-18   2022-01-01  project 1        8488.0  22.0      X
7    Ruben Smith  2020-03-05   2017-01-01  project 2          4994     1      F
8    Ruben Smith  2020-03-05   2017-01-01  project 2        4444.0   2.0      X
9   Hannah Smith  2001-09-05   2019-01-01  project 2          1111     3      F
10   Anton Smith  2000-11-05   2020-01-01  project 2          8788     6      M
11   Anton Smith  2000-11-05   2020-01-01  project 2        8777.0  11.0      B
12   Momen Smith  1964-01-18   2022-01-01  project 2          2020     4      M
13   Momen Smith  1964-01-18   2022-01-01  project 2        8488.0  22.0      X

  • Related