Create a DataFrame in Pandas with variable number of columns-CodePudding

Good afternoon, Here is my problem: I have a Pandas dataframe that looks like it:

Birth_Year      Name
1964            [X, Y]
1965            [A, B]
1964            [F, G, H]

I need to transform it to the following:

Birth_Year      Name_1          Name_2         Name_3
1964            X               Y
1965            A               B
1964            F               G              H

So I need to get the max length of lists in column 'Name', create the datafame with column for each item in the list, and populate it with the values from the list.

Would really appreciate an advice on how to handle this problem.

I have no good solution for it.

CodePudding user response：

Assuming you have real lists in "Name", you can use:

df = df.join(pd.DataFrame(df.pop('Name').tolist(), index=df.index)
               .fillna('')
               .rename(columns=lambda c: f'Name_{c 1}')
             )

Output:

   Birth_Year Name_1 Name_2 Name_3
0        1964      X      Y       
1        1965      A      B       
2        1964      F      G      H

CodePudding user response：

To get close to your data example:

import pandas as pd

df = pd.DataFrame(index=[1964, 1965, 1964],
                  data={'Name': [['X', 'Y'], ['A', 'B'], ['F', 'G', 'H']]})
df.index.name = 'Birth Year'

Solution:

df = df.Name.str.join(',').str.split(',', expand=True).fillna('')
df = df.rename(columns=lambda c: f'Name_{c 1}')

I found relying on the string methods to make this pretty easy to follow. But I think mozway's answer should be preferred. I believe it to be more efficient and faster.

Output:

           Name_1 Name_2 Name_3
Birth Year                     
1964            X      Y       
1965            A      B       
1964            F      G      H

CodePudding user response：

You can also try this:

df = pd.concat([df['Birth_Year'],
                (df['Name'].apply(lambda x: pd.Series([i for i in x]))
                 .pipe(lambda df2: df2.set_axis([f'Name_{i}' for i in range(1, len(df2.columns) 1)], axis=1)))
                ], axis=1)
df

   Birth_Year Name_1 Name_2 Name_3
0        1964      X      Y    NaN
1        1965      A      B    NaN
2        1964      F      G      H