I'm trying to clean up a dataframe by merging the columns on a multi-index so all values in columns that belong to the same first-level index appear in one column.
From This:
I was doing it manually by defining each column and joining them like this:
df['Subjects'] = df['Which of the following subjects are you taking this semester?'].apply(lambda x: '|'.join(x.dropna()), axis = 1)
df.drop('Which of the following subjects are you taking this semester?', axis = 1, level = 0, inplace = True)
The problem is I have a large dataframe with many more columns then this, so I was wondering if there is a way to do this dynamically for all columns instead of copying this code and defining each column individually?
data = {('Name', ''): {0: 'Jane',
1: 'John',
2: 'Lisa',
3: 'Michael'},
('Location', ''): {0: 'Houston', 1: 'LA', 2: 'LA', 3:
'Dallas'},
('Which of the following subjects are you taking this
semester?', 'Math'): {0: 'Math',
1: 'Math',
2: np.nan,
3: 'Math'},
('Which of the following subjects are you taking this
semester?', 'Science'): {0: 'Science',
1: np.nan,
2: np.nan,
3: 'Science'},
('Which of the following subjects are you taking this
semester?', 'Art'): {0: np.nan,
1: 'Art',
2: 'Art',
3: np.nan},
('Which of the following electronic devices do you own?',
'Laptop'): {0: 'Laptop',
1: 'Laptop',
2: 'Laptop',
3: 'Laptop'},
('Which of the following electronic devices do you own?',
'Phone'): {0: 'Phone',
1: 'Phone',
2: 'Phone',
3: 'Phone'},
('Which of the following electronic devices do you own?',
'TV'): {0: np.nan,
1: 'TV',
2: np.nan,
3: np.nan},
('Which of the following electronic devices do you own?',
'Tablet'): {0: 'Tablet',
1: np.nan,
2: 'Tablet',
3: np.nan},
('Age', ''): {0: 24, 1: 20, 2: 19, 3: 29},
('Which Social Media Platforms Do You Use?', 'Instagram'):
{0: np.nan,
1: 'Instagram',
2: 'Instagram',
3: 'Instagram'},
('Which Social Media Platforms Do You Use?', 'Facebook'):
{0: 'Facebook',
1: 'Facebook',
2: np.nan,
3: np.nan},
('Which Social Media Platforms Do You Use?', 'Tik Tok'):
{0: np.nan,
1: 'Tik Tok',
2: 'Tik Tok',
3: np.nan},
('Which Social Media Platforms Do You Use?', 'LinkedIn'):
{0: 'LinkedIn',
1: 'LinkedIn',
2: np.nan,
3: np.nan}
}
CodePudding user response:
You can try this:
df.T.groupby(level=0).agg(list).T
CodePudding user response:
You can use melt
as starting point to flatten your dataframe, filter out nan values then pivot_table
to reshape your dataframe:
pat = r'(subjects|electronic devices|Social Media Platforms)'
cols = ['Name', 'Location', 'Age']
out = df.droplevel(1, axis=1).melt(cols, ignore_index=False).query('value.notna()')
out['variable'] = out['variable'].str.extract(pat, expand=False).str.title()
out = out.reset_index().pivot_table('value', ['index'] cols, 'variable', aggfunc='|'.join) \
.reset_index(cols).rename_axis(index=None, columns=None)
Output:
>>> out
Name Location Age Electronic Devices Social Media Platforms Subjects
0 Jane Houston 24 Laptop|Phone|Tablet Facebook|LinkedIn Math|Science
1 John LA 20 Laptop|Phone|TV Instagram|Facebook|Tik Tok|LinkedIn Math|Art
2 Lisa LA 19 Laptop|Phone|Tablet Instagram|Tik Tok Art
3 Michael Dallas 29 Laptop|Phone Instagram Math|Science