I have a pandas dataframe which is has 2 rows as the headers, the first row being the primary column names (a,b,c), and the second row being the secondary columns names(a1,b1,b2, etc..) :
| a | b | | c | |
|----|----|----|----|----|
| a1 | b1 | b2 | b1 | b2 |
|----|----|----|----|----|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
Is there a way to flatten such a dataframe into one df with one single header, but combining the strings so that I can see which secondary column names go with which primary column names to get something as this as the desired output:
| a_a1 | b_b1 | b_b2 | c_b1 | c_b2 |
|------|------|------|------|------|
| | | | | |
| | | | | |
| | | | | |
I've managed to use this code to get the single header row:
df.columns = df.columns.map("_".join).str.strip("")
which leaves me with a partial solution:
| a_a1 | b_b1 | _b2 | c_b1 | _b2 |
|------|------|-----|------|-----|
| | | | | |
| | | | | |
| | | | | |
where some of the columns do not have the name from the primary columns in the header (i.e. _b2
instead of b_b2
and c_b2
). Any idea how to amend my code to get the final output?
CodePudding user response:
We can create a DataFrame from the columns with Index.to_frame
then use mask
replace ''
with NaN
or missing, then use ffill
to populate the last valid value forward:
x = df.columns.to_frame(index=False)
x.mask(x.eq('')).ffill()
0 1
0 a a1
1 b b1
2 b b2 # Empty String filled with b
3 c b1
4 c b2
Then we can use MultiIndex.from_frame
with Index.map
to rebuild the MultiIndex and collapse:
pd.MultiIndex.from_frame(x.mask(x.eq('')).ffill()).map('_'.join)
Index(['a_a1', 'b_b1', 'b_b2', 'c_b1', 'c_b2'], dtype='object')
All together:
import pandas as pd
df = pd.DataFrame(
[range(5)],
columns=pd.MultiIndex.from_tuples([
('a', 'a1'), ('b', 'b1'), ('', 'b2'), ('c', 'b1'), ('', 'b2')
])
)
print('Before:', df, sep=f'\n{"-" * 16}\n')
x = df.columns.to_frame(index=False)
df.columns = pd.MultiIndex.from_frame(x.mask(x.eq('')).ffill()).map('_'.join)
print('After:', df, sep=f'\n{"-" * 31}\n')
Before:
----------------
a b c
a1 b1 b2 b1 b2
0 0 1 2 3 4
After:
-------------------------------
a_a1 b_b1 b_b2 c_b1 c_b2
0 0 1 2 3 4