Pandas flatten and merge multiheader dataframe and concat header names-CodePudding

I have a pandas dataframe which is has 2 rows as the headers, the first row being the primary column names (a,b,c), and the second row being the secondary columns names(a1,b1,b2, etc..) :

| a  | b  |    | c  |    |
|----|----|----|----|----|
| a1 | b1 | b2 | b1 | b2 |
|----|----|----|----|----|
|    |    |    |    |    |
|    |    |    |    |    |
|    |    |    |    |    |
|    |    |    |    |    |

Is there a way to flatten such a dataframe into one df with one single header, but combining the strings so that I can see which secondary column names go with which primary column names to get something as this as the desired output:

| a_a1 | b_b1 | b_b2 | c_b1 | c_b2 |
|------|------|------|------|------|
|      |      |      |      |      |
|      |      |      |      |      |
|      |      |      |      |      |

I've managed to use this code to get the single header row:

df.columns = df.columns.map("_".join).str.strip("")

which leaves me with a partial solution:

| a_a1 | b_b1 | _b2 | c_b1 | _b2 |
|------|------|-----|------|-----|
|      |      |     |      |     |
|      |      |     |      |     |
|      |      |     |      |     |

where some of the columns do not have the name from the primary columns in the header (i.e. _b2 instead of b_b2 and c_b2). Any idea how to amend my code to get the final output?

CodePudding user response：

We can create a DataFrame from the columns with Index.to_frame then use mask replace '' with NaN or missing, then use ffill to populate the last valid value forward:

x = df.columns.to_frame(index=False)
x.mask(x.eq('')).ffill()
   0   1
0  a  a1
1  b  b1
2  b  b2  # Empty String filled with b
3  c  b1
4  c  b2

Then we can use MultiIndex.from_frame with Index.map to rebuild the MultiIndex and collapse:

pd.MultiIndex.from_frame(x.mask(x.eq('')).ffill()).map('_'.join)
Index(['a_a1', 'b_b1', 'b_b2', 'c_b1', 'c_b2'], dtype='object')

All together:

import pandas as pd

df = pd.DataFrame(
    [range(5)],
    columns=pd.MultiIndex.from_tuples([
        ('a', 'a1'), ('b', 'b1'), ('', 'b2'), ('c', 'b1'), ('', 'b2')
    ])
)
print('Before:', df, sep=f'\n{"-" * 16}\n')
x = df.columns.to_frame(index=False)
df.columns = pd.MultiIndex.from_frame(x.mask(x.eq('')).ffill()).map('_'.join)
print('After:', df, sep=f'\n{"-" * 31}\n')

Before:
----------------
   a  b     c   
  a1 b1 b2 b1 b2
0  0  1  2  3  4
After:
-------------------------------
   a_a1  b_b1  b_b2  c_b1  c_b2
0     0     1     2     3     4