Combine two columns with same column name using pandas-CodePudding

I have a dataframe (df) where two columns are of different length. I would like to combine these two columns as one column. How do I do that?

The table looks like the following:

Col_1	Col_2	Col_1	Col3
A1	12	A1	345
A2	34	A2	980
A3	098	A3	543
		A4	8765
		A5	765

I would like to combine Col_1 as one column in the table. The output should look like the following:

Desired Output:

Col_1	Col_2	Col3
A1	12	345
A2	34	980
A3	098	543
A4		8765
A5		765

What I tried so far?

df1 = df.columns[0]
df2 = df.columns[2]
df_merge_col = pd.merge(df1, df2, on='Col_1')

df["Col_1"] = df["Col_1"].astype(str)   df["Col_1"]

CodePudding user response：

If duplicated columns names for filtering by position use DataFrame.iloc, then remove NaNs rows by DataFrame.dropna with outer join:

df1 = df.iloc[:, :2].dropna(subset=['Col_1'])
df2 = df.iloc[:, 2:].dropna(subset=['Col_1'])
df_merge_col = pd.merge(df1, df2, on='Col_1', how='outer')
print (df_merge_col)
  Col_1  Col_2  Col3
0    A1   12.0   345
1    A2   34.0   980
2    A3   98.0   543
3    A4    NaN  8765
4    A5    NaN   765

CodePudding user response：

You might use np.where for this task following way:

import numpy as np
import pandas as pd
df = pd.DataFrame([['A1', 'A1'], ['A2', 'A2'], ['A3', None], ['A4', None]])
combined = np.where(df[0].isnull(), df[1], df[0])
df.drop(columns=[0, 1], inplace=True)
df[0] = combined
print(df)

output

np.where is like ternary operator for pandas.Series where column 0 has not value get value from column 1 otherwise from column 0

CodePudding user response：

Here is a generic solution using MultiIndex and stack.

In summary, it de-duplicates the columns by adding a unique id, which we then use to stack the data. The rest is only sorting/cleaning.

dup_col = df.columns[df.columns.duplicated()]

df.columns = pd.MultiIndex.from_arrays([df.columns,
                                        df.groupby(level=0, axis=1).cumcount()
                                       ])

(df.stack(level=1)
   .sort_index(level=1)
   .droplevel(1)
   .drop_duplicates(subset=list(dup_col))
)

output:

  Col_1   Col_2  Col_3
0    A1    12.0  345.0
1    A2    34.0  980.0
2    A3    98.0  543.0
3    A4  8765.0    NaN
4    A5   765.0    NaN