Home > Software design >  Combine two columns with same column name using pandas
Combine two columns with same column name using pandas

Time:09-23

I have a dataframe (df) where two columns are of different length. I would like to combine these two columns as one column. How do I do that?

The table looks like the following:

Col_1 Col_2 Col_1 Col3
A1 12 A1 345
A2 34 A2 980
A3 098 A3 543
A4 8765
A5 765

I would like to combine Col_1 as one column in the table. The output should look like the following:

Desired Output:

Col_1 Col_2 Col3
A1 12 345
A2 34 980
A3 098 543
A4 8765
A5 765

What I tried so far?

df1 = df.columns[0]
df2 = df.columns[2]
df_merge_col = pd.merge(df1, df2, on='Col_1')

or

df["Col_1"] = df["Col_1"].astype(str)   df["Col_1"]

CodePudding user response:

If duplicated columns names for filtering by position use DataFrame.iloc, then remove NaNs rows by DataFrame.dropna with outer join:

df1 = df.iloc[:, :2].dropna(subset=['Col_1'])
df2 = df.iloc[:, 2:].dropna(subset=['Col_1'])
df_merge_col = pd.merge(df1, df2, on='Col_1', how='outer')
print (df_merge_col)
  Col_1  Col_2  Col3
0    A1   12.0   345
1    A2   34.0   980
2    A3   98.0   543
3    A4    NaN  8765
4    A5    NaN   765

CodePudding user response:

You might use np.where for this task following way:

import numpy as np
import pandas as pd
df = pd.DataFrame([['A1', 'A1'], ['A2', 'A2'], ['A3', None], ['A4', None]])
combined = np.where(df[0].isnull(), df[1], df[0])
df.drop(columns=[0, 1], inplace=True)
df[0] = combined
print(df)

output

    0
0  A1
1  A2
2  A3
3  A4

np.where is like ternary operator for pandas.Series where column 0 has not value get value from column 1 otherwise from column 0

CodePudding user response:

Here is a generic solution using MultiIndex and stack.

In summary, it de-duplicates the columns by adding a unique id, which we then use to stack the data. The rest is only sorting/cleaning.

dup_col = df.columns[df.columns.duplicated()]

df.columns = pd.MultiIndex.from_arrays([df.columns,
                                        df.groupby(level=0, axis=1).cumcount()
                                       ])

(df.stack(level=1)
   .sort_index(level=1)
   .droplevel(1)
   .drop_duplicates(subset=list(dup_col))
)

output:

  Col_1   Col_2  Col_3
0    A1    12.0  345.0
1    A2    34.0  980.0
2    A3    98.0  543.0
3    A4  8765.0    NaN
4    A5   765.0    NaN
  • Related