I have a dataframe (df) where two columns are of different length. I would like to combine these two columns as one column. How do I do that?
The table looks like the following:
Col_1 | Col_2 | Col_1 | Col3 |
---|---|---|---|
A1 | 12 | A1 | 345 |
A2 | 34 | A2 | 980 |
A3 | 098 | A3 | 543 |
A4 | 8765 | ||
A5 | 765 |
I would like to combine Col_1 as one column in the table. The output should look like the following:
Desired Output:
Col_1 | Col_2 | Col3 |
---|---|---|
A1 | 12 | 345 |
A2 | 34 | 980 |
A3 | 098 | 543 |
A4 | 8765 | |
A5 | 765 |
What I tried so far?
df1 = df.columns[0]
df2 = df.columns[2]
df_merge_col = pd.merge(df1, df2, on='Col_1')
or
df["Col_1"] = df["Col_1"].astype(str) df["Col_1"]
CodePudding user response:
If duplicated columns names for filtering by position use DataFrame.iloc
, then remove NaNs rows by DataFrame.dropna
with outer join:
df1 = df.iloc[:, :2].dropna(subset=['Col_1'])
df2 = df.iloc[:, 2:].dropna(subset=['Col_1'])
df_merge_col = pd.merge(df1, df2, on='Col_1', how='outer')
print (df_merge_col)
Col_1 Col_2 Col3
0 A1 12.0 345
1 A2 34.0 980
2 A3 98.0 543
3 A4 NaN 8765
4 A5 NaN 765
CodePudding user response:
You might use np.where
for this task following way:
import numpy as np
import pandas as pd
df = pd.DataFrame([['A1', 'A1'], ['A2', 'A2'], ['A3', None], ['A4', None]])
combined = np.where(df[0].isnull(), df[1], df[0])
df.drop(columns=[0, 1], inplace=True)
df[0] = combined
print(df)
output
0
0 A1
1 A2
2 A3
3 A4
np.where
is like ternary operator for pandas.Series
where column 0 has not value get value from column 1 otherwise from column 0
CodePudding user response:
Here is a generic solution using MultiIndex
and stack
.
In summary, it de-duplicates the columns by adding a unique id, which we then use to stack
the data. The rest is only sorting/cleaning.
dup_col = df.columns[df.columns.duplicated()]
df.columns = pd.MultiIndex.from_arrays([df.columns,
df.groupby(level=0, axis=1).cumcount()
])
(df.stack(level=1)
.sort_index(level=1)
.droplevel(1)
.drop_duplicates(subset=list(dup_col))
)
output:
Col_1 Col_2 Col_3
0 A1 12.0 345.0
1 A2 34.0 980.0
2 A3 98.0 543.0
3 A4 8765.0 NaN
4 A5 765.0 NaN