Python newbie to permutate duplicate columns
I have a dataframe like this.
a | b | c | c_dup1 | c_dup2 | c_dup3 | d | d_dup1 | d_dup2 | d_dup3 |
---|---|---|---|---|---|---|---|---|---|
hello | bye | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
I've only managed to do this for where only c_dup1 exist with the below logic.
data1 = {'a':'hello', 'b':'bye', 'c':1,'c_dup1':2}
df1 = pd.DataFrame(data1, index=[0])
for col in df1.columns:
x_df = df1
if '_dup' in col:
match_col = re.compile(r'(.*?)_')
col_name = match_col.findall(col)[0]
# Drop duplicate column from original
df1 = df1.drop(col, axis=1)
# Drop original column from duplicate
df_pdup = x_df.drop(col_name, axis=1)
df_pdup = df_pdup.rename(columns={col:col_name})
final = pd.concat([df1, df_pdup], ignore_index = True)
Output is :
a | b | c |
---|---|---|
hello | bye | 1 |
hello | bye | 2 |
How do I replicate the rows to get the below output? For my initial dataframe above?
a | b | c | d |
---|---|---|---|
hello | bye | 1 | 5 |
hello | bye | 2 | 6 |
hello | bye | 3 | 7 |
hello | bye | 4 | 8 |
CodePudding user response:
IIUC, you can use a manual reshaping with a MultiIndex:
cols = ['a', 'b']
out = (df
.set_index(cols)
.pipe(lambda d: d.set_axis(d.columns.str.split('_dup', expand=True), axis=1))
.stack()
.droplevel(-1).reset_index()
)
output:
a b c d
0 hello bye 1 5
1 hello bye 2 6
2 hello bye 3 7
3 hello bye 4 8
used input:
a b c c_dup1 c_dup2 c_dup3 d d_dup1 d_dup2 d_dup3
0 hello bye 1 2 3 4 5 6 7 8
For a programmatic way of getting a/b as the only columns that do not have an equivalent with '_dup'
, you can use:
import re
target = df.columns.str.extract('(.*)_dup', expand=False).dropna().unique()
# Index(['c', 'd'], dtype='object')
regex = fr"^({'|'.join(map(re.escape, target))})"
# ^(c|d)
cols = list(df.columns[~df.columns.str.contains(regex)])
# ['a', 'b']
NB. there might be limitations if there are overlapping suffixes (e.g. ABC/ABCD)