Python newbie attempting complex pandas DataFrame logic
I have multiple dataframes I need to join, but I'll show two below as an example. The dataframes have duplicate columns labelled with the suffix '_duplicate'. Instead of keeping the duplicate columns, I need to replicate the row, as shown below.
My first thought is to get a list of unique column names, create an empty dataframe with those columns, then loop over each source dataframe: if a column exists, append its value, and if the corresponding column_duplicate exists, append that as well, etc. But I'm unsure how to create this dataframe.
List_of_columns = ["a", "b", "c", "d", "id"]
Dataframe1:X
a | b | a_duplicate | b_duplicate | c | id |
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | id1 |
Dataframe2:Y
a | c | a_duplicate | c_duplicate | d | id |
---|---|---|---|---|---|
6 | 7 | 8 | 9 | 10 | id2 |
Created dataframe:
a | b | c | d | id |
---|---|---|---|---|
1 | 2 | 5 | Null | id1 |
3 | 4 | 5 | Null | id1 |
6 | Null | 7 | 10 | id2 |
8 | Null | 9 | 10 | id2 |
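For the empty dataframe part, this is as far as I got (guessing at the syntax):

```python
import pandas as pd

List_of_columns = ["a", "b", "c", "d", "id"]

# An empty frame with the target columns, ready to be filled row by row
result = pd.DataFrame(columns=List_of_columns)
```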
CodePudding user response:
This is a very silly way of doing it and I am hoping someone comes up with a better way... but it does work:
##################### Recreate OP's dataframes ##########################
import pandas as pd

data1 = {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"}
data2 = {"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9, "d": 10, "id": "id2"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
#########################################################################
# Stack the base columns on top of the duplicate columns, renaming the
# duplicates back to their base names (DataFrame.append was removed in
# pandas 2.0, so pd.concat is used instead)
df1 = pd.concat([
    df1[["a", "b", "c", "id"]],
    df1[["a_duplicate", "b_duplicate", "c", "id"]].rename(
        columns={"a_duplicate": "a", "b_duplicate": "b"}),
])
df2 = pd.concat([
    df2[["a", "c", "d", "id"]],
    df2[["a_duplicate", "c_duplicate", "d", "id"]].rename(
        columns={"a_duplicate": "a", "c_duplicate": "c"}),
])
# Concatenate the resulting dataframes, reset the index, then put the
# columns in the correct order
df3 = pd.concat([df1, df2], ignore_index=True)[["a", "b", "c", "d", "id"]]
df3
Output:
a b c d id
0 1 2.0 5 NaN id1
1 3 4.0 5 NaN id1
2 6 NaN 7 10.0 id2
3 8 NaN 9 10.0 id2
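If there are many frames, the same idea can be generalized so the `_duplicate` columns are detected automatically rather than listed by hand. A sketch, assuming the suffix is always `_duplicate` and every such column has a matching base column (`split_duplicates` is a made-up helper name):

```python
import pandas as pd

def split_duplicates(df, suffix="_duplicate"):
    """Return the base rows plus one extra row built from the *_duplicate columns."""
    dup_cols = [c for c in df.columns if c.endswith(suffix)]
    # Rows built from the base columns only
    base = df.drop(columns=dup_cols)
    # Rows built from the duplicate columns, renamed back to their base names;
    # the non-duplicated columns (e.g. "c", "id") are kept as shared context
    extra = df.drop(columns=[c[: -len(suffix)] for c in dup_cols])
    extra = extra.rename(columns={c: c[: -len(suffix)] for c in dup_cols})
    return pd.concat([base, extra], ignore_index=True)

df1 = pd.DataFrame({"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4,
                    "c": 5, "id": "id1"}, index=[0])
df2 = pd.DataFrame({"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9,
                    "d": 10, "id": "id2"}, index=[0])

out = pd.concat([split_duplicates(df1), split_duplicates(df2)], ignore_index=True)
out = out[["a", "b", "c", "d", "id"]]
```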
CodePudding user response:
You can try
import pandas as pd

def explode(df):
    duplicate_cols = (df.columns.str.extract('(.*)_duplicate')
                        .dropna()[0].tolist())
    unduplicate_cols = (df.columns.difference(duplicate_cols)
                          .to_series()
                          [lambda s: ~s.str.contains('_duplicate')].tolist())
    # Group e.g. "a" and "a_duplicate" into one list-valued column "a",
    # then explode the lists back out into separate rows
    out = df.T.groupby(df.columns.str.split('_').str[0]).agg(list).T
    out = (out.explode(duplicate_cols, ignore_index=True)
              .explode(unduplicate_cols, ignore_index=True))
    return out

out = pd.concat([explode(df1), explode(df2)], ignore_index=True)
print(out)
a b c id d
0 1 2 5 id1 NaN
1 3 4 5 id1 NaN
2 6 NaN 7 id2 10
3 8 NaN 9 id2 10
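Note that passing a list of columns to `DataFrame.explode` (as this answer does) requires pandas 1.3 or newer; older versions only accept a single column. A minimal illustration of the multi-column explode it relies on:

```python
import pandas as pd

df = pd.DataFrame({"x": [[1, 2]], "y": [[3, 4]], "z": ["k"]})

# Both list columns are exploded in lockstep, pairing 1 with 3 and 2 with 4;
# the scalar column "z" is simply repeated for each new row
out = df.explode(["x", "y"], ignore_index=True)
```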