Python newbie attempting complex pandas DataFrame logic
I have multiple dataframes I need to join, but I'll show two below as an example. The dataframes have duplicate columns labelled with the suffix '_duplicate'. Instead of keeping the duplicate columns, I need to replicate the row, as shown below.
My first thought is to get a list of unique column names, create an empty dataframe with those columns, then loop over each source dataframe: if a column exists, append its value, and if the corresponding column_duplicate exists, append that as well, etc. But I'm unsure how to create this dataframe.
List_of_columns = ["a", "b", "c", "d", "id"]
Dataframe1:X
a | b | a_duplicate | b_duplicate | c | id |
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | id1 |
Dataframe2:Y
a | c | a_duplicate | c_duplicate | d | id |
---|---|---|---|---|---|
6 | 7 | 8 | 9 | 10 | id2 |
Created dataframe:
a | b | c | d | id |
---|---|---|---|---|
1 | 2 | 5 | Null | id1 |
3 | 4 | 5 | Null | id1 |
6 | Null | 7 | 10 | id2 |
8 | Null | 9 | 10 | id2 |
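For the empty dataframe part, this is as far as I got (guessing at the syntax):

```python
import pandas as pd

List_of_columns = ["a", "b", "c", "d", "id"]

# An empty frame with the target columns, ready to be filled row by row
result = pd.DataFrame(columns=List_of_columns)
```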
CodePudding user response:
This is a very silly way of doing it and I am hoping someone comes up with a better way... but it does work:
##################### Recreate OP's dataframes ##########################
import pandas as pd

data1 = {"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4, "c": 5, "id": "id1"}
data2 = {"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9, "d": 10, "id": "id2"}
df1 = pd.DataFrame(data1, index=[0])
df2 = pd.DataFrame(data2, index=[0])
#########################################################################
# Stack the base columns on top of the duplicate columns, renaming the
# duplicates back to their base names (DataFrame.append was removed in
# pandas 2.0, so pd.concat is used instead)
df1 = pd.concat([
    df1[["a", "b", "c", "id"]],
    df1[["a_duplicate", "b_duplicate", "c", "id"]].rename(
        columns={"a_duplicate": "a", "b_duplicate": "b"}),
])
df2 = pd.concat([
    df2[["a", "c", "d", "id"]],
    df2[["a_duplicate", "c_duplicate", "d", "id"]].rename(
        columns={"a_duplicate": "a", "c_duplicate": "c"}),
])
# Concatenate the resulting dataframes, reset the index, then put the
# columns in the correct order
df3 = pd.concat([df1, df2], ignore_index=True)[["a", "b", "c", "d", "id"]]
df3
Output:
a b c d id
0 1 2.0 5 NaN id1
1 3 4.0 5 NaN id1
2 6 NaN 7 10.0 id2
3 8 NaN 9 10.0 id2
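If there are many frames, the same idea can be generalized so the `_duplicate` columns are detected automatically rather than listed by hand. A sketch, assuming the suffix is always `_duplicate` and every such column has a matching base column (`split_duplicates` is a made-up helper name):

```python
import pandas as pd

def split_duplicates(df, suffix="_duplicate"):
    """Return the base rows plus one extra row built from the *_duplicate columns."""
    dup_cols = [c for c in df.columns if c.endswith(suffix)]
    # Rows built from the base columns only
    base = df.drop(columns=dup_cols)
    # Rows built from the duplicate columns, renamed back to their base names;
    # the non-duplicated columns (e.g. "c", "id") are kept as shared context
    extra = df.drop(columns=[c[: -len(suffix)] for c in dup_cols])
    extra = extra.rename(columns={c: c[: -len(suffix)] for c in dup_cols})
    return pd.concat([base, extra], ignore_index=True)

df1 = pd.DataFrame({"a": 1, "b": 2, "a_duplicate": 3, "b_duplicate": 4,
                    "c": 5, "id": "id1"}, index=[0])
df2 = pd.DataFrame({"a": 6, "c": 7, "a_duplicate": 8, "c_duplicate": 9,
                    "d": 10, "id": "id2"}, index=[0])

out = pd.concat([split_duplicates(df1), split_duplicates(df2)], ignore_index=True)
out = out[["a", "b", "c", "d", "id"]]
```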
CodePudding user response:
You can try
import pandas as pd

def explode(df):
    duplicate_cols = (df.columns.str.extract('(.*)_duplicate')
                        .dropna()[0].tolist())
    unduplicate_cols = (df.columns.difference(duplicate_cols)
                          .to_series()
                          [lambda s: ~s.str.contains('_duplicate')].tolist())
    # Group e.g. "a" and "a_duplicate" into one list-valued column "a",
    # then explode the lists back out into separate rows
    out = df.T.groupby(df.columns.str.split('_').str[0]).agg(list).T
    out = (out.explode(duplicate_cols, ignore_index=True)
              .explode(unduplicate_cols, ignore_index=True))
    return out

out = pd.concat([explode(df1), explode(df2)], ignore_index=True)
print(out)
a b c id d
0 1 2 5 id1 NaN
1 3 4 5 id1 NaN
2 6 NaN 7 id2 10
3 8 NaN 9 id2 10
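Note that passing a list of columns to `DataFrame.explode` (as this answer does) requires pandas 1.3 or newer; older versions only accept a single column. A minimal illustration of the multi-column explode it relies on:

```python
import pandas as pd

df = pd.DataFrame({"x": [[1, 2]], "y": [[3, 4]], "z": ["k"]})

# Both list columns are exploded in lockstep, pairing 1 with 3 and 2 with 4;
# the scalar column "z" is simply repeated for each new row
out = df.explode(["x", "y"], ignore_index=True)
```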