I have a dataframe that was a result of a join operation. This operation had multiple matches, resulting in multiple rows. I want to move resulting match rows to be moved in to columns. Here is an example:
import pandas as pd
a = pd.DataFrame([[111,2,3]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111,'999','some data'],
[111,'999888','some more data']],
columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
I get:
id var1 var2 B C
0 111 2 3 999 some data
1 111 2 3 999888 some more data
but really I want:
id var1 var2 B C B C
0 111 2 3 999 some data 999888 some more data
I was thinking pivot was what I wanted but it makes the value the columns, not what I want. How can I achieve this and what is this operation called?
EDIT: To clarify, I don't care about the column names, could be b1 and b2 etc.
CodePudding user response:
Use:
c.groupby('id').agg({'var1': lambda x: x[0], 'var2': lambda x: x[0], 'B': lambda x: list(x), 'C': lambda x: list(x)})
CodePudding user response:
IIUC, you can first reshape your dataframe b
to force having duplicated columns, then join
to a
:
b2 = (b
.assign(col=b.groupby('id').cumcount())
.pivot(index='id', columns='col')
.sort_index(level='col', axis=1, sort_remaining=False)
.droplevel('col', axis=1)
)
# B C B C
# id
# 111 999 some data 999888 some more data
c = a.join(b2, on='id')
# id var1 var2 B C B C
# 0 111 2 3 999 some data 999888 some more data
with non-duplicated column names:
b2 = (b.assign(col=b.groupby('id').cumcount().add(1))
.pivot(index='id', columns='col')
.sort_index(level='col', axis=1, sort_remaining=False)
.pipe(lambda d: d.set_axis(d.columns.map(lambda x: '_'.join(map(str,x))),
axis=1))
)
# B_1 C_1 B_2 C_2
# id
# 111 999 some data 999888 some more data
c = a.join(b2, on='id')
# id var1 var2 B_1 C_1 B_2 C_2
# 0 111 2 3 999 some data 999888 some more data
CodePudding user response:
Let us stack
and unstack
to reshape the dataframe:
k = ['id', 'var1', 'var2']
c = c.set_index([*k, c.groupby(k).cumcount().add(1)]).stack().unstack([-1, -2])
c.columns = c.columns.map('{0[0]}_{0[1]}'.format)
Result
print(c)
B_1 C_1 B_2 C_2
id var1 var2
111 2 3 999 some data 999888 some more data
CodePudding user response:
A solution in brief:
c = pd.concat([
c.groupby('id').nth(0).drop(columns=['B','C']).reset_index(),
c[['B','C']].unstack().to_frame().swaplevel().sort_index().T
], axis=1)
c = c.rename(columns={col:col[1] if isinstance(col, tuple) else col for col in c.columns})
Output:
id var1 var2 B C B C
0 111 2 3 999 some data 999888 some more data
Some details / variations:
To get a result without duplicate column names, instead having each column in each "result set" be a tuple of the result set number (0, 1, ...) and the column name (B
, C
):
import pandas as pd
a = pd.DataFrame([[111,2,3]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111,'999','some data'], [111,'999888','some more data']], columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
print(c)
x = c.groupby('id').nth(0).drop(columns=['B','C']).reset_index()
y = c[['B','C']].unstack().to_frame().swaplevel().sort_index().T
c = pd.concat([x, y], axis=1)
print(c)
Output:
id var1 var2 (0, B) (0, C) (1, B) (1, C)
0 111 2 3 999 some data 999888 some more data
After this, if you decide you want each B, C
result to be in a pair of columns with duplicate names (B, C
) you can then do this:
c.columns = [col[1] if isinstance(col, tuple) else col for col in c.columns]
... which will make c
look like this:
id var1 var2 B C B C
0 111 2 3 999 some data 999888 some more data