I have the following DataFrame:
import pandas as pd

df = pd.DataFrame([
    {"A": 1, "B": "20", "pairs": [(1,2), (2,3)]},
    {"A": 2, "B": "22", "pairs": [(1,1), (2,2), (1,3)]},
    {"A": 3, "B": "24", "pairs": [(1,1), (3,3)]},
    {"A": 4, "B": "26", "pairs": [(1,3)]},
])
>>> df
   A   B                     pairs
0  1  20          [(1, 2), (2, 3)]
1  2  22  [(1, 1), (2, 2), (1, 3)]
2  3  24          [(1, 1), (3, 3)]
3  4  26                  [(1, 3)]
Rather than keeping these as lists of tuples, I'd like to split each pair into two new columns, p1 and p2, holding the first and second member of each tuple respectively. There is also a wide-to-long element here, in that each row should be exploded into as many rows as there are pairs in its list.
This does not appear to fit most of the wide-to-long documentation I can find. My desired output is:
>>> df
   A   B  p1  p2
0  1  20   1   2
1  1  20   2   3
2  2  22   1   1
3  2  22   2   2
4  2  22   1   3
5  3  24   1   1
6  3  24   3   3
7  4  26   1   3
CodePudding user response:
First explode the pairs column, then join the split tuples back on:
s = df.explode('pairs').reset_index(drop=True)
out = s.join(pd.DataFrame(s.pop('pairs').tolist(),columns=['p1','p2']))
out
Out[98]:
   A   B  p1  p2
0  1  20   1   2
1  1  20   2   3
2  2  22   1   1
3  2  22   2   2
4  2  22   1   3
5  3  24   1   1
6  3  24   3   3
7  4  26   1   3
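A note on why the reset_index step matters: join aligns rows on index labels, and explode keeps the original labels (0, 0, 1, 1, 1, ...). The sketch below uses a two-row version of the question's frame to show what happens with and without the reset. The names parts, misaligned and aligned are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {"A": 1, "B": "20", "pairs": [(1, 2), (2, 3)]},
    {"A": 2, "B": "22", "pairs": [(1, 1), (2, 2), (1, 3)]},
])

# After explode the index still carries the original labels: 0, 0, 1, 1, 1.
s = df.explode("pairs")

# Building a frame from the tuples gives a fresh index 0..4.
parts = pd.DataFrame(s["pairs"].tolist(), columns=["p1", "p2"])

# join matches on labels, so both rows labelled 0 pick up parts.loc[0],
# and all rows labelled 1 pick up parts.loc[1]: the pairs are wrong.
misaligned = s.drop(columns="pairs").join(parts)

# Resetting the index first gives both frames the same labels 0..4.
s = s.reset_index(drop=True)
aligned = s.drop(columns="pairs").join(parts)

print(misaligned)
print(aligned)
```

Only the second frame lines each pair up with the row it came from.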
CodePudding user response:
Use explode:
>>> df.join(df.pop('pairs').explode().apply(pd.Series)
...         .rename(columns={0: 'p1', 1: 'p2'}))
   A   B  p1  p2
0  1  20   1   2
0  1  20   2   3
1  2  22   1   1
1  2  22   2   2
1  2  22   1   3
2  3  24   1   1
2  3  24   3   3
3  4  26   1   3
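Two things to be aware of with this approach: df.pop removes the pairs column from df in place, and the result keeps the repeated index 0, 0, 1, ... rather than the 0..7 index shown in the question. apply(pd.Series) also tends to be slow on large frames, since it builds one Series per row. A possible variant is sketched below, assuming every pairs list is non-empty; the names exploded, parts and out are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([
    {"A": 1, "B": "20", "pairs": [(1, 2), (2, 3)]},
    {"A": 2, "B": "22", "pairs": [(1, 1), (2, 2), (1, 3)]},
    {"A": 3, "B": "24", "pairs": [(1, 1), (3, 3)]},
    {"A": 4, "B": "26", "pairs": [(1, 3)]},
])

# Explode only the pairs column; the Series keeps the labels 0, 0, 1, 1, 1, ...
exploded = df["pairs"].explode()

# Split the tuples into two columns, reusing the exploded labels so that
# join can match each pair back to its source row.
parts = pd.DataFrame(exploded.tolist(), index=exploded.index,
                     columns=["p1", "p2"])

# drop returns a copy, so df itself still has its pairs column afterwards.
out = df.drop(columns="pairs").join(parts).reset_index(drop=True)
print(out)
```

The final reset_index(drop=True) is only there to get a clean 0..7 index; leave it off if the original row labels are useful.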
CodePudding user response:
Is this what you have in mind?
(df.explode('pairs') # blow it up into individual rows
   .assign(p1 = lambda df: df.pairs.str[0],
           p2 = lambda df: df.pairs.str[-1])
   .drop(columns='pairs')
)
Out[1234]:
   A   B  p1  p2
0  1  20   1   2
0  1  20   2   3
1  2  22   1   1
1  2  22   2   2
1  2  22   1   3
2  3  24   1   1
2  3  24   3   3
3  4  26   1   3
Another option, using the apply method and more lines of code (performance-wise, I have no idea which is better):
(df
 .set_index(['A', 'B'])
 .pairs
 .apply(pd.Series)
 .stack()
 .apply(pd.Series)
 .droplevel(-1)
 .set_axis(['p1', 'p2'],axis=1)
 .reset_index()
)
Out[1244]:
   A   B  p1  p2
0  1  20   1   2
1  1  20   2   3
2  2  22   1   1
3  2  22   2   2
4  2  22   1   3
5  3  24   1   1
6  3  24   3   3
7  4  26   1   3
Since pairs holds lists of tuples, you may gain some performance by moving the wrangling/transformation into pure Python, before recombining into a DataFrame:
from itertools import chain

repeats = [*map(len, df.pairs)]
reshaped = chain.from_iterable(df.pairs)
reshaped = pd.DataFrame(reshaped,
                        columns = ['p1', 'p2'],
                        index = df.index.repeat(repeats))
df.drop(columns='pairs').join(reshaped)
Out[1265]:
   A   B  p1  p2
0  1  20   1   2
0  1  20   2   3
1  2  22   1   1
1  2  22   2   2
1  2  22   1   3
2  3  24   1   1
2  3  24   3   3
3  4  26   1   3
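If you want to check the performance question on your own data, something like the following rough sketch with timeit could help. The function names via_explode and via_chain, the 2000-fold repetition of the sample frame, and the repeat count are all arbitrary choices made for illustration, and it assumes every pairs list is non-empty. No timings are quoted here since they depend on the machine and the pandas version.

```python
import timeit
from itertools import chain

import pandas as pd

small = pd.DataFrame([
    {"A": 1, "B": "20", "pairs": [(1, 2), (2, 3)]},
    {"A": 2, "B": "22", "pairs": [(1, 1), (2, 2), (1, 3)]},
    {"A": 3, "B": "24", "pairs": [(1, 1), (3, 3)]},
    {"A": 4, "B": "26", "pairs": [(1, 3)]},
])


def via_explode(df):
    # explode, then rebuild p1/p2 from the list of tuples
    s = df.explode("pairs").reset_index(drop=True)
    parts = pd.DataFrame(s["pairs"].tolist(), columns=["p1", "p2"])
    return s.drop(columns="pairs").join(parts)


def via_chain(df):
    # flatten in pure Python, repeating each row label once per pair
    repeats = [len(p) for p in df["pairs"]]
    parts = pd.DataFrame(list(chain.from_iterable(df["pairs"])),
                         columns=["p1", "p2"],
                         index=df.index.repeat(repeats))
    return df.drop(columns="pairs").join(parts).reset_index(drop=True)


# A larger frame so that the timings are not dominated by fixed overhead.
big = pd.concat([small] * 2000, ignore_index=True)

for func in (via_explode, via_chain):
    seconds = timeit.timeit(lambda: func(big), number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per call")
```

Both functions return the same rows, so the comparison is like for like; swap in your own frame for big to get numbers that mean something for your case.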