A task I come up against reasonable often is something like the following transformation:
from:
home_team_id away_team_id home_team away_team
1 1 2 Arsenal Tottenham
2 2 3 Tottenham Chelsea
to
team value
1 Arsenal 1
2 Tottenham 2
3 Tottenham 2
4 Chelsea 3
in my head I refer to this as a 'double pivot' though maybe there's a more fitting name.
In R I can do this by (written off the top of my head- I'm sure the gsub can be optimised/cleaned somewhat):
library(tidyverse)
example_df_R = data.frame(
home_team_id = c(1, 2),
away_team_id = c(2,3),
home_team = c("Arsenal", "Tottenham"),
away_team = c("Tottenham", "Chelsea")
)
example_df_R %>%
pivot_longer(cols = ends_with("id")) %>%
pivot_longer(cols = ends_with("team"), values_to = "team", names_to = "team_location") %>%
filter(gsub("_id$", "", name) == team_location) %>%
select(team, value)
In python it feels like this should be the equivalent:
import pandas as pd
example_df_py = pd.DataFrame(
{
"home_team_id": [1, 2],
"away_team_id": [2, 3],
"home_team": ["Arsenal", "Tottenham"],
"away_team": ["Tottenham", "Chelsea"],
}
)
result = (
example_df_py.melt(id_vars=["home_team", "away_team"])
.melt(id_vars=["variable", "value"], var_name="team_location", value_name="team")
.loc[lambda dfr: dfr["variable"].str.startswith(dfr["team_location"].iloc[0])][
["team", "value"]
]
)
result
however that gives me:
team value
0 Arsenal 1
1 Tottenham 2
4 Tottenham 1
5 Chelsea 2
I fully understand why I get that result (I've included the iloc which means it isn't running row-by-row on both columns to make the code run), but not sure what the equivalent correct, 'elegant' (i.e. preferably in a chain for the context I frequently have to use), pythonic code is for the R posted above
Many thanks!
CodePudding user response:
I don't know if this is the best way to do this (or the most elegant one XD), but I managed to achieve the expected results using tuples and the explode method.
(example_df_py
.assign(team=lambda df: df[["home_team", "away_team"]].apply(tuple, axis=1), # (team1, team2), ...
value=lambda df: df[["home_team_id", "away_team_id"]].apply(tuple, axis=1)) # (id1, id2), ...
.explode(["team", "value"])
.loc[:, ["team", "value"]]
.reset_index(drop=True)
)
team value
0 Arsenal 1
1 Tottenham 2
2 Tottenham 2
3 Chelsea 3