Home > Mobile >  pythonic equivalent of R double pivot and filter
pythonic equivalent of R double pivot and filter

Time:01-17

A task I come up against reasonable often is something like the following transformation:

from:

  home_team_id away_team_id home_team away_team
1            1            2   Arsenal Tottenham
2            2            3 Tottenham   Chelsea

to

  team      value
1 Arsenal       1
2 Tottenham     2
3 Tottenham     2
4 Chelsea       3

in my head I refer to this as a 'double pivot' though maybe there's a more fitting name.

In R I can do this by (written off the top of my head- I'm sure the gsub can be optimised/cleaned somewhat):

library(tidyverse)
example_df_R = data.frame(
  home_team_id = c(1, 2), 
  away_team_id = c(2,3), 
  home_team = c("Arsenal", "Tottenham"), 
  away_team = c("Tottenham", "Chelsea")
)

example_df_R %>% 
  pivot_longer(cols = ends_with("id")) %>% 
  pivot_longer(cols = ends_with("team"), values_to = "team", names_to = "team_location") %>%
  filter(gsub("_id$", "", name) == team_location) %>%
  select(team, value)

In python it feels like this should be the equivalent:

import pandas as pd

example_df_py = pd.DataFrame(
    {
        "home_team_id": [1, 2],
        "away_team_id": [2, 3],
        "home_team": ["Arsenal", "Tottenham"],
        "away_team": ["Tottenham", "Chelsea"],
    }
)
result = (
    example_df_py.melt(id_vars=["home_team", "away_team"])
    .melt(id_vars=["variable", "value"], var_name="team_location", value_name="team")
    .loc[lambda dfr: dfr["variable"].str.startswith(dfr["team_location"].iloc[0])][
        ["team", "value"]
    ]
)

result

however that gives me:

        team  value
0    Arsenal      1
1  Tottenham      2
4  Tottenham      1
5    Chelsea      2

I fully understand why I get that result (I've included the iloc which means it isn't running row-by-row on both columns to make the code run), but not sure what the equivalent correct, 'elegant' (i.e. preferably in a chain for the context I frequently have to use), pythonic code is for the R posted above

Many thanks!

CodePudding user response:

I don't know if this is the best way to do this (or the most elegant one XD), but I managed to achieve the expected results using tuples and the explode method.

    (example_df_py
     .assign(team=lambda df: df[["home_team", "away_team"]].apply(tuple, axis=1), # (team1, team2), ...
             value=lambda df: df[["home_team_id", "away_team_id"]].apply(tuple, axis=1)) # (id1, id2), ...
     .explode(["team", "value"])
     .loc[:, ["team", "value"]]
     .reset_index(drop=True)
    )
    team      value
0   Arsenal       1
1   Tottenham     2
2   Tottenham     2
3   Chelsea       3
  • Related