Combinations within groupby pandas


I have the following DataFrame:

import pandas as pd

data = {'id': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'location':['Milan', 'Paris', 'New York', 'Rome', 'Los Angeles', 'Berlin', 'Madrid'],
        'year': [2003,2004,2005, 2003, 2004, 2004, 2005]}

data = pd.DataFrame(data)

Within each id group, I want to pair each city at year t with every city from the earlier years t-1, t-2, ..., t-n.

The desired output:

data = {'id': ['A', 'A', 'A', 'A',
               'B', 'B', 'B', 'B', 'B', 'B'],
        'location':['Milan', 'Paris', 'New York', 'New York',
                    'Rome', 'Los Angeles', 'Berlin', 'Madrid','Madrid', 'Madrid'],
        'year': [2003, 2004, 2005, 2005,
                 2003, 2004, 2004, 2005, 2005, 2005],
       'comb': ['NaN', 'Milan', 'Milan','Paris',  
                'NaN', 'Rome', 'Rome', 'Rome','Los Angeles', 'Berlin']}

data = pd.DataFrame(data)

CodePudding user response:

Self merge, then query:

N = 2
out = (data.merge(data, on='id', suffixes=['','_comb'])
           .query('0 < year - year_comb <= @N'))


   id     location  year location_comb  year_comb
3   A        Paris  2004         Milan       2003
6   A     New York  2005         Milan       2003
7   A     New York  2005         Paris       2004
13  B  Los Angeles  2004          Rome       2003
17  B       Berlin  2004          Rome       2003
21  B       Madrid  2005          Rome       2003
22  B       Madrid  2005   Los Angeles       2004
23  B       Madrid  2005        Berlin       2004
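
To make the effect of N easier to see, here is a minimal self-contained sketch of the same self merge, wrapped in a helper. The function name lagged_pairs and its parameter n are hypothetical, chosen only for illustration; the boolean mask is meant to be equivalent to the query string above.

```python
import pandas as pd

data = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "B"],
    "location": ["Milan", "Paris", "New York", "Rome",
                 "Los Angeles", "Berlin", "Madrid"],
    "year": [2003, 2004, 2005, 2003, 2004, 2004, 2005],
})


def lagged_pairs(frame, n):
    """Pair each row with the same-id rows from the previous 1..n years.

    Hypothetical helper: mirrors merge + query('0 < year - year_comb <= @N').
    """
    # Self merge on id: every row meets every row of the same id.
    pairs = frame.merge(frame, on="id", suffixes=("", "_comb"))
    # Keep only partners that lie 1..n years in the past.
    lag = pairs["year"] - pairs["year_comb"]
    return pairs[(lag > 0) & (lag <= n)]


pairs_1 = lagged_pairs(data, n=1)  # partners from year t-1 only
pairs_2 = lagged_pairs(data, n=2)  # partners from years t-1 and t-2
```

With this data, n=2 already reaches back to the first year of every id, so any larger n gives the same pairs.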

Note: the above does not contain the first location for each id, which can be obtained with data.drop_duplicates('id') (this assumes the rows are sorted by year within each id). So your final output would be

out = pd.concat([data.merge(data, on='id', suffixes=['','_comb'])
                     .query('0 < year - year_comb <= @N'),
                 data.drop_duplicates('id')])

CodePudding user response:

Use a dictionary to get the locations by year:

data = {'id': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'location':['Milan', 'Paris', 'New York', 'Rome', 'Los Angeles', 'Berlin', 'Madrid'],
        'year': [2003,2004,2005, 2003, 2004, 2004, 2005]}

df = pd.DataFrame(data)

locations_by_year = {}
for year in df['year'].unique():
    locations_by_year[year] = df[df['year'] == year]['location'].unique()


{2003: array(['Milan', 'Rome'], dtype=object), 2004: array(['Paris', 'Los Angeles', 'Berlin'], dtype=object), 2005: array(['New York', 'Madrid'], dtype=object)}

locations by year:

df_grouped = df.groupby(['location'])
for name, group in df_grouped:

CodePudding user response:

Generate the cartesian product of the rows within each id (a self-merge of the original dataframe on id), then filter by df.year_comb < df.year. This also drops the rows for the first year of each id, since they have no earlier year to pair with. These can be re-added to produce the rows with NaN values in the output df, if so desired.

df = (pd.merge(data, data.rename(columns={"location": "comb", "year": "year_comb"}), on=["id"])
            .loc[lambda d: d.year_comb < d.year]
            .drop(["year_comb"], axis=1))
# re-append the first years (DataFrame.append was removed in pandas 2.0, so use pd.concat)
data_first_years = data.sort_values(["year"]).groupby("id").first().reset_index()
pd.concat([df, data_first_years]).sort_values(["id", "year"]).reset_index(drop=True)

# out:
  id     location  year         comb
0  A        Milan  2003          NaN
1  A        Paris  2004        Milan
2  A     New York  2005        Milan
3  A     New York  2005        Paris
4  B         Rome  2003          NaN
5  B  Los Angeles  2004         Rome
6  B       Berlin  2004         Rome
7  B       Madrid  2005         Rome
8  B       Madrid  2005  Los Angeles
9  B       Madrid  2005       Berlin
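
One thing to watch: groupby("id").first() returns a single row per id, so if an id had several locations in its earliest year, only one of them would be re-added. The sketch below is one possible variant, not part of the answer above, that keeps every first-year row by comparing each row's year with the minimum year of its id. The names partners, is_first_year, first_rows and result are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "B"],
    "location": ["Milan", "Paris", "New York", "Rome",
                 "Los Angeles", "Berlin", "Madrid"],
    "year": [2003, 2004, 2005, 2003, 2004, 2004, 2005],
})

# All same-id pairs where the partner comes from a strictly earlier year.
partners = data.rename(columns={"location": "comb", "year": "year_comb"})
pairs = data.merge(partners, on="id")
pairs = pairs[pairs["year_comb"] < pairs["year"]].drop(columns="year_comb")

# Rows in the earliest year of their id have no partner; keep all of them.
is_first_year = data["year"] == data.groupby("id")["year"].transform("min")
first_rows = data[is_first_year]

# first_rows has no "comb" column, so concat fills it with NaN.
result = (
    pd.concat([first_rows, pairs])
    .sort_values(["id", "year"])
    .reset_index(drop=True)
)
```

For the data in the question, every id has exactly one first-year row, so this should give the same ten rows as the output shown above.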