I have the following DataFrame:
import pandas as pd

data = {'id': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'location': ['Milan', 'Paris', 'New York', 'Rome', 'Los Angeles', 'Berlin', 'Madrid'],
        'year': [2003, 2004, 2005, 2003, 2004, 2004, 2005]}
data = pd.DataFrame(data)
Within each groupby('id') group, I want the combinations of the city at year t with all of that id's cities at years t-1, t-2, ..., t-n.
The desired output:
data = {'id': ['A', 'A', 'A', 'A',
               'B', 'B', 'B', 'B', 'B', 'B'],
        'location': ['Milan', 'Paris', 'New York', 'New York',
                     'Rome', 'Los Angeles', 'Berlin', 'Madrid', 'Madrid', 'Madrid'],
        'year': [2003, 2004, 2005, 2005,
                 2003, 2004, 2004, 2005, 2005, 2005],
        'comb': ['NaN', 'Milan', 'Milan', 'Paris',
                 'NaN', 'Rome', 'Rome', 'Rome', 'Los Angeles', 'Berlin']}
data = pd.DataFrame(data)
CodePudding user response:
Self merge, then query:
N = 2
out = (data.merge(data, on='id', suffixes=['', '_comb'])
           .query('0 < year - year_comb <= @N')
      )
Output:
   id     location  year location_comb  year_comb
3   A        Paris  2004         Milan       2003
6   A     New York  2005         Milan       2003
7   A     New York  2005         Paris       2004
13  B  Los Angeles  2004          Rome       2003
17  B       Berlin  2004          Rome       2003
21  B       Madrid  2005          Rome       2003
22  B       Madrid  2005   Los Angeles       2004
23  B       Madrid  2005        Berlin       2004
Note: the above does not contain the first location for each id, which can be obtained with data.sort_values('year').drop_duplicates('id'). So your final output would be:

out = pd.concat([data.merge(data, on='id', suffixes=['', '_comb'])
                     .query('0 < year - year_comb <= @N'),
                 data.sort_values('year').drop_duplicates('id')]
               )
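To line this up with the asker's desired comb column, a small cleanup pass (rename location_comb to comb, drop the helper year_comb column, sort) finishes the job. A minimal sketch on top of the out built above, not part of the original answer:

# Sketch: tidy `out` into the asker's desired shape.
final = (out.rename(columns={'location_comb': 'comb'})
            .drop(columns=['year_comb'])
            .sort_values(['id', 'year'])
            .reset_index(drop=True))
print(final)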
CodePudding user response:
Use a dictionary to get the locations by year:
import pandas as pd

data = {'id': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'location': ['Milan', 'Paris', 'New York', 'Rome', 'Los Angeles', 'Berlin', 'Madrid'],
        'year': [2003, 2004, 2005, 2003, 2004, 2004, 2005]}
df = pd.DataFrame(data)
print(df)

# collect the unique locations seen in each year
locations_by_year = {}
for year in df['year'].unique():
    locations_by_year[year] = df[df['year'] == year]['location'].unique()
print(locations_by_year)
print(locations_by_year)
output:
{2003: array(['Milan', 'Rome'], dtype=object), 2004: array(['Paris', 'Los Angeles', 'Berlin'], dtype=object), 2005: array(['New York', 'Madrid'], dtype=object)}
The same per-year grouping, shown with groupby:

df_grouped = df.groupby('year')
for name, group in df_grouped:
    print(name)
    print(group)
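The dictionary above only records which locations occur in each year; to reproduce the asker's comb column you still have to pair each row with the earlier locations of the same id. A minimal sketch of that remaining step (the rows list and result name are illustrative, not from the original answer):

rows = []
for _, r in df.iterrows():
    # all locations this id visited in strictly earlier years
    earlier = df[(df['id'] == r['id']) & (df['year'] < r['year'])]['location']
    if earlier.empty:
        # first year for this id: nothing earlier to combine with
        rows.append({'id': r['id'], 'location': r['location'],
                     'year': r['year'], 'comb': None})
    else:
        for loc in earlier:
            rows.append({'id': r['id'], 'location': r['location'],
                         'year': r['year'], 'comb': loc})
result = pd.DataFrame(rows)
print(result)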
CodePudding user response:
Generate the full cartesian product (all combinations of all rows of the original dataframe, per id), then filter with df.year_comb < df.year. This also drops the rows for each id's first year; those can be re-added afterwards to produce the rows with NaN values, if so desired.
df = (pd.merge(data, data.rename(columns={"location": "comb", "year": "year_comb"}), on=["id"])
        .loc[lambda df: df.year_comb < df.year]
        .drop(columns=["year_comb"])
     )

# re-append the first year of each id (DataFrame.append was removed in
# pandas 2.0, so use pd.concat instead)
data_first_years = data.sort_values(["year"]).groupby("id").first().reset_index()
pd.concat([df, data_first_years]).sort_values(["id", "year"]).reset_index(drop=True)
# out:
   id     location  year         comb
0   A        Milan  2003          NaN
1   A        Paris  2004        Milan
2   A     New York  2005        Milan
3   A     New York  2005        Paris
4   B         Rome  2003          NaN
5   B  Los Angeles  2004         Rome
6   B       Berlin  2004         Rome
7   B       Madrid  2005         Rome
8   B       Madrid  2005  Los Angeles
9   B       Madrid  2005       Berlin
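If you only want combinations within the last N years rather than the full history (as in the first answer's @N window), the same merge can be filtered on the year gap before dropping year_comb. A sketch under that assumption (N = 2 is illustrative):

# Sketch: restrict combinations to the previous N years.
N = 2
df_windowed = (pd.merge(data, data.rename(columns={"location": "comb", "year": "year_comb"}), on=["id"])
                 .loc[lambda d: (d.year_comb < d.year) & (d.year - d.year_comb <= N)]
                 .drop(columns=["year_comb"]))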