I have a weird pandas problem I am not sure how to begin. Here are examples of my two datasets:
df1: This dataset has a yearly metric per state.
Metric Year State
8 1996 AL
6 1997 AL
4 1998 AL
5 1999 AL
7 2000 AL
20 2001 AL
21 2002 AL
20 2003 AL
34 1996 CA
35 1997 CA
36 1998 CA
22 1999 CA
20 2000 CA
22 2001 CA
24 2002 CA
df2: This dataset has a law (which I'm referring to as ID) instituted in the year for the state.
ID State Year
ABC123 AL 1999
DEF456 AL 2000
GHI789 AL 2001
JKL012 AL 2001
PQR678 CA 1999
STU901 CA 2000
YZA567 CA 2001
I want to determine if there was a significant difference in the average of the metric before and after the law was instituted in that state for each ID. I would essentially want a fourth column in df2 that is just the avg(metric after) - avg(metric before). My first instinct was to use a np.where statement, but was unsure how to properly write the statement. Here's my attempt:
df2['diff'] = np.where((df2['Year']==2000) & (df2['State']=='AL'),df1[df1['Year']<2000]['Metric'].mean()-df1[df1'Year']>2000]['Metric'].mean(),0)
I know this is not correct as the alternative condition is just 0 and this would only apply to the condition of the year = 2000 and for Alabama. It doesn't filter out the California metrics from the calculation either.
So, what I am looking for is an iterative way of getting that difference for every state-year combination.
Any help would be appreciated! Thank you!
CodePudding user response:
First make a full outer join of df1
and df2
on State
:
df3 = df2.merge(df1, on='State', suffixes=('', '_metric'))
Then get the average metrics before and after the law was introduced for each ID
, State
, Year
combination:
df3.groupby(['ID', 'State', 'Year']).apply(
lambda x: pd.Series([x.loc[x.Year_metric < x.Year, 'Metric'].mean(),
x.loc[x.Year_metric > x.Year, 'Metric'].mean()],
index=['before', 'after'])
)
Result:
before after
ID State Year
ABC123 AL 1999 6.00 17.000000
DEF456 AL 2000 5.75 20.333333
GHI789 AL 2001 6.00 20.500000
JKL012 AL 2001 6.00 20.500000
PQR678 CA 1999 35.00 22.000000
STU901 CA 2000 31.75 23.000000
YZA567 CA 2001 29.40 24.000000
To just see the difference you can do instead:
df3.groupby(['ID', 'State', 'Year'], as_index=False).apply(
lambda x: x.loc[x.Year_metric > x.Year, 'Metric'].mean() -
x.loc[x.Year_metric < x.Year, 'Metric'].mean()
).rename(columns={None: 'Difference'})
Result:
ID State Year Difference
0 ABC123 AL 1999 11.000000
1 DEF456 AL 2000 14.583333
2 GHI789 AL 2001 14.500000
3 JKL012 AL 2001 14.500000
4 PQR678 CA 1999 -13.000000
5 STU901 CA 2000 -8.750000
6 YZA567 CA 2001 -5.400000