Home > Software engineering >  Create a new column based on condition from a column in another dataset
Create a new column based on condition from a column in another dataset

Time:06-18

I have a weird pandas problem I am not sure how to begin. Here are examples of my two datasets:

df1: This dataset has a yearly metric per state.

Metric   Year    State
8        1996    AL
6        1997    AL
4        1998    AL
5        1999    AL
7        2000    AL
20       2001    AL
21       2002    AL
20       2003    AL
34       1996    CA
35       1997    CA
36       1998    CA
22       1999    CA
20       2000    CA
22       2001    CA
24       2002    CA

df2: This dataset has a law (which I'm referring to as ID) instituted in the year for the state.

ID      State    Year
ABC123  AL       1999
DEF456  AL       2000
GHI789  AL       2001
JKL012  AL       2001
PQR678  CA       1999
STU901  CA       2000
YZA567  CA       2001

I want to determine if there was a significant difference in the average of the metric before and after the law was instituted in that state for each ID. I would essentially want a fourth column in df2 that is just the avg(metric after) - avg(metric before). My first instinct was to use a np.where statement, but was unsure how to properly write the statement. Here's my attempt:

df2['diff'] = np.where((df2['Year']==2000) & (df2['State']=='AL'),df1[df1['Year']<2000]['Metric'].mean()-df1[df1'Year']>2000]['Metric'].mean(),0)

I know this is not correct as the alternative condition is just 0 and this would only apply to the condition of the year = 2000 and for Alabama. It doesn't filter out the California metrics from the calculation either.

So, what I am looking for is an iterative way of getting that difference for every state-year combination.

Any help would be appreciated! Thank you!

CodePudding user response:

First make a full outer join of df1 and df2 on State:

df3 = df2.merge(df1, on='State', suffixes=('', '_metric'))

Then get the average metrics before and after the law was introduced for each ID, State, Year combination:

df3.groupby(['ID', 'State', 'Year']).apply(
    lambda x: pd.Series([x.loc[x.Year_metric < x.Year, 'Metric'].mean(),
                         x.loc[x.Year_metric > x.Year, 'Metric'].mean()],
                         index=['before', 'after'])
    )

Result:

                   before      after
ID     State Year                   
ABC123 AL    1999    6.00  17.000000
DEF456 AL    2000    5.75  20.333333
GHI789 AL    2001    6.00  20.500000
JKL012 AL    2001    6.00  20.500000
PQR678 CA    1999   35.00  22.000000
STU901 CA    2000   31.75  23.000000
YZA567 CA    2001   29.40  24.000000

To just see the difference you can do instead:

df3.groupby(['ID', 'State', 'Year'], as_index=False).apply(
    lambda x: x.loc[x.Year_metric > x.Year, 'Metric'].mean() -
              x.loc[x.Year_metric < x.Year, 'Metric'].mean()
    ).rename(columns={None: 'Difference'})

Result:

       ID State  Year  Difference
0  ABC123    AL  1999   11.000000
1  DEF456    AL  2000   14.583333
2  GHI789    AL  2001   14.500000
3  JKL012    AL  2001   14.500000
4  PQR678    CA  1999  -13.000000
5  STU901    CA  2000   -8.750000
6  YZA567    CA  2001   -5.400000
  • Related