Identify change in status due to change in categorical variable in panel data

I have unbalanced panel data (repeated observations per ID at different points in time). I need to identify when a variable changes for each person over time.

Here is the code to generate the data frame:

import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)


df

I am trying to create a column ("change") that flags the date on which an individual's job status changes, e.g. individual 2 switching from A to B on 01/02/2021.

I have tried the following, but it is giving me an error:

df['change']=df.groupby(['id'])['job'].diff().fillna(0)

CodePudding user response：

The error occurs because you call 'diff' on the 'job' column, whose dtype is 'object'. 'diff' subtracts each value from the previous one, and subtraction is not defined for strings, so it only works on numeric-like columns.
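To illustrate the dtype point, here is a minimal sketch that keeps 'diff' but first maps the job labels to integer codes. Using pd.factorize for the mapping is my own choice, not something from the question, and treating the first row of each id as "no change" is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)

# Map each job label to an integer code so that subtraction is defined.
job_code = pd.Series(pd.factorize(df["job"])[0], index=df.index)

# diff() is NaN on the first row of each id; assume that means "no change".
step = job_code.groupby(df["id"]).diff().fillna(0)

# Any non-zero step means the label differs from the previous row.
df["change"] = step.ne(0).astype(int)
print(df)
```

The size of the difference between two codes has no meaning here, which is why the result is only tested against zero.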

One way to get the indicator:

# Compare each job with the previous one within the same id; 1 marks a change.
df["change"] = (
    df.groupby(["id"])["job"].transform(lambda x: x.ne(x.shift().bfill())).astype(int)
)
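Two things may be worth checking with unbalanced panels: the shift-based comparison relies on rows being in date order within each id, and an id with a single observation has nothing to back-fill from, so its only row would compare against NaN and be flagged. The sketch below is one possible way to guard against both. The day/month/year date format and the extra id 3 row are assumptions made for illustration only.

```python
import pandas as pd

# Rows deliberately out of order; id 3 is a hypothetical single-observation person.
df = pd.DataFrame(
    {
        "id": [2, 1, 2, 1, 2, 3],
        "date": ["01/03/2021", "01/01/2021", "01/01/2021",
                 "01/02/2021", "01/02/2021", "01/01/2021"],
        "job": ["B", "A", "A", "A", "B", "C"],
    }
)

# Assumed format: day/month/year. Adjust if the real data is month/day/year.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
df = df.sort_values(["id", "date"]).reset_index(drop=True)

# Previous job of the same id; missing on each id's first observation.
prev_job = df.groupby("id")["job"].shift()

# Flag a change only when a previous job exists and differs from the current one.
df["change"] = (prev_job.notna() & df["job"].ne(prev_job)).astype(int)
print(df)
```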

CodePudding user response：

Here is the (longer) solution that I worked out:

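import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": [0, 1, 0, 1, 2],
        "job": ["A", "A", "A", "B", "B"],
    }
)

# Index by person and date so that rows are ordered in time within each id.
df1 = df.set_index(['id', 'date']).sort_index()
# Previous job of the same person (NaN on each person's first observation).
df1['job_lag'] = df1.groupby(level='id')['job'].shift()
# Fill the first observation with the current job so it is not counted as a change.
# Assigning the result back avoids relying on an in-place fill through attribute access.
df1['job_lag'] = df1['job_lag'].fillna(df1['job'])

def change(x):
    # Return 1 if the job differs from the previous observation, else 0.
    if x['job'] != x['job_lag']:
        return 1
    else:
        return 0

df1['dummy'] = df1.apply(change, axis=1)
df1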
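The same lagged-column idea can probably be written without the row-wise function by comparing the two columns directly. This is a sketch of that variant rather than a replacement for the approach above. The "switches" name is hypothetical and only there to show how the rows with a change could be pulled out.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": [0, 1, 0, 1, 2],
        "job": ["A", "A", "A", "B", "B"],
    }
)

df1 = df.set_index(["id", "date"]).sort_index()
df1["job_lag"] = df1.groupby(level="id")["job"].shift()
df1["job_lag"] = df1["job_lag"].fillna(df1["job"])

# Column-wise comparison instead of apply(..., axis=1).
df1["dummy"] = df1["job"].ne(df1["job_lag"]).astype(int)

# Rows where a switch happened, with the old and new job side by side.
switches = df1.loc[df1["dummy"].eq(1), ["job_lag", "job"]]
print(switches)
```

On large panels the column-wise comparison usually runs faster than calling a Python function for every row.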