Transform rows with stage changes and dates into "to" and "from"-CodePudding

I have the following dataset showing the times when a person moves to a new stage:

Name	Stage	Amount	Date
Karen	One	$1	01/01/21
Karen	Two	$1	08/12/21
Karen	Three	$1	05/03/22
Jaren	Three	$4	02/02/21
Jaren	One	$4	07/19/22
Laren	One	$5	04/07/21
Laren	Two	$5	08/17/22

I'm looking to understand the flow of people and amount through the stages (the amounts are constant per person). So I need to transform the dataset into the following table:

From Stage	To Stage	Amount	Record Count
One	Two	$6	2
One	Three	$0	0
Two	One	$0	0
Two	Three	$1	1
Three	One	$4	1
Three	Two	$0	0

I'm looking to sum the amount only by person and take the number of people that moved from each stage to the other (all possibilities).

CodePudding user response：

# Convert to numerical:
df.Amount = df.Amount.str.lstrip('$').astype(int)

# Make Stage Categorical:
df.Stage = df.Stage.astype('category')

# Optional: Make sure Dates are sorted within each group.
# Wasn't needed in your sample data.
# df.Date = pd.to_datetime(df.Date)
# df = df.sort_values(['Name', 'Date'])

# Find Next Stage for each Stage: 
df['Next_Stage'] = df.groupby('Name')['Stage'].shift(-1)

# Now when we pivot, all categories are represented:
out = (df.pivot_table(index=['Stage', 'Next_Stage'], values='Amount', aggfunc=['sum', 'count'])
         .droplevel(1, 1) # Get rid of "Amount" header.
         .reset_index()   # Reset the Index.
         [lambda x: x['Stage'].ne(x['Next_Stage'])]) # Remove Rows where Stages are the same.
print(out)

Output:

   Stage Next_Stage  sum  count
1    One      Three    0      0
2    One        Two    6      2
3  Three        One    4      1
5  Three        Two    0      0
6    Two        One    0      0
7    Two      Three    1      1

CodePudding user response：

I went on a brute force approach to define the columns 'From Stage' and 'To Stage'

#redefine column as float
df['Amount'] = df['Amount'].apply(lambda x: x[1:]).astype('float')

#define new columns
df['From Stage'] = 'nil'
df['To Stage'] = 'nil'
for i in range(1, df.shape[0]):
    if df.loc[i-1, 'Name'] == df.loc[i, 'Name']:
        df.loc[i, 'From Stage'] = df.loc[i-1, 'Stage']
        df.loc[i, 'To Stage'] = df.loc[i, 'Stage']
df.drop(df.index[df['From Stage']=='nil'], inplace=True)
print(df)

df1 = df.pivot_table(index=['From Stage', 'To Stage'], values='Amount', aggfunc=['sum', 'count'])
print(df1)

Output:

    Name  Stage  Amount      Date From Stage To Stage
1  Karen    Two     1.0  08/12/21        One      Two
2  Karen  Three     1.0  05/03/22        Two    Three
4  Jaren    One     4.0  07/19/22      Three      One
6  Laren    Two     5.0  08/17/22        One      Two

                        sum   count
                    Amount$ Amount$
From Stage To Stage                
One        Two          6.0       2
Three      One          4.0       1
Two        Three        1.0       1

Edit: improved code

#redefine column as float, thanks @BeRT2me
df['Amount'] = df['Amount'].str[1:].astype('float')

#define new columns
df['From Name'] = df['Name'].shift(1)
df['From Stage'] = df['Stage'].shift(1)
df['To Stage'] = df['Stage']
df.drop(df.index[df['From Name']!=df['Name']], inplace=True)
print(df)

df1 = df.pivot_table(index=['From Stage', 'To Stage'], values='Amount', aggfunc=['sum', 'count'])
print(df1)

Output:

    Name  Stage  Amount      Date From Name From Stage To Stage
1  Karen    Two     1.0  08/12/21     Karen        One      Two
2  Karen  Three     1.0  05/03/22     Karen        Two    Three
4  Jaren    One     4.0  07/19/22     Jaren      Three      One
6  Laren    Two     5.0  08/17/22     Laren        One      Two

                       sum  count
                    Amount Amount
From Stage To Stage              
One        Two         6.0      2
Three      One         4.0      1
Two        Three       1.0      1