I have the following dataset showing the times when a person moves to a new stage:
Name | Stage | Amount | Date |
---|---|---|---|
Karen | One | $1 | 01/01/21 |
Karen | Two | $1 | 08/12/21 |
Karen | Three | $1 | 05/03/22 |
Jaren | Three | $4 | 02/02/21 |
Jaren | One | $4 | 07/19/22 |
Laren | One | $5 | 04/07/21 |
Laren | Two | $5 | 08/17/22 |
I'm looking to understand the flow of people and amount through the stages (the amounts are constant per person). So I need to transform the dataset into the following table:
From Stage | To Stage | Amount | Record Count |
---|---|---|---|
One | Two | $6 | 2 |
One | Three | $0 | 0 |
Two | One | $0 | 0 |
Two | Three | $1 | 1 |
Three | One | $4 | 1 |
Three | Two | $0 | 0 |
I'm looking to sum the amount only by person and take the number of people that moved from each stage to the other (all possibilities).
CodePudding user response:
# Convert to numerical:
df.Amount = df.Amount.str.lstrip('$').astype(int)
# Make Stage Categorical:
df.Stage = df.Stage.astype('category')
# Optional: Make sure Dates are sorted within each group.
# Wasn't needed in your sample data.
# df.Date = pd.to_datetime(df.Date)
# df = df.sort_values(['Name', 'Date'])
# Find Next Stage for each Stage:
df['Next_Stage'] = df.groupby('Name')['Stage'].shift(-1)
# Now when we pivot, all categories are represented:
out = (df.pivot_table(index=['Stage', 'Next_Stage'], values='Amount', aggfunc=['sum', 'count'])
.droplevel(1, 1) # Get rid of "Amount" header.
.reset_index() # Reset the Index.
[lambda x: x['Stage'].ne(x['Next_Stage'])]) # Remove Rows where Stages are the same.
print(out)
Output:
Stage Next_Stage sum count
1 One Three 0 0
2 One Two 6 2
3 Three One 4 1
5 Three Two 0 0
6 Two One 0 0
7 Two Three 1 1
CodePudding user response:
I went on a brute force approach to define the columns 'From Stage' and 'To Stage'
#redefine column as float
df['Amount'] = df['Amount'].apply(lambda x: x[1:]).astype('float')
#define new columns
df['From Stage'] = 'nil'
df['To Stage'] = 'nil'
for i in range(1, df.shape[0]):
if df.loc[i-1, 'Name'] == df.loc[i, 'Name']:
df.loc[i, 'From Stage'] = df.loc[i-1, 'Stage']
df.loc[i, 'To Stage'] = df.loc[i, 'Stage']
df.drop(df.index[df['From Stage']=='nil'], inplace=True)
print(df)
df1 = df.pivot_table(index=['From Stage', 'To Stage'], values='Amount', aggfunc=['sum', 'count'])
print(df1)
Output:
Name Stage Amount Date From Stage To Stage
1 Karen Two 1.0 08/12/21 One Two
2 Karen Three 1.0 05/03/22 Two Three
4 Jaren One 4.0 07/19/22 Three One
6 Laren Two 5.0 08/17/22 One Two
sum count
Amount$ Amount$
From Stage To Stage
One Two 6.0 2
Three One 4.0 1
Two Three 1.0 1
Edit: improved code
#redefine column as float, thanks @BeRT2me
df['Amount'] = df['Amount'].str[1:].astype('float')
#define new columns
df['From Name'] = df['Name'].shift(1)
df['From Stage'] = df['Stage'].shift(1)
df['To Stage'] = df['Stage']
df.drop(df.index[df['From Name']!=df['Name']], inplace=True)
print(df)
df1 = df.pivot_table(index=['From Stage', 'To Stage'], values='Amount', aggfunc=['sum', 'count'])
print(df1)
Output:
Name Stage Amount Date From Name From Stage To Stage
1 Karen Two 1.0 08/12/21 Karen One Two
2 Karen Three 1.0 05/03/22 Karen Two Three
4 Jaren One 4.0 07/19/22 Jaren Three One
6 Laren Two 5.0 08/17/22 Laren One Two
sum count
Amount Amount
From Stage To Stage
One Two 6.0 2
Three One 4.0 1
Two Three 1.0 1