I have a dataframe as such -
| ID | Status | Data | Column |
|:----:|:------:|:-------------:|:-------|
| 1 | A | Old |111 |
| 2 | B | Old |222 |
| 3 | C | Old |333 |
| 2 | C | New |222 |
| 3 | D | New |333 |
| 1 | E | New |111 |
| 4 | E | New |444 |
| 4 | E | Old |444 |
I want the output to highlight changes based on ID and Data like this , this should be regardless of IDs and Data columns position in row as given in sample above.
So, the result should look like this -
| ID | Change Status | Data | Final Status| Column |
|:----:|:-------------:|:-------------:|:-----------:|:------:|
| 1 | A -> E | New | E | 111 |
| 2 | B -> C | New | C | 222 |
| 3 | C -> D | New | D | 333 |
| 4 | No Change | New | E | 444 |
CodePudding user response:
I would pivot this dataframe to get the changes:
df = df.pivot(index=['ID', 'Column'], columns='Data', values='Status').reset_index()
Data ID Column New Old
0 1 111 E A
1 2 222 C B
2 3 333 D C
3 4 444 E E
You can then add in the other bits to adjust this into the desired format.
To get the 'Change Status'
column, you can use numpy's select, which will assign a different value if the old and new columns match or not, as below:
values = [df['Old'] ' -> ' df['New'], 'No Change']
conditions = [df['New'] != df['Old'], df['New'] == df['Old']]
df['Change Status'] = np.select(conditions, values)
Data ID Column New Old Change Status
0 1 111 E A A -> E
1 2 222 C B B -> C
2 3 333 D C C -> D
3 4 444 E E No Change
Then the 'Data'
column is just populated with 'New', so this can be done using:
df['Data'] = 'New'
Data ID Column New Old Change Status Data
0 1 111 E A A -> E New
1 2 222 C B B -> C New
2 3 333 D C C -> D New
3 4 444 E E No Change New
The 'Final Status'
column is just the 'New'
column, so you can rename this:
df = df.rename(columns={'New':'Final Status'})
Data ID Column Final Status Old Change Status Data
0 1 111 E A A -> E New
1 2 222 C B B -> C New
2 3 333 D C C -> D New
3 4 444 E E No Change New
Now you can just select the columns you want, in the order above:
df = df[['ID', 'Change Status', 'Data', 'Final Status', 'Column']]
Data ID Change Status Data Final Status Column
0 1 A -> E New E 111
1 2 B -> C New C 222
2 3 C -> D New D 333
3 4 No Change New E 444
(If you want ID to be a column not an index, you can just use .reset_index()
)
CodePudding user response:
IIUC you can split the data by grouping them and then merge them together.
(idx1, grp1), (idx2, grp2) = df.groupby('Data')
out = grp2[['ID', 'Status']].merge(grp1, on='ID', suffixes=['_Change', '_Final'])
print(out)
ID Status_Change Status_Final Data Column
0 1 A E New 111
1 2 B C New 222
2 3 C D New 333
3 4 E E New 444