Home > database >  Highlight Changes in the same Column in new Dataframe
Highlight Changes in the same Column in new Dataframe

Time:09-14

I have a dataframe as such -

| ID   | Status | Data          | Column |
|:----:|:------:|:-------------:|:-------|
| 1    | A      | Old           |111     |
| 2    | B      | Old           |222     |
| 3    | C      | Old           |333     |
| 2    | C      | New           |222     |
| 3    | D      | New           |333     |
| 1    | E      | New           |111     |
| 4    | E      | New           |444     |
| 4    | E      | Old           |444     |

I want the output to highlight changes based on ID and Data like this , this should be regardless of IDs and Data columns position in row as given in sample above.

So, the result should look like this -

| ID   | Change Status | Data          | Final Status| Column | 
|:----:|:-------------:|:-------------:|:-----------:|:------:|
| 1    | A    -> E     | New           | E           | 111    | 
| 2    | B    -> C     | New           | C           | 222    |
| 3    | C    -> D     | New           | D           | 333    |
| 4    | No Change     | New           | E           | 444    |

CodePudding user response:

I would pivot this dataframe to get the changes:

df = df.pivot(index=['ID', 'Column'], columns='Data', values='Status').reset_index()

    Data  ID  Column New Old
0      1     111   E   A
1      2     222   C   B
2      3     333   D   C
3      4     444   E   E

You can then add in the other bits to adjust this into the desired format.

To get the 'Change Status' column, you can use numpy's select, which will assign a different value if the old and new columns match or not, as below:

values = [df['Old']   ' -> '   df['New'], 'No Change']
conditions = [df['New'] != df['Old'], df['New'] == df['Old']]
df['Change Status'] = np.select(conditions, values)

    Data  ID  Column New Old Change Status
0      1     111   E   A        A -> E
1      2     222   C   B        B -> C
2      3     333   D   C        C -> D
3      4     444   E   E     No Change

Then the 'Data' column is just populated with 'New', so this can be done using:

df['Data'] = 'New'

    Data  ID  Column New Old Change Status Data
0      1     111   E   A        A -> E  New
1      2     222   C   B        B -> C  New
2      3     333   D   C        C -> D  New
3      4     444   E   E     No Change  New

The 'Final Status' column is just the 'New' column, so you can rename this:

df = df.rename(columns={'New':'Final Status'})

    Data  ID  Column Final Status Old Change Status Data
0      1     111            E   A        A -> E  New
1      2     222            C   B        B -> C  New
2      3     333            D   C        C -> D  New
3      4     444            E   E     No Change  New

Now you can just select the columns you want, in the order above:

df = df[['ID', 'Change Status', 'Data', 'Final Status', 'Column']]

    
    Data  ID Change Status Data Final Status  Column
0      1        A -> E  New            E     111
1      2        B -> C  New            C     222
2      3        C -> D  New            D     333
3      4     No Change  New            E     444

(If you want ID to be a column not an index, you can just use .reset_index())

CodePudding user response:

IIUC you can split the data by grouping them and then merge them together.

(idx1, grp1), (idx2, grp2) = df.groupby('Data')

out = grp2[['ID', 'Status']].merge(grp1, on='ID', suffixes=['_Change', '_Final'])
print(out)
   ID Status_Change Status_Final Data  Column
0   1             A            E  New     111
1   2             B            C  New     222
2   3             C            D  New     333
3   4             E            E  New     444
  • Related