Python iterating through data and returning deltas-CodePudding

Python newbie here with a challenge I'm working to solve...

My goal is to iterate through a data frame and return what changed line by line. Here's what I have so far:

pseudo code (may not be correct method) step 1: set row 0 to an initial value step 2: compare row 1 to row 0, add changes to a list and record row number step 3: set current row to new initial step 4: compare row 2 to row 1, add changes to a list and record row number step 5: iterate through all rows step 6: return a table with changes and row index where change occurred

d = {
'col1' : [1, 1, 2, 2, 3],
'col2' : [1, 2, 2, 2, 2],
'col3' : [1, 1, 2, 2, 2]
}

df = pd.DataFrame(data=d)

def delta():
    changes = []
    initial = df.loc[0]
    for row in df:
         if row[i] != initial:
              changes.append[i]
delta()

changes I expect to see: index 1: col2 changed from 1 to 2, 2 should be added to changes list index 2: col 1 and col3 changed from 1 to 2, both 2s should be added to changes list index 4: col 1 changed from 2 to 3, 3 should be added to changes list

CodePudding user response：

You can check where each of the columns have changed using the shift method and then use a mask to only get the ones that have changed

df.loc[:, 'col1_changed'] = df['col1'].mask(df['col1'].eq(df['col1'].shift()))
df.loc[:, 'col2_changed'] = df['col2'].mask(df['col2'].eq(df['col2'].shift()))
df.loc[:, 'col3_changed'] = df['col3'].mask(df['col3'].eq(df['col3'].shift()))

Once you have identified the changes, you can agg them together

# We don't consider the first row
df.loc[0, ['col1_changed', 'col2_changed', 'col3_changed']] = [np.nan] * 3
df[['col1_changed', 'col2_changed', 'col3_changed']].astype('str').agg(','.join, axis=1).str.replace('nan', 'no change')
#0    no change,no change,no change
#1          no change,2.0,no change
#2                2.0,no change,2.0
#3    no change,no change,no change
#4          3.0,no change,no change

CodePudding user response：

You can use the pandas function diff() which will already provide the increment compared to the previous row:

import pandas as pd

d = {
'col1' : [1, 1, 2, 2, 3],
'col2' : [1, 2, 2, 2, 2],
'col3' : [1, 1, 2, 2, 2]
}

df = pd.DataFrame(data=d)

def delta(df):
    deltas = df.diff()                # will convert to float because this is needed to set Nans in the first row
    deltas.iloc[0] = df.iloc[0]       # replace Nans in first row with original data from first row
    deltas = deltas.astype(df.dtypes) # reset data types according to input data
    filter = (deltas!=0).any(axis=1)  # filter to use only those rows where all values are non zero
    filter.iloc[0] = True             # make sure the first row is included even if original data for first row held only zeros
    deltas = deltas.loc[filter]       # actually apply the filter
    return deltas

print( delta(df) )

This prints:

   col1  col2  col3
0     1     1     1
1     0     1     0
2     1     0     1
4     1     0     0

For comparison, the input data printed (print(df)):

   col1  col2  col3
0     1     1     1
1     1     2     1
2     2     2     2
3     2     2     2
4     3     2     2