Python newbie here with a challenge I'm working to solve...
My goal is to iterate through a data frame and return what changed line by line. Here's what I have so far:
pseudo code (may not be correct method) step 1: set row 0 to an initial value step 2: compare row 1 to row 0, add changes to a list and record row number step 3: set current row to new initial step 4: compare row 2 to row 1, add changes to a list and record row number step 5: iterate through all rows step 6: return a table with changes and row index where change occurred
d = {
'col1' : [1, 1, 2, 2, 3],
'col2' : [1, 2, 2, 2, 2],
'col3' : [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)
def delta():
changes = []
initial = df.loc[0]
for row in df:
if row[i] != initial:
changes.append[i]
delta()
changes I expect to see: index 1: col2 changed from 1 to 2, 2 should be added to changes list index 2: col 1 and col3 changed from 1 to 2, both 2s should be added to changes list index 4: col 1 changed from 2 to 3, 3 should be added to changes list
CodePudding user response:
You can check where each of the columns have changed using the shift
method and then use a mask
to only get the ones that have changed
df.loc[:, 'col1_changed'] = df['col1'].mask(df['col1'].eq(df['col1'].shift()))
df.loc[:, 'col2_changed'] = df['col2'].mask(df['col2'].eq(df['col2'].shift()))
df.loc[:, 'col3_changed'] = df['col3'].mask(df['col3'].eq(df['col3'].shift()))
Once you have identified the changes, you can agg
them together
# We don't consider the first row
df.loc[0, ['col1_changed', 'col2_changed', 'col3_changed']] = [np.nan] * 3
df[['col1_changed', 'col2_changed', 'col3_changed']].astype('str').agg(','.join, axis=1).str.replace('nan', 'no change')
#0 no change,no change,no change
#1 no change,2.0,no change
#2 2.0,no change,2.0
#3 no change,no change,no change
#4 3.0,no change,no change
CodePudding user response:
You can use the pandas function diff()
which will already provide the increment compared to the previous row:
import pandas as pd
d = {
'col1' : [1, 1, 2, 2, 3],
'col2' : [1, 2, 2, 2, 2],
'col3' : [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)
def delta(df):
deltas = df.diff() # will convert to float because this is needed to set Nans in the first row
deltas.iloc[0] = df.iloc[0] # replace Nans in first row with original data from first row
deltas = deltas.astype(df.dtypes) # reset data types according to input data
filter = (deltas!=0).any(axis=1) # filter to use only those rows where all values are non zero
filter.iloc[0] = True # make sure the first row is included even if original data for first row held only zeros
deltas = deltas.loc[filter] # actually apply the filter
return deltas
print( delta(df) )
This prints:
col1 col2 col3
0 1 1 1
1 0 1 0
2 1 0 1
4 1 0 0
For comparison, the input data printed (print(df)
):
col1 col2 col3
0 1 1 1
1 1 2 1
2 2 2 2
3 2 2 2
4 3 2 2