How to iterate rows and check for multiple changes in column?-CodePudding

I have multiple columns in my data frame but three columns are important.

id	Date	Answer
12222	2020-05-01	N
12222	2020-05-02	Y
12222	2020-05-03	N
12222	2020-05-04	Y
12223	2020-05-06	Y
12224	2020-05-07	Y
12224	2020-05-08	Y
12225	2020-05-09	N
12225	2020-05-09	Y

For each id that have multiple changes in their answer I need to find and count that change in a separate column N -> Y is a change or Y->N but it has to be for the same id.

Therefore N->Y->N->Y would be 3 changes. N->Y->N->N is only two changes.

I have sorted the data frame in ascending order using the below code and tried to do a count but only getting a count of the number of values, responding to that id rather than changes.

df_changes = df.sort_values(by=['id', 'Date'], ascending =[True,True])
df_changes_2 = df_changes.groupby(['id','answer']).size().reset_index(name="Count")

Possible Output:

id	Date	Answer	Count
12222	2020-05-01	N	3
12223	2020-05-06	Y	0
12224	2020-05-07	Y	0
12225	2020-05-09	N	1

CodePudding user response：

You can try transform then drop_duplicates

df['Count'] = df.groupby('id', as_index=False)['Answer'].transform(lambda col: col.ne(col.shift()).sum() - 1)

print(df)

      id        Date Answer  Count
0  12222  2020-05-01      N      3
1  12222  2020-05-02      Y      3
2  12222  2020-05-03      N      3
3  12222  2020-05-04      Y      3
4  12223  2020-05-06      Y      0
5  12224  2020-05-07      Y      0
6  12224  2020-05-08      Y      0
7  12225  2020-05-09      N      1
8  12225  2020-05-09      Y      1

df = df.drop_duplicates('id', keep='first')

print(df)

      id        Date Answer  Count
0  12222  2020-05-01      N      3
4  12223  2020-05-06      Y      0
5  12224  2020-05-07      Y      0
7  12225  2020-05-09      N      1

CodePudding user response：

I think this can help you find the solution you need.

import pandas as pd

d = {'id' : ['12222', '12222', '12222', '12222', '12223', '12224', '12224', '12225', '12225']
,'Date' : ['2020-05-01', '2020-05-02', '2020-05-02', '2020-05-03', '2020-05-02', '2020-05-02', '2020-05-02', '2020-05-02', '2020-05-02'],
'Answer' : ['N', 'Y', 'N','Y', 'N', 'Y', 'Y', 'N', 'Y']}

df = pd.DataFrame(data = d)



col_ANSW = df['Answer'].tolist()
col_ID = df['id'].tolist()
cont = -1
y = ''
z = ''
cont_as = []
for x in range(0, len(col_ANSW)):
    
    if ((col_ID[x] != z) and (x > 1)):
        cont_as.append((str(col_ID[x]),str(cont)))
        cont = -1
    else:
        cont_as.append((str(col_ID[x]),'0'))
    if (col_ANSW[x] != y):
        cont  = 1
    y = col_ANSW[x]
    z = col_ID[x]
    
cont_as.append((str(col_ID[x]),str(cont)))
cont_as = pd.DataFrame(cont_as, columns = ['id','CONT'])
cont_as = cont_as.set_index('id')
df = pd.merge(df, cont_as, how = 'left', on = 'id')

print(str(df))