I have multiple columns in my data frame but three columns are important.
id | Date | Answer |
---|---|---|
12222 | 2020-05-01 | N |
12222 | 2020-05-02 | Y |
12222 | 2020-05-03 | N |
12222 | 2020-05-04 | Y |
12223 | 2020-05-06 | Y |
12224 | 2020-05-07 | Y |
12224 | 2020-05-08 | Y |
12225 | 2020-05-09 | N |
12225 | 2020-05-09 | Y |
For each id that have multiple changes in their answer I need to find and count that change in a separate column N -> Y is a change or Y->N but it has to be for the same id.
Therefore N->Y->N->Y would be 3 changes. N->Y->N->N is only two changes.
I have sorted the data frame in ascending order using the below code and tried to do a count but only getting a count of the number of values, responding to that id rather than changes.
df_changes = df.sort_values(by=['id', 'Date'], ascending =[True,True])
df_changes_2 = df_changes.groupby(['id','answer']).size().reset_index(name="Count")
Possible Output:
id | Date | Answer | Count |
---|---|---|---|
12222 | 2020-05-01 | N | 3 |
12223 | 2020-05-06 | Y | 0 |
12224 | 2020-05-07 | Y | 0 |
12225 | 2020-05-09 | N | 1 |
CodePudding user response:
You can try transform
then drop_duplicates
df['Count'] = df.groupby('id', as_index=False)['Answer'].transform(lambda col: col.ne(col.shift()).sum() - 1)
print(df)
id Date Answer Count
0 12222 2020-05-01 N 3
1 12222 2020-05-02 Y 3
2 12222 2020-05-03 N 3
3 12222 2020-05-04 Y 3
4 12223 2020-05-06 Y 0
5 12224 2020-05-07 Y 0
6 12224 2020-05-08 Y 0
7 12225 2020-05-09 N 1
8 12225 2020-05-09 Y 1
df = df.drop_duplicates('id', keep='first')
print(df)
id Date Answer Count
0 12222 2020-05-01 N 3
4 12223 2020-05-06 Y 0
5 12224 2020-05-07 Y 0
7 12225 2020-05-09 N 1
CodePudding user response:
I think this can help you find the solution you need.
import pandas as pd
d = {'id' : ['12222', '12222', '12222', '12222', '12223', '12224', '12224', '12225', '12225']
,'Date' : ['2020-05-01', '2020-05-02', '2020-05-02', '2020-05-03', '2020-05-02', '2020-05-02', '2020-05-02', '2020-05-02', '2020-05-02'],
'Answer' : ['N', 'Y', 'N','Y', 'N', 'Y', 'Y', 'N', 'Y']}
df = pd.DataFrame(data = d)
col_ANSW = df['Answer'].tolist()
col_ID = df['id'].tolist()
cont = -1
y = ''
z = ''
cont_as = []
for x in range(0, len(col_ANSW)):
if ((col_ID[x] != z) and (x > 1)):
cont_as.append((str(col_ID[x]),str(cont)))
cont = -1
else:
cont_as.append((str(col_ID[x]),'0'))
if (col_ANSW[x] != y):
cont = 1
y = col_ANSW[x]
z = col_ID[x]
cont_as.append((str(col_ID[x]),str(cont)))
cont_as = pd.DataFrame(cont_as, columns = ['id','CONT'])
cont_as = cont_as.set_index('id')
df = pd.merge(df, cont_as, how = 'left', on = 'id')
print(str(df))