Counting the frequency of change within a variable

I hope to identify how often 'type' (a dummy variable) changes for each 'id' across quarters. For example, person 1 switched type from 'a' to 'b' and then back to 'a' between 2019 and 2020, so this person switches twice. Person 2 switched only once. Person 3 never switches. I then want to add a column called "frequency" that records the number of changes across quarters. In that column, person 1 should have 2, person 2 should have 1, and person 3 should have 0.

I am quite new to Python and do not have any existing code yet. Thank you for your help!

year	quarter	type	person id
2020	q1	a	1
2020	q2	b	1
2019	q1	a	1
2019	q1	a	2
2019	q4	b	2
2019	q1	a	3
2019	q4	a	3

CodePudding user response：

Well, I managed to do what was asked, sort of. I agree the code could be simplified further, but this is what I could do for now. Basically, I first made sense of all the data that was available. At first I made room for the 8 quarters between 2019 and 2020, but as the data only covers some of them, I dropped that idea.

So first I made a list l holding 7 placeholder entries, one per data row, each with three empty slots:

l = []
for i in range(7):
    # build a fresh placeholder per row; appending one shared list would alias all 7 entries
    l.append([[], [], []])

Which gave me:

[[[], [], []],
 [[], [], []],
 [[], [], []],
 [[], [], []],
 [[], [], []],
 [[], [], []],
 [[], [], []]]

Also, when the rows are processed in the order given, the year and quarter are irrelevant to the frequency calculation for each person, because only the type and the person id contribute to it. They may be needed later on, though, so I kept them. Since I did not see an immediate use for them, I stored them in a compact form: 2019 becomes 1 because it is the first year, 2020 becomes 2, and the quarters become 1, 2, 3, 4. For example, the first row says: in 2020, quarter 1, person 1 had type a. I stored it in the list as:

l[0] = [21,'a',1 ]

Here, 21 stands for 2: 2020 and 1: quarter 1, 'a' is the type, and 1 is the person. Quite simple actually.
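
The two-digit code described above could also be produced by a small helper instead of by hand. This is only a sketch: the function name `encode_period` and the `base_year` argument are assumptions made for illustration and are not part of the answer's code.

```python
def encode_period(year, quarter, base_year=2019):
    """Encode a (year, quarter) pair as a two-digit number.

    The tens digit is the year's position counting from base_year
    (2019 -> 1, 2020 -> 2); the units digit is the quarter ('q1' -> 1).
    """
    year_index = year - base_year + 1
    quarter_index = int(quarter.lstrip("q"))
    return year_index * 10 + quarter_index


print(encode_period(2020, "q1"))  # 21
print(encode_period(2019, "q4"))  # 14
```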

Similarly, I repeated the same for the rest to get:

l[0] = [21,'a',1 ]
l[1] = [22,'b',1 ]
l[2] = [11,'a',1 ]
l[3] = [11,'a',2 ]
l[4] = [14,'b',2 ]
l[5] = [11,'a',3 ]
l[6] = [14,'a',3 ]

Now I had stored them in the list l and it looked like this:

l = 
[[21, 'a', 1],
 [22, 'b', 1],
 [11, 'a', 1],
 [11, 'a', 2],
 [14, 'b', 2],
 [11, 'a', 3],
 [14, 'a', 3]]

At last, the final step: building the frequency list, named f. It is a list of 3 inner lists, one for each of the 3 persons. Each inner list has two elements: the first is the count of how many times that person has changed type, and the second is the person's current type. To make the calculation easier, I gave everyone -1 as the first element and a placeholder type 'c' as the second:

f = [[-1,'c'],[-1,'c'],[-1,'c']]

Why, you ask? I coded it so that whenever a person's type changes, the frequency is incremented. Because the real type is always either a or b, the first row seen for each person never matches 'c', so the count is guaranteed to go from -1 to 0. That saves us from having to look up and store each person's first type before the loop begins.

Now, to calculate the frequency and thereby solve the problem. This is the easy part. In the list f, index 0 is for person 1, index 1 for person 2, and index 2 for person 3. Also, in every inner list of l, the third element is the person id. So if we take that third element and subtract 1, we get the index of that person in the frequency list. For example, l[0] = [21,'a',1 ], so l[0][2] = 1, and subtracting 1 gives 0, the index of the first person in f. We then loop over l and check whether the type stored for the person in f matches the one in the current row. If it does, we move on to the next row; if it doesn't, we increment the first element by 1, since the frequency has increased, and update the second element to the new type.

And that is it:

for i in range(7):
    pid = l[i][2] 
    if f[pid-1][1] != l[i][1]:
        f[pid-1][1] = l[i][1]
        f[pid-1][0] = f[pid-1][0] + 1

And after running the entire code which is:

l = []
for i in range(7):
    # build a fresh placeholder per row; appending one shared list would alias all 7 entries
    l.append([[], [], []])
l[0] = [21,'a',1 ]
l[1] = [22,'b',1 ]
l[2] = [11,'a',1 ]
l[3] = [11,'a',2 ]
l[4] = [14,'b',2 ]
l[5] = [11,'a',3 ]
l[6] = [14,'a',3 ]
f = [[-1,'c'],[-1,'c'],[-1,'c']]
for i in range(7):
    pid = l[i][2] 
    if f[pid-1][1] != l[i][1]:
        f[pid-1][1] = l[i][1]
        f[pid-1][0] = f[pid-1][0] + 1
# f[k] belongs to person k+1, so number the entries from 1 when printing
for pid, (count, _) in enumerate(f, start=1):
    print(f"The person {pid} has changed his type {count} times.")

We finally get:

The person 1 has changed his type 2 times.
The person 2 has changed his type 1 times.
The person 3 has changed his type 0 times.
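
The loop above relies on exactly three people with ids 1 to 3 and on the placeholder type 'c'. A slightly more general sketch of the same idea keeps the last seen type per person in a dictionary. The function name `count_switches` and the tuple layout of the rows are assumptions made for this example, and the period codes follow the year/quarter scheme described earlier. As in the answer, rows are processed in the order given.

```python
def count_switches(rows):
    """Count type changes per person, processing rows in the order given.

    rows is an iterable of (period_code, type, person_id) tuples.
    """
    last_type = {}   # person_id -> most recently seen type
    switches = {}    # person_id -> number of changes so far
    for _code, kind, pid in rows:
        if pid not in last_type:
            switches[pid] = 0      # first record: nothing to compare with
        elif last_type[pid] != kind:
            switches[pid] += 1     # type differs from the previous record
        last_type[pid] = kind
    return switches


rows = [
    (21, "a", 1), (22, "b", 1), (11, "a", 1),
    (11, "a", 2), (14, "b", 2),
    (11, "a", 3), (14, "a", 3),
]
print(count_switches(rows))  # {1: 2, 2: 1, 3: 0}
```

Passing `sorted(rows, key=lambda r: (r[2], r[0]))` instead would count the changes in chronological order per person, which gives 1 rather than 2 for person 1.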

CodePudding user response：

You have tagged your question with dataframe: do you mean a pandas DataFrame? If so:

Your dataframe:

df = 
   year quarter type  person id
0  2020      q1    a          1
1  2020      q2    b          1
2  2019      q1    a          1
3  2019      q1    a          2
4  2019      q4    b          2
5  2019      q1    a          3
6  2019      q4    a          3

The result of

df_freq = (
    df.sort_values(["person id", "year", "quarter"])
      .groupby("person id", as_index=False).type
      .apply(lambda col: (col != col.shift()).sum() - 1)
      .rename(columns={"type": "frequency"})
)

   person id  frequency
0          1          1
1          2          1
2          3          0

But I don't understand why "person 1 should have 2". If you look at the chronologically ordered (per person id) dataframe

   year quarter type  person id
0  2019      q1    a          1
1  2020      q1    a          1
2  2020      q2    b          1
3  2019      q1    a          2
4  2019      q4    b          2
5  2019      q1    a          3
6  2019      q4    a          3

there's only 1 change?
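
To see where that 1 comes from, the comparison used in the lambda can be run by hand on person 1's chronologically ordered types. This is a small illustration only; the standalone Series `col` is built here just for the example.

```python
import pandas as pd

# Person 1's types in chronological order: 2019 q1, 2020 q1, 2020 q2.
col = pd.Series(["a", "a", "b"])

# shift() moves every value down one row, so each row is compared with
# the previous one; the first row is compared with NaN and is always True.
differs = col != col.shift()
print(differs.tolist())   # [True, False, True]

# Subtracting 1 removes the first row, which is not a real change.
print(differs.sum() - 1)  # 1
```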

If you don't sort

df_freq = (
    df.groupby("person id", as_index=False).type
      .apply(lambda col: (col != col.shift()).sum() - 1)
      .rename(columns={"type": "frequency"})
)

the result is

   person id  frequency
0          1          2
1          2          1
2          3          0
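
Since the question asks for a "frequency" column on the original rows rather than a separate summary table, one possible way to get that is sketched below. It rebuilds the sample data so it can run on its own; the intermediate names `df_sorted`, `prev_type` and `changed` are assumptions for this example. It counts changes in chronological order, so person 1 gets 1.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2020, 2020, 2019, 2019, 2019, 2019, 2019],
    "quarter": ["q1", "q2", "q1", "q1", "q4", "q1", "q4"],
    "type": ["a", "b", "a", "a", "b", "a", "a"],
    "person id": [1, 1, 1, 2, 2, 3, 3],
})

# Sort chronologically within each person; use df.copy() here instead
# to count changes in the original row order.
df_sorted = df.sort_values(["person id", "year", "quarter"])

# A row counts as a change when its type differs from the previous row
# of the same person; the first row of each person has no previous row.
prev_type = df_sorted.groupby("person id")["type"].shift()
changed = prev_type.notna() & (df_sorted["type"] != prev_type)

# Sum the changes per person and broadcast the total back to every row.
df_sorted["frequency"] = changed.groupby(df_sorted["person id"]).transform("sum")
print(df_sorted)
```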