Pandas groupby() different output with versions 0.23.4 and 1.3.4

Time:02-16

I have two codebases with the same code; the only difference is the version of pandas being used:

  • OLD environment uses pandas version 0.23.4
  • NEW environment uses pandas version 1.3.4

I have debugged my code down to this line, after which the results differ:

result = df.groupby(group_items, as_index=as_index, sort=sort)[sum_items].sum()

The variables df, group_items, as_index, sort and sum_items are exactly the same in the NEW and OLD environments.

However, the returned result is slightly different in the NEW version. Specifically, the output looks like this:

NEW environment:

df.groupby(group_items, as_index=as_index, sort=sort)[sum_items].sum()
        SST_ADJ_TYPE    SST_ADJ_RULE  ... NCI     AMOUNT
0                  0  SST22a,SST22b,  ...      1874757.0
1                  0  SST22a,SST22b,  ...      5945263.0
2                  0  SST22a,SST22b,  ...      4303110.0
3                  0  SST22a,SST22b,  ...      5342991.0
4                  0  SST22a,SST22b,  ...      9245478.0
...              ...             ...  ...  ..        ...
133674             3   SST22b,SST07,  ...      4164305.0
133675             3   SST22b,SST07,  ...      7280203.0
133676             3   SST22b,SST07,  ...      1235752.0
133677             3   SST22b,SST07,  ...      3115825.0
133678             3   SST22b,SST07,  ...      1436891.0
[133679 rows x 16 columns]

OLD environment:

df.groupby(group_items, as_index=as_index, sort=sort)[sum_items].sum()
        SST_ADJ_TYPE    SST_ADJ_RULE    ...     NCI     AMOUNT
0                  0  SST22a,SST22b,    ...          1874757.0
1                  0  SST22a,SST22b,    ...          5945263.0
2                  0  SST22a,SST22b,    ...          4303110.0
3                  0  SST22a,SST22b,    ...          5342991.0
4                  0  SST22a,SST22b,    ...          9245478.0
5                  0  SST22a,SST22b,    ...          4016202.0
6                  0  SST22a,SST22b,    ...          8799969.0
7                  0  SST22a,SST22b,    ...          1503269.0
8                  0  SST22a,SST22b,    ...          6385991.0
9                  0  SST22a,SST22b,    ...          1686520.0
10                 0  SST22a,SST22b,    ...          5287114.0
11                 0  SST22a,SST22b,    ...          2648534.0
12                 0  SST22a,SST22b,    ...          6159017.0
13                 0  SST22a,SST22b,    ...          5959591.0
14                 0  SST22a,SST22b,    ...          5809998.0
15                 0  SST22a,SST22b,    ...          4929077.0
16                 0  SST22a,SST22b,    ...          9166004.0
17                 0  SST22a,SST22b,    ...          2124498.0
18                 0  SST22a,SST22b,    ...          3051659.0
19                 0  SST22a,SST22b,    ...          1859001.0
20                 0  SST22a,SST22b,    ...          8522834.0
21                 0  SST22a,SST22b,    ...          7803526.0
22                 0  SST22a,SST22b,    ...          4067546.0
23                 0  SST22a,SST22b,    ...          9218486.0
24                 0  SST22a,SST22b,    ...          1453153.0
25                 0  SST22a,SST22b,    ...          7411706.0
26                 0  SST22a,SST22b,    ...          9160444.0
27                 0  SST22a,SST22b,    ...          6255426.0
28                 0  SST22a,SST22b,    ...          6007841.0
29                 0  SST22a,SST22b,    ...          4744588.0
...              ...             ...    ...      ..        ...
133649             3   SST22b,SST07,    ...          6487572.0
133650             3   SST22b,SST07,    ...          3593805.0
133651             3   SST22b,SST07,    ...          9192954.0
133652             3   SST22b,SST07,    ...          2394981.0
133653             3   SST22b,SST07,    ...          9398971.0
133654             3   SST22b,SST07,    ...          5536294.0
133655             3   SST22b,SST07,    ...          8759613.0
133656             3   SST22b,SST07,    ...          2012212.0
133657             3   SST22b,SST07,    ...          7930551.0
133658             3   SST22b,SST07,    ...          3407871.0
133659             3   SST22b,SST07,    ...          3071541.0
133660             3   SST22b,SST07,    ...          1863129.0
133661             3   SST22b,SST07,    ...          8439646.0
133662             3   SST22b,SST07,    ...          1518097.0
133663             3   SST22b,SST07,    ...          7396702.0
133664             3   SST22b,SST07,    ...          8470274.0
133665             3   SST22b,SST07,    ...          8363095.0
133666             3   SST22b,SST07,    ...          1115614.0
133667             3   SST22b,SST07,    ...          6317772.0
133668             3   SST22b,SST07,    ...          2645613.0
133669             3   SST22b,SST07,    ...          6555039.0
133670             3   SST22b,SST07,    ...          5274987.0
133671             3   SST22b,SST07,    ...          5779789.0
133672             3   SST22b,SST07,    ...          6974948.0
133673             3   SST22b,SST07,    ...          6370779.0
133674             3   SST22b,SST07,    ...          4164305.0
133675             3   SST22b,SST07,    ...          7280203.0
133676             3   SST22b,SST07,    ...          1235752.0
133677             3   SST22b,SST07,    ...          1436891.0
133678             3   SST22b,SST07,    ...          3115825.0
[133679 rows x 16 columns]

As you can see, the number of rows and columns is the same, and the columns themselves are identical between the two results. However, the values in the AMOUNT column appear in a different order in the NEW environment. In the last rows, for example, the final two values are swapped relative to the OLD output.

Any idea why this is happening?

PS: Unfortunately, I cannot provide a DataFrame for you to load, since the one I'm using contains a lot of data. I'm looking more for a theoretical answer: what changed between the two pandas versions mentioned above, and/or which argument to use in the NEW environment to get exactly the same result as in the OLD environment.

CodePudding user response:

Your data is not sorted, and it seems that these two pandas versions return the rows in a different order. From the pandas documentation:

The sort argument of groupby only sorts the group keys; it does not influence the order of observations within each group.
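To make the effect of the sort argument concrete, here is a minimal sketch on toy data. The DataFrame and the column name "key" are hypothetical and not taken from the question. With sort=True the group keys come back sorted. With sort=False, recent pandas versions return the groups in order of first appearance, so the row order of the result depends on the row order of the input.

```python
import pandas as pd

# Hypothetical toy data; only AMOUNT mirrors a column name from the question.
df = pd.DataFrame({
    "key": ["b", "a", "b", "a"],
    "AMOUNT": [1.0, 2.0, 3.0, 4.0],
})
group_items = ["key"]
sum_items = ["AMOUNT"]

# sort=True: rows of the result are ordered by the group keys.
sorted_keys = df.groupby(group_items, as_index=False, sort=True)[sum_items].sum()

# sort=False: groups keep the order in which they first appear in df.
first_seen = df.groupby(group_items, as_index=False, sort=False)[sum_items].sum()

print(sorted_keys)
print(first_seen)
```

The sums are identical in both results; only the row order differs.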

You can sort the result with .sort_values() after the groupby; see the docs for more detail.
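A minimal sketch of that suggestion, assuming the result should be ordered by all of the group keys. The data below is made up; only the variable names group_items and sum_items and the column names follow the question. Because each key combination appears once in the aggregated result, sorting by every group key gives a single well-defined row order that does not depend on how a given pandas version happens to arrange the groups.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
df = pd.DataFrame({
    "SST_ADJ_TYPE": [3, 0, 3, 0, 3],
    "SST_ADJ_RULE": [
        "SST22b,SST07,", "SST22a,SST22b,", "SST22b,SST07,",
        "SST22a,SST22b,", "SST07,",
    ],
    "AMOUNT": [10.0, 20.0, 30.0, 40.0, 50.0],
})
group_items = ["SST_ADJ_TYPE", "SST_ADJ_RULE"]
sum_items = ["AMOUNT"]

result = df.groupby(group_items, as_index=False, sort=False)[sum_items].sum()

# Impose an explicit order on the group keys, then renumber the rows
# so that the index matches the new order.
stable = result.sort_values(group_items).reset_index(drop=True)

print(stable)
```

Applying the same sort_values call in both environments should make the two outputs directly comparable, provided the group keys and sums themselves agree.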
