Pandas or other Python application to generate a column with 1 to n value based on other two columns

Time:03-16

I hope I can explain the question properly.

In basic terms, imagine the df below:

print(df)

year        id  
1           16100
1           150 
1           150
2           66
2           370
2           370 
2           530
3           41
3           43  
3           61

I need df.seq to be a counter that runs from 1 to n while the year stays the same and restarts at 1 when the year changes. df.seq2 works the same way, except that it stays at n instead of moving to n+1 when the id is identical to the id in the row above.

As an Excel-like formula, it would be something like:

df.seq2 = IF(A2=A1,IF(B2=B1,F1,F1+1),1)

which would make the desired output seq and seq2 below:

year        id      seq   seq2
1           16100    1     1
1           150      2     2
1           150      3     2
2           66       1     1
2           370      2     2
2           370      3     2
2           530      4     3
3           41       1     1
3           43       2     2
3           61       3     3
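
For reference, the rule described above can be written as a plain row-by-row loop. This is only a sketch to pin down the intended logic and to check faster solutions against; the function name seq_columns_loop is hypothetical, and a Python loop like this is likely to be slow on millions of rows.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "year": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "id": [16100, 150, 150, 66, 370, 370, 530, 41, 43, 61],
})

def seq_columns_loop(frame):
    """Hypothetical helper: literal translation of the Excel-like rule."""
    seq, seq2 = [], []
    prev_year = prev_id = None
    for year, id_ in zip(frame["year"], frame["id"]):
        if not seq or year != prev_year:
            # First row overall, or the year changed: restart both counters
            seq.append(1)
            seq2.append(1)
        else:
            # Same year: seq always advances, seq2 only when the id changes
            seq.append(seq[-1] + 1)
            seq2.append(seq2[-1] if id_ == prev_id else seq2[-1] + 1)
        prev_year, prev_id = year, id_
    return seq, seq2

df["seq"], df["seq2"] = seq_columns_loop(df)
print(df)
```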

I did test a couple of things like this (assuming df.seq has already been generated):


comb_df['match'] = comb_df.year.eq(comb_df.year.shift())
comb_df['match2'] = comb_df.id.eq(comb_df.id.shift())


comb_df["seq2"] = np.where((comb_df["match"].shift(+1) == True) & (comb_df["match2"].shift(+1) == True), comb_df["seq"] - 1, comb_df["seq2"])

But the problem is that this doesn't work when there are several duplicates in a row.

Perhaps this can't be solved in a purely vectorized numpy way, and I'd have to iterate over the rows?

There are 2-3 million rows, so performance will be an issue if the solution is very slow.

I need to generate both df.seq and df.seq2.

Any ideas would be extremely helpful!

CodePudding user response:

We can use groupby with cumcount and factorize:

df['seq'] = df.groupby('year').cumcount() + 1
df['seq2'] = df.groupby('year')['id'].transform(lambda x : x.factorize()[0] + 1)
df
Out[852]: 
   year     id  seq  seq2
0     1  16100    1     1
1     1    150    2     2
2     1    150    3     2
3     2     66    1     1
4     2    370    2     2
5     2    370    3     2
6     2    530    4     3
7     3     41    1     1
8     3     43    2     2
9     3     61    3     3
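
One thing to be aware of: factorize numbers the distinct id values within each year, so an id that reappears later in the same year after a different id gets its earlier number back. The Excel-style rule in the question only compares each row with the row directly above. If that stricter behaviour is what is wanted, a possible vectorized sketch is to flag the rows where a new run starts and take a cumulative sum per year. It assumes the rows of each year are contiguous, as in the sample; the name new_run is just illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "year": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "id": [16100, 150, 150, 66, 370, 370, 530, 41, 43, 61],
})

df["seq"] = df.groupby("year").cumcount() + 1

# True where a row starts a new run: the id differs from the row above,
# or the year differs (so every year begins with a fresh run).
new_run = df["id"].ne(df["id"].shift()) | df["year"].ne(df["year"].shift())

# Counting the run starts within each year gives 1, 2, 2, 3, ...
df["seq2"] = new_run.astype(int).groupby(df["year"]).cumsum()
print(df)

# Where the two approaches differ: ids 150, 66, 150 within a single year
demo = pd.DataFrame({"year": [1, 1, 1], "id": [150, 66, 150]})
by_value = (
    demo.groupby("year")["id"].transform(lambda x: x.factorize()[0] + 1).tolist()
)
run_start = demo["id"].ne(demo["id"].shift()) | demo["year"].ne(demo["year"].shift())
by_run = run_start.astype(int).groupby(demo["year"]).cumsum().tolist()
```

On the sample data both approaches give the same seq2. In the small demo frame, the factorize version yields 1, 2, 1 while the run-based version yields 1, 2, 3, so the choice depends on how non-adjacent repeats should be treated.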