I hope I can explain the question properly.
In basic terms, imagine the df below:
print(df)
year id
1 16100
1 150
1 150
2 66
2 370
2 370
2 530
3 41
3 43
3 61
I need df.seq to be a counter that cycles from 1 to n while the year stays the same and restarts at 1 when the year changes. df.seq2 should work the same way, except that it stays at n instead of moving to n + 1 when the id is identical to the id in the row above.
In Excel terms, the formula for seq2 would be something like:
df.seq2 = IF(A2=A1,IF(B2=B1,F1,F1+1),1)
which would give the desired seq and seq2 output below:
year id seq seq2
1 16100 1 1
1 150 2 2
1 150 3 2
2 66 1 1
2 370 2 2
2 370 3 2
2 530 4 3
3 41 1 1
3 43 2 2
3 61 3 3
I tried a couple of things like this (assuming df.seq has already been generated):
comb_df['match'] = comb_df.year.eq(comb_df.year.shift())
comb_df['match2'] = comb_df.id.eq(comb_df.id.shift())
comb_df["seq2"] = np.where((comb_df["match"].shift(1) == True) & (comb_df["match2"].shift(1) == True), comb_df["seq"] - 1, comb_df["seq2"])
But this doesn't work when there are several duplicates in a row.
Perhaps the issue cannot be solved in a purely vectorized NumPy way and I would have to iterate over the rows?
There are 2-3 million rows, so a slow solution would be a problem.
I need to generate both df.seq and df.seq2.
Any ideas would be extremely helpful!
CodePudding user response:
We can do this with groupby, using cumcount and factorize:
df['seq'] = df.groupby('year').cumcount() + 1
df['seq2'] = df.groupby('year')['id'].transform(lambda x: x.factorize()[0] + 1)
df
Out[852]:
year id seq seq2
0 1 16100 1 1
1 1 150 2 2
2 1 150 3 2
3 2 66 1 1
4 2 370 2 2
5 2 370 3 2
6 2 530 4 3
7 3 41 1 1
8 3 43 2 2
9 3 61 3 3
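As a possible alternative for seq2, here is a minimal sketch that follows the Excel-style formula from the question more literally: it compares each row only with the row directly above it and avoids the per-group lambda, which may help with 2-3 million rows. It assumes that rows of the same year are contiguous, as in the sample data. The names year_changed, id_changed and new_run are made up for this illustration. One difference worth noting: if an id reappears later in the same year without being adjacent (for example 150, 66, 150), factorize reuses the earlier number, while this version increments again, as the formula would.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "year": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "id": [16100, 150, 150, 66, 370, 370, 530, 41, 43, 61],
})

# seq: running row number within each year, starting at 1
df["seq"] = df.groupby("year").cumcount() + 1

# A row starts a new run when its year or its id differs from the row above.
# The first row compares against NaN, so it always counts as a new run.
year_changed = df["year"].ne(df["year"].shift())
id_changed = df["id"].ne(df["id"].shift())
new_run = year_changed | id_changed

# seq2: number of run starts seen so far within the year
# (assumes rows of the same year are contiguous)
df["seq2"] = new_run.astype(int).groupby(df["year"]).cumsum()

print(df)
```

On the sample data this gives the same seq and seq2 columns as the desired output in the question. Everything here is built from shift, comparison and grouped cumulative operations, so no row-by-row iteration is involved.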