I want to group rows with the same data and add a number in another column that tells me how many times this unique combination has occurred so far. I'm new to pandas, and I was wondering if there is an easy way to calculate this.
As an example, I have these rows:
PID | fileName | line | revisit |
---|---|---|---|
P1 | f1.py | 10 | 1 |
P1 | f1.py | 10 | 1 |
P1 | f1.py | 10 | 1 |
P1 | f1.py | 10 | 1 |
P2 | f2.py | 2 | 1 |
P2 | f2.py | 2 | 1 |
P1 | f1.py | 10 | 2 |
P1 | f1.py | 10 | 2 |
P1 | f1.py | 10 | 2 |
P2 | f2.py | 2 | 2 |
P2 | f2.py | 2 | 2 |
P1 | f1.py | 10 | 3 |
P1 | f1.py | 10 | 3 |
The important information for me is in the PID, fileName, and line columns. Consecutive rows that have the same values in these columns should get the same number in revisit, and the next time that set of values appears again, revisit should be incremented.
I don't have a code example because I'm unsure how to do this using pandas functions.
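In case it helps, the sample above can be reproduced with something like this (revisit is the column I want pandas to fill in, so I leave it out):

import pandas as pd

# sample input; revisit is the column to be computed
df = pd.DataFrame({
    'PID':      ['P1', 'P1', 'P1', 'P1', 'P2', 'P2', 'P1', 'P1', 'P1', 'P2', 'P2', 'P1', 'P1'],
    'fileName': ['f1.py'] * 4 + ['f2.py'] * 2 + ['f1.py'] * 3 + ['f2.py'] * 2 + ['f1.py'] * 2,
    'line':     [10, 10, 10, 10, 2, 2, 10, 10, 10, 2, 2, 10, 10],
})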
CodePudding user response:
try this:
import pandas as pd

def func(g: pd.DataFrame) -> pd.Series:
    # a new visit starts whenever the group's row positions are not consecutive
    revisit = g.index.to_series().diff().ne(1).cumsum()
    return revisit

grouped = df.groupby(['PID', 'fileName'], as_index=False)
df['revisit'] = grouped.apply(func).droplevel(0)
print(df)
>>>
PID fileName line revisit
0 P1 f1.py 10 1
1 P1 f1.py 10 1
2 P1 f1.py 10 1
3 P1 f1.py 10 1
4 P2 f2.py 2 1
5 P2 f2.py 2 1
6 P1 f1.py 10 2
7 P1 f1.py 10 2
8 P1 f1.py 10 2
9 P2 f2.py 2 2
10 P2 f2.py 2 2
11 P1 f1.py 10 3
12 P1 f1.py 10 3
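If you prefer to avoid the droplevel step, the same idea should also work with group_keys=False, which keeps the original index on the result (assuming a pandas version where groupby accepts group_keys):

# same grouping idea; the result aligns with the original index directly
df['revisit'] = df.groupby(['PID', 'fileName'], group_keys=False).apply(func)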
CodePudding user response:
Let us do transform with factorize:
df['revisit'] = df.PID.ne(df.PID.shift()).cumsum().groupby(df.PID).transform(lambda x: x.factorize()[0] + 1)
df
Out[189]:
PID fileName line revisit
0 P1 f1.py 10 1
1 P1 f1.py 10 1
2 P1 f1.py 10 1
3 P1 f1.py 10 1
4 P2 f2.py 2 1
5 P2 f2.py 2 1
6 P1 f1.py 10 2
7 P1 f1.py 10 2
8 P1 f1.py 10 2
9 P2 f2.py 2 2
10 P2 f2.py 2 2
11 P1 f1.py 10 3
12 P1 f1.py 10 3
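To see what the one-liner is doing: the cumsum assigns a block id to every consecutive run of the same PID, and factorize then renumbers those blocks within each PID starting from 0. A quick look at the intermediate step:

blocks = df.PID.ne(df.PID.shift()).cumsum()
print(blocks.tolist())
# [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
# within P1 the distinct blocks are 1, 3, 5 -> factorized to 0, 1, 2 -> revisit 1, 2, 3
# within P2 the distinct blocks are 2, 4    -> factorized to 0, 1    -> revisit 1, 2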