I have a table like this:
PC | USER | Period |
---|---|---|
A | U1 | 2020/01 |
A | U2 | 2020/02 |
A | U3 | 2020/03 |
A | U4 | 2020/04 |
B | U1 | 2020/01 |
B | U2 | 2020/02 |
B | U2 | 2020/03 |
I need to create "count" features for a machine learning algorithm that capture how many times each "PC" appears and how many times each "USER" connected.
What is the best way to do that? Should I use one of the options below, or even a combination of them?
- For every row, show the max (total) number of connections:
PC | USER | Period | Max Con Per PC | Max Con Per USER |
---|---|---|---|---|
A | U1 | 2020/01 | 4 | 2 |
A | U2 | 2020/02 | 4 | 3 |
A | U3 | 2020/03 | 4 | 1 |
A | U4 | 2020/04 | 4 | 1 |
B | U1 | 2020/01 | 3 | 2 |
B | U2 | 2020/02 | 3 | 3 |
B | U2 | 2020/03 | 3 | 3 |
- For every row, show the running count of connections:
PC | USER | Period | Count Con Per PC | Count Con Per USER |
---|---|---|---|---|
A | U1 | 2020/01 | 1 | 1 |
A | U2 | 2020/02 | 2 | 1 |
A | U3 | 2020/03 | 3 | 1 |
A | U4 | 2020/04 | 4 | 1 |
B | U1 | 2020/01 | 1 | 2 |
B | U2 | 2020/02 | 2 | 2 |
B | U2 | 2020/03 | 3 | 3 |
CodePudding user response:
You can use groupby, then transform and cumcount:
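# Build the (lazy) groupby objects once and reuse them
gpc = df.groupby('PC')['PC']
gus = df.groupby('USER')['Period']
# transform('size') repeats the group total on every row of the group
df['Max Con Per PC'] = gpc.transform('size')
# cumcount() is zero-based, so add 1 to get a running count starting at 1
df['Count Con Per PC'] = gpc.cumcount() + 1
df['Max Con Per USER'] = gus.transform('size')
df['Count Con Per USER'] = gus.cumcount() + 1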
Output:
>>> df
  PC USER   Period  Max Con Per PC  Count Con Per PC  Max Con Per USER  Count Con Per USER
0  A   U1  2020/01               4                 1                 2                   1
1  A   U2  2020/02               4                 2                 3                   1
2  A   U3  2020/03               4                 3                 1                   1
3  A   U4  2020/04               4                 4                 1                   1
4  B   U1  2020/01               3                 1                 2                   2
5  B   U2  2020/02               3                 2                 3                   2
6  B   U2  2020/03               3                 3                 3                   3
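For the "combination" part of the question, one possible reading is to group on the PC/USER pair as well. The snippet below is only a sketch under that assumption: it rebuilds the sample table from the question so it runs on its own, recomputes the four columns above, and adds two pair-level columns. The names 'Max Con Per PC-USER' and 'Count Con Per PC-USER' are made up for illustration and are not from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'PC':     ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'USER':   ['U1', 'U2', 'U3', 'U4', 'U1', 'U2', 'U2'],
    'Period': ['2020/01', '2020/02', '2020/03', '2020/04',
               '2020/01', '2020/02', '2020/03'],
})

# Group totals, repeated on every row of the group
df['Max Con Per PC'] = df.groupby('PC')['PC'].transform('size')
df['Max Con Per USER'] = df.groupby('USER')['USER'].transform('size')

# Running counts (cumcount is zero-based, hence the + 1)
df['Count Con Per PC'] = df.groupby('PC').cumcount() + 1
df['Count Con Per USER'] = df.groupby('USER').cumcount() + 1

# Assumed "combination" features: the same two ideas per (PC, USER) pair.
# The column names here are hypothetical.
pair = df.groupby(['PC', 'USER'])
df['Max Con Per PC-USER'] = pair['Period'].transform('size')
df['Count Con Per PC-USER'] = pair.cumcount() + 1

print(df)
```

Note that cumcount numbers rows in their current order, so if the running count should follow time, sort the frame by Period before computing it.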