I have the following CSV data. In the columns, PPID is the parent process ID and PID is the process ID. I want to update my existing DataFrame by adding a new column called PPIDName that holds the name of the parent process rather than its ID. How can I go about doing this?
Here is an example:
The PID of services.exe is 768, and the PPID of svchost.exe is 768 (which is services.exe). I want to add a new column so that every row shows the actual name of the parent process rather than just its PPID.
"TreeDepth","PID","PPID","ImageFileName","Offset(V)","Threads","Handles","SessionId","Wow64","CreateTime","ExitTime"
1,768,632,"services.exe","0xac8190e52100",7,,0,False,"2021-04-01 05:05:01.000000 ",
2,1164,768,"svchost.exe","0xac8191053340",3,,0,False,"2021-04-01 05:05:02.000000 ",
"TreeDepth","PID","PPID","ImageFileName","Offset(V)","Threads","Handles","SessionId","Wow64","CreateTime","ExitTime"
0,4,0,"System","0xac818d45d080",158,,,False,"2021-04-01 05:04:58.000000 ",
1,88,4,"Registry","0xac818d5ab040",4,,,False,"2021-04-01 05:04:54.000000 ",
1,404,4,"smss.exe","0xac818dea7040",2,,,False,"2021-04-01 05:04:58.000000 ",
0,556,548,"csrss.exe","0xac81900e4140",10,,0,False,"2021-04-01 05:05:00.000000 ",
0,632,548,"wininit.exe","0xac81901ee080",1,,0,False,"2021-04-01 05:05:00.000000 ",
1,768,632,"services.exe","0xac8190e52100",7,,0,False,"2021-04-01 05:05:01.000000 ",
2,1152,768,"svchost.exe","0xac8191034300",2,,0,False,"2021-04-01 05:05:02.000000 ",
2,2560,768,"svchost.exe","0xac8191485080",6,,0,False,"2021-04-01 05:05:03.000000 ",
2,1668,768,"svchost.exe","0xac8191238080",6,,0,False,"2021-04-01 05:05:03.000000 ",
2,1924,768,"svchost.exe","0xac819132b340",6,,0,False,"2021-04-01 05:05:03.000000 ",
2,908,768,"svchost.exe","0xac8190076080",1,,0,False,"2021-04-01 05:05:01.000000 ",
2,1164,768,"svchost.exe","0xac8191053340",3,,0,False,"2021-04-01 05:05:02.000000 ",
2,2956,768,"svchost.exe","0xac81915d5080",3,,0,False,"2021-04-01 05:05:04.000000 ",
2,652,768,"svchost.exe","0xac8194af2080",11,,0,False,"2021-04-05 21:59:50.000000 ",
2,1680,768,"svchost.exe","0xac819123a700",9,,0,False,"2021-04-01 05:05:03.000000 ",
2,1172,768,"svchost.exe","0xac8191055380",4,,0,False,"2021-04-01 05:05:02.000000 ",
2,2964,768,"svchost.exe","0xac819163e080",7,,0,False,"2021-04-01 05:05:04.000000 ",
2,4500,768,"svchost.exe","0xac8192760080",4,,0,False,"2021-04-01 05:48:25.000000 ",
2,2196,768,"svchost.exe","0xac8191ff0080",4,,0,False,"2021-04-02 01:20:04.000000 ",
2,2456,768,"svchost.exe","0xac8191333080",6,,0,False,"2021-04-01 05:05:03.000000 ",
2,1688,768,"svchost.exe","0xac819267c2c0",7,,0,False,"2021-04-01 05:48:24.000000 ",
2,1180,768,"svchost.exe","0xac8191058700",4,,0,False,"2021-04-01 05:05:02.000000 ",
2,2588,768,"spoolsv.exe","0xac81914db0c0",15,,0,False,"2021-04-01 05:05:03.000000 ",
2,2716,768,"svchost.exe","0xac8192615340",4,,2,False,"2021-04-01 05:48:24.000000 ",
CodePudding user response:
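This does the job. Note that it overwrites PPID with a "PID_name" label rather than adding a separate column:
# Rows whose PID appears as some other row's PPID, indexed by PID
ppid_name = df.loc[df["PID"].isin(df["PPID"]), ["PID", "ImageFileName"]].set_index("PID", drop=False)
# Map each parent PID to a label such as "4_System"
replace_with = (ppid_name["PID"].astype(str) + "_" + ppid_name["ImageFileName"]).to_dict()
df["PPID"] = df["PPID"].replace(replace_with)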
Output -
 | TreeDepth | PID | PPID | ImageFileName | Offset(V) | Threads | Handles | SessionId | Wow64 | CreateTime | ExitTime |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 4 | 0 | System | 0xac818d45d080 | 158 | nan | nan | False | 2021-04-01 05:04:58.000000 | nan |
1 | 1 | 88 | 4_System | Registry | 0xac818d5ab040 | 4 | nan | nan | False | 2021-04-01 05:04:54.000000 | nan |
2 | 1 | 404 | 4_System | smss.exe | 0xac818dea7040 | 2 | nan | nan | False | 2021-04-01 05:04:58.000000 | nan |
3 | 0 | 556 | 548 | csrss.exe | 0xac81900e4140 | 10 | nan | 0.0 | False | 2021-04-01 05:05:00.000000 | nan |
4 | 0 | 632 | 548 | wininit.exe | 0xac81901ee080 | 1 | nan | 0.0 | False | 2021-04-01 05:05:00.000000 | nan |
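If you would rather keep PPID numeric and add a separate PPIDName column, as the question asks, one possible variation is to build a PID-to-name lookup and pass it to Series.map. This is only a sketch: the small DataFrame below is a hand-copied subset of the question's rows, and pid_to_name is a name introduced here for illustration.

```python
import pandas as pd

# Hand-copied subset of the question's data (relevant columns only).
df = pd.DataFrame(
    {
        "PID": [4, 88, 404, 556, 632, 768, 1152],
        "PPID": [0, 4, 4, 548, 548, 632, 768],
        "ImageFileName": [
            "System", "Registry", "smss.exe", "csrss.exe",
            "wininit.exe", "services.exe", "svchost.exe",
        ],
    }
)

# PID -> name lookup (hypothetical helper name). If a PID appears more
# than once, keep its first row so the lookup index stays unique.
pid_to_name = df.drop_duplicates(subset="PID").set_index("PID")["ImageFileName"]

# Look up each PPID; parents that are not present in the data become NaN.
df["PPIDName"] = df["PPID"].map(pid_to_name)
print(df)
```

Rows whose parent is not in the capture (PPID 0 and 548 here) end up with NaN in PPIDName.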
CodePudding user response:
I think I understand what you're after.
I've made a smaller df with only the relevant columns for my answer (so you can assume Another Col replaces all the other columns):
PID PPID ImageFileName Another Col
0 4 0 System 1
1 88 4 Registry 2
2 404 4 smss.exe 3
3 556 548 csrss.exe 4
4 632 548 wininit.exe 5
...
Firstly, I got all of the PIDs with their corresponding name, and removed any duplicates (if they exist):
df_PID = df[['PID', 'ImageFileName']].drop_duplicates()
PID ImageFileName
0 4 System
1 88 Registry
2 404 smss.exe
3 556 csrss.exe
4 632 wininit.exe
5 768 services.exe
6 1152 svchost.exe
...
I then renamed these columns to PPID and PPIDName, to make it easier to merge onto the original df to get the desired result. That and the merge are below:
df_PID.columns = ['PPID', 'PPIDName']
df = df.merge(df_PID, on='PPID', how='left')
This gives the below output, which I think is what you want:
PID PPID ImageFileName Another Col PPIDName
0 4 0 System 1 NaN
1 88 4 Registry 2 System
2 404 4 smss.exe 3 System
3 556 548 csrss.exe 4 NaN
4 632 548 wininit.exe 5 NaN
5 768 632 services.exe 6 wininit.exe
6 1152 768 svchost.exe 7 services.exe
7 2560 768 svchost.exe 8 services.exe
8 1668 768 svchost.exe 9 services.exe
9 1924 768 svchost.exe 10 services.exe
...
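For reference, the same steps can be put together into one runnable sketch that starts from the CSV text. A few assumptions on my part: csv_text holds only a handful of rows copied from the question, parents and result are names I made up, and I deduplicate on PID alone (subset="PID") so that a PID listed twice under different names cannot multiply rows in the merge. The validate argument is an optional safety check, not something the original steps require.

```python
import io

import pandas as pd

# A few rows copied from the question's CSV (assumption: the real file
# has the same header and layout).
csv_text = '''"TreeDepth","PID","PPID","ImageFileName","Offset(V)","Threads","Handles","SessionId","Wow64","CreateTime","ExitTime"
0,4,0,"System","0xac818d45d080",158,,,False,"2021-04-01 05:04:58.000000 ",
1,88,4,"Registry","0xac818d5ab040",4,,,False,"2021-04-01 05:04:54.000000 ",
0,632,548,"wininit.exe","0xac81901ee080",1,,0,False,"2021-04-01 05:05:00.000000 ",
1,768,632,"services.exe","0xac8190e52100",7,,0,False,"2021-04-01 05:05:01.000000 ",
2,1152,768,"svchost.exe","0xac8191034300",2,,0,False,"2021-04-01 05:05:02.000000 ",
'''
df = pd.read_csv(io.StringIO(csv_text))

# One row per PID, renamed so it lines up with the PPID column.
parents = (
    df[["PID", "ImageFileName"]]
    .drop_duplicates(subset="PID")
    .rename(columns={"PID": "PPID", "ImageFileName": "PPIDName"})
)

# Left merge keeps every original row; validate raises an error if the
# right-hand side ever has duplicate PPID keys.
result = df.merge(parents, on="PPID", how="left", validate="many_to_one")
print(result[["PID", "PPID", "ImageFileName", "PPIDName"]])
```

Because the merge is a left join, result has the same number of rows as df, and processes whose parent is missing from the data get NaN in PPIDName.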