I have a dataframe that has 4 columns. I have to perform few lookups and then assign the value in one of the column. Here is the sample of data:
CategoryId ParentCategoryId SourceCategoryId SourceParentCategoryId
1 100 0
2 101 0
3 9.0 102 108
4 20.0 103 100
5 4.0 104 103
6 105 103
7 106 103
8 107 103
9 108 0
10 109 108
11 110 103
12 111 103
13 112 103
14 113 100
15 114 113
16 115 113
17 116 113
18 117 113
19 118 113
20 100 113
I am trying to populate values in ParentCategoryId Column with next logic: For each row we will look up the value in SourceParentCategoryId and if it is zero we will skip it. If the value in SourceParentCategoryId different than zero as in row 3, we can see that SourceParentCategoryId is 108. Then I need to look up that value in SourceCategoryId and identify to which CategoryId it belongs. In the example I provided we can see that CategoryId has CategoryId 9 which is the value I need to plug into row 3 in my ParentCategoryId column.
With same logic, row 4 has SourceParentCategoryId = 100 which is based on SourceCategoryId has CategoryId = 20 and then value 20 is added in row 4 in ParentCategoryId column.
I have done python for several years and I did a bunch of research already on this, however, I still don't understand how to even start with solving this problem.
Can you please help me solve this or at least give me some hints?
CodePudding user response:
Assuming you don't already have data in column ParentCategoryId
, you can use following approach:
- Self left join the dataframe on
SourceParentCategoryId == SourceCategoryId
. - If there are multiple matches, then the self join may have repeated records. Limit this by retaining first record.
- Reset values for
SourceParentCategoryId == 0
.
df = df.merge(df, left_on="SourceParentCategoryId", right_on="SourceCategoryId", how="left", suffixes=("", "_delete"))
# If there are multiple matches, then the self join may have repeated records. Limit this by retaining first record.
df = df.groupby(["CategoryId", "SourceCategoryId", "SourceParentCategoryId"])["CategoryId_delete"].first().reset_index()
df = df.rename(columns={"CategoryId_delete": "ParentCategoryId"})
# Reset values for `SourceParentCategoryId == 0`.
df["ParentCategoryId"] = df.apply(lambda row: None if row["SourceParentCategoryId"] == 0 else row["ParentCategoryId"], axis=1)
Output:
CategoryId SourceCategoryId SourceParentCategoryId ParentCategoryId
0 1 100 0 NaN
1 2 101 0 NaN
2 3 102 108 9.00
3 4 103 100 1.00
4 5 104 103 4.00
5 6 105 103 4.00
6 7 106 103 4.00
7 8 107 103 4.00
8 9 108 0 NaN
9 10 109 108 9.00
10 11 110 103 4.00
11 12 111 103 4.00
12 13 112 103 4.00
13 14 113 100 1.00
14 15 114 113 14.00
15 16 115 113 14.00
16 17 116 113 14.00
17 18 117 113 14.00
18 19 118 113 14.00
19 20 100 113 14.00
CodePudding user response:
Use DataFrame.drop_duplicates
for remove duplicates by SourceCategoryId
, so possible mapping SourceParentCategoryId
filtered for no 0
values in SourceParentCategoryId
by Series.map
:
m = df['SourceParentCategoryId'].ne(0)
s = df.drop_duplicates('SourceCategoryId').set_index('SourceCategoryId')['CategoryId']
df['ParentCategoryId'] = df.loc[m, 'SourceParentCategoryId'].map(s)
print (df)
CategoryId ParentCategoryId SourceCategoryId SourceParentCategoryId
0 1 NaN 100 0
1 2 NaN 101 0
2 3 9.0 102 108
3 4 1.0 103 100
4 5 4.0 104 103
5 6 4.0 105 103
6 7 4.0 106 103
7 8 4.0 107 103
8 9 NaN 108 0
9 10 9.0 109 108
10 11 4.0 110 103
11 12 4.0 111 103
12 13 4.0 112 103
13 14 1.0 113 100
14 15 14.0 114 113
15 16 14.0 115 113
16 17 14.0 116 113
17 18 14.0 117 113
18 19 14.0 118 113
19 20 14.0 100 113
For keep last duplicate values use keep='last'
parameter:
m = df['SourceParentCategoryId'].ne(0)
s = df.drop_duplicates('SourceCategoryId', keep='last').set_index('SourceCategoryId')['CategoryId']
df['ParentCategoryId'] = df.loc[m, 'SourceParentCategoryId'].map(s)
print (df)
CategoryId ParentCategoryId SourceCategoryId SourceParentCategoryId
0 1 NaN 100 0
1 2 NaN 101 0
2 3 9.0 102 108
3 4 20.0 103 100
4 5 4.0 104 103
5 6 4.0 105 103
6 7 4.0 106 103
7 8 4.0 107 103
8 9 NaN 108 0
9 10 9.0 109 108
10 11 4.0 110 103
11 12 4.0 111 103
12 13 4.0 112 103
13 14 20.0 113 100
14 15 14.0 114 113
15 16 14.0 115 113
16 17 14.0 116 113
17 18 14.0 117 113
18 19 14.0 118 113
19 20 14.0 100 113