How to perform multiple lookups based on column values and populate empty column?-CodePudding

I have a dataframe that has 4 columns. I have to perform few lookups and then assign the value in one of the column. Here is the sample of data:

CategoryId  ParentCategoryId  SourceCategoryId  SourceParentCategoryId
         1                                 100                       0
         2                                 101                       0
         3               9.0               102                     108
         4              20.0               103                     100
         5               4.0               104                     103
         6                                 105                     103
         7                                 106                     103
         8                                 107                     103
         9                                 108                       0
        10                                 109                     108
        11                                 110                     103
        12                                 111                     103
        13                                 112                     103
        14                                 113                     100
        15                                 114                     113
        16                                 115                     113
        17                                 116                     113
        18                                 117                     113
        19                                 118                     113
        20                                 100                     113

I am trying to populate values in ParentCategoryId Column with next logic: For each row we will look up the value in SourceParentCategoryId and if it is zero we will skip it. If the value in SourceParentCategoryId different than zero as in row 3, we can see that SourceParentCategoryId is 108. Then I need to look up that value in SourceCategoryId and identify to which CategoryId it belongs. In the example I provided we can see that CategoryId has CategoryId 9 which is the value I need to plug into row 3 in my ParentCategoryId column.

With same logic, row 4 has SourceParentCategoryId = 100 which is based on SourceCategoryId has CategoryId = 20 and then value 20 is added in row 4 in ParentCategoryId column.

I have done python for several years and I did a bunch of research already on this, however, I still don't understand how to even start with solving this problem.

Can you please help me solve this or at least give me some hints?

CodePudding user response：

Assuming you don't already have data in column ParentCategoryId, you can use following approach:

Self left join the dataframe on SourceParentCategoryId == SourceCategoryId.
If there are multiple matches, then the self join may have repeated records. Limit this by retaining first record.
Reset values for SourceParentCategoryId == 0.

df = df.merge(df, left_on="SourceParentCategoryId", right_on="SourceCategoryId", how="left", suffixes=("", "_delete"))

# If there are multiple matches, then the self join may have repeated records. Limit this by retaining first record.
df = df.groupby(["CategoryId", "SourceCategoryId", "SourceParentCategoryId"])["CategoryId_delete"].first().reset_index()

df = df.rename(columns={"CategoryId_delete": "ParentCategoryId"})

# Reset values for `SourceParentCategoryId == 0`.
df["ParentCategoryId"] = df.apply(lambda row: None if row["SourceParentCategoryId"] == 0 else row["ParentCategoryId"], axis=1)

Output:

    CategoryId  SourceCategoryId  SourceParentCategoryId  ParentCategoryId
0            1               100                       0               NaN
1            2               101                       0               NaN
2            3               102                     108              9.00
3            4               103                     100              1.00
4            5               104                     103              4.00
5            6               105                     103              4.00
6            7               106                     103              4.00
7            8               107                     103              4.00
8            9               108                       0               NaN
9           10               109                     108              9.00
10          11               110                     103              4.00
11          12               111                     103              4.00
12          13               112                     103              4.00
13          14               113                     100              1.00
14          15               114                     113             14.00
15          16               115                     113             14.00
16          17               116                     113             14.00
17          18               117                     113             14.00
18          19               118                     113             14.00
19          20               100                     113             14.00

CodePudding user response：

Use DataFrame.drop_duplicates for remove duplicates by SourceCategoryId, so possible mapping SourceParentCategoryId filtered for no 0 values in SourceParentCategoryId by Series.map:

m = df['SourceParentCategoryId'].ne(0)

s = df.drop_duplicates('SourceCategoryId').set_index('SourceCategoryId')['CategoryId']
df['ParentCategoryId'] = df.loc[m, 'SourceParentCategoryId'].map(s)
print (df)
    CategoryId  ParentCategoryId  SourceCategoryId  SourceParentCategoryId
0            1               NaN               100                       0
1            2               NaN               101                       0
2            3               9.0               102                     108
3            4               1.0               103                     100
4            5               4.0               104                     103
5            6               4.0               105                     103
6            7               4.0               106                     103
7            8               4.0               107                     103
8            9               NaN               108                       0
9           10               9.0               109                     108
10          11               4.0               110                     103
11          12               4.0               111                     103
12          13               4.0               112                     103
13          14               1.0               113                     100
14          15              14.0               114                     113
15          16              14.0               115                     113
16          17              14.0               116                     113
17          18              14.0               117                     113
18          19              14.0               118                     113
19          20              14.0               100                     113

For keep last duplicate values use keep='last' parameter:

m = df['SourceParentCategoryId'].ne(0)

s = df.drop_duplicates('SourceCategoryId', keep='last').set_index('SourceCategoryId')['CategoryId']
df['ParentCategoryId'] = df.loc[m, 'SourceParentCategoryId'].map(s)
print (df)
    CategoryId  ParentCategoryId  SourceCategoryId  SourceParentCategoryId
0            1               NaN               100                       0
1            2               NaN               101                       0
2            3               9.0               102                     108
3            4              20.0               103                     100
4            5               4.0               104                     103
5            6               4.0               105                     103
6            7               4.0               106                     103
7            8               4.0               107                     103
8            9               NaN               108                       0
9           10               9.0               109                     108
10          11               4.0               110                     103
11          12               4.0               111                     103
12          13               4.0               112                     103
13          14              20.0               113                     100
14          15              14.0               114                     113
15          16              14.0               115                     113
16          17              14.0               116                     113
17          18              14.0               117                     113
18          19              14.0               118                     113
19          20              14.0               100                     113