Home > Software engineering >  How can I use pandas to iteratively search for the value of column A in column B until there is no f
How can I use pandas to iteratively search for the value of column A in column B until there is no f

Time:01-18

As an example, I have the following table:

Group Name Parent Group ID Group ID
All Foods 3
Italian Foods 3 1
Pasta 1 2
Spaghetti 2 4

The group IDs are assigned arbitrarily. The table represents a series of nested groups. You can see that 'Spaghetti' has a Parent Group ID of 2, which corresponds to the Group ID of 'Pasta'. In turn, 'Pasta' has a Parent Group ID of 1, indicating that its parent group is Italian Foods.

I would like to be able to iterate across each row, identify the Parent Group ID and then follow the Parent Group ID - Group ID chain to find the top-level Parent Group Name for that row's group. In the case of row 5, 'Spaghetti', I would like to find that 'All Foods' is its top-level Parent Group Name.

I have so far tried to user pandas' outer join function to no avail. Any advice would be really appreciated.

The flow I have envisioned is this:

  1. add a column to the right of the table
  2. iterate through each row
  3. find groupID value
  4. search for groupID value in parentID column
  5. repeat until parentID value is empty
  6. add final row's group name onto original row's appended column

CodePudding user response:

You can use a while..loop to iterate through each row in the dataframe and use the .loc method to access the specific rows based on the Group ID and Parent Group ID columns. Here is an example of how you can implement this:

# Create the dataframe
df = pd.DataFrame(
    {
        "Group Name": [
            "All Foods",
            "Italian Foods",
            "Pasta",
            "Spaghetti",
            "Sushi",
            "Seafood",
            "Japanese Cusine",
        ],
        "Parent Group ID": [None, 3, 1, 2, 6, 7, None],
        "Group ID": [3, 1, 2, 4, 5, 6, 7],
    }
)
df
Group Name Parent Group ID Group ID
0 All Foods NaN 3
1 Italian Foods 3 1
2 Pasta 1 2
3 Spaghetti 2 4
4 Sushi 6 5
5 fish 7 6
6 Japanese Cusine NaN 7
# Add a new column to the dataframe
df['Top-level Parent Group Name'] = ""

# Iterate through each row in the dataframe
for index, row in df.iterrows():
    parent_id = row['Parent Group ID']
    fill_parent = False
    parent_row = pd.Series()
    # Traceback parent group name until a parent ID equal to `None` or `""` 
    # (empty string) is found
    while parent_id != "" and not pd.isna(parent_id):
        parent_row = df.loc[df['Group ID'] == parent_id]
        parent_id = [*parent_row['Parent Group ID'].values, ""][0]
        fill_parent = True
    if not parent_row.empty and fill_parent:
        df.at[index, 'Top-level Parent Group Name'] =
        parent_row['Group Name'].values[0]

df

Outputs:

Group Name Parent Group ID Group ID Top-level Parent Group Name
0 All Foods NaN 3
1 Italian Foods 3 1 All Foods
2 Pasta 1 2 All Foods
3 Spaghetti 2 4 All Foods
4 Sushi 6 5 Japanese Cusine
5 fish 7 6 Japanese Cusine
6 Japanese Cusine NaN 7

Explanation

The code adds a new column to the dataframe called Top-level Parent Group Name and populates it with the corresponding top-level parent group name for each row. In the case of the row for 'Spaghetti', the Top-level Parent Group Name will be 'All Foods'.

Remarks

It's important to note that this code assumes that each Group Name value only has one direct parent group. If a Group Name, like 'Pasta' in one line has a direct parent group of 'Italian Foods' and on another line has a direct parent group of 'Sushi', then the code will not work as expected. Also, the code will handle NaN values in the Parent Group ID column, so it won't cause an error.

  • Related