Home > OS >  Emulating xlookup with Pandas or Numpy Python
Emulating xlookup with Pandas or Numpy Python

Time:10-13

I am pretty novel with programming and I've been stuck for a few days trying to figure out how to solve a problem. Let's say I have a Data Frame with the following info:

  • Employee ID
  • Manager ID
  • Boolean (This column would be the result of a pd.isin(manager_df) from another DataFrame.
  • Top Manager ID (This is the head of the business)

Top manager_df = This is a data frame with the Top manager's ID and the business they belong to and lead. This is a csv catalog that I want to keep updated so I do not need to change/fix lines of code with the managers.

What I am trying here is to get to figure out the top manager of every employee. There would be cases in which the Manager would be the Top and other cases in which there would be several managers in between both, the worker and the Top Manager.

For a few cases the Top Manager is the same as the managers I used the following code:


df_managers.loc[df_managers["Boolean"] == True, "Top Manager"] = df_managers["Worker's Manager WD ID"]

But it doesn´t work, I know if I pass it a string with the Employee ID of the Top manager that reports to it would work. But in this case, I´d like to know if there is a way of doing this way so if this catalog of manager_df is changed we do not need to fix the script. What I try to avoid is this:

# Having to change it manually. 

df_managers.loc[df_managers["Manager ID"] == "123", "Top Manager"] = "123"
df_managers.loc[df_managers["Manager ID"] == "1234567", "Top Manager"] = "1234567"

So this would be one of the questions I have, but as I said before there are many cases in which the direct manager is not the Top Manager, so I was wondering if there is a way of reaching this top manager through the managers of the managers iterating somehow and checking if the name is in the manager_df with a pd.isin(). But I don't really know if that is possible.

The way we were doing it before was using the xlookup method in Excel, so I looked for a way of emulating it in python. However, it doesn't work out that well and it only reaches the manager of the manager and not the top managers in the cases where there are several. And I need to create a copy of the main data frame to compare both.

# xlook up in python

def xlookup(lookup_value, lookup_array, return_array, if_not_found:str = ''):
    match_value = return_array.loc[lookup_array == lookup_value]
    if match_value.empty:
        return f'"{lookup_value}" not found!' if if_not_found == '' else if_not_found

    else:
        return match_value.tolist()[0]  

# This looks for the Manager ID in the Employee ID column of the other Dataframe to bring in the Manager ID to whom that employee reports.
    
copy_df["Top Manager"] = copy_df["Manager ID"].apply(xlookup, args = (main_df["Employee ID"],main_df["Manager ID"] ))                                                 

As I said, this doesn't look very practical and doesn't really work out since I will probably need a loop that I don't know how to build with this. I was wondering if there is a way with pandas loc or iloc, or even NumPy to solve this issue.

I leave here a toy df in case you want to give a try:

import pandas as pd

# In the case of the managers of the top managers, they report to the holding so their
managers would not be considered as our employees, since they are the head of the businesses.

df_top_managers = pd.DataFrame.from_dict(
    {
    'Employee ID': ['1007', '1008'], 
    'Manager ID': ['1009', '1009']

    }
)

# The employee '1009' doesn't belong to our company so we put it there instead of having a NaN. 

# The NaN in the top managers is what I want to fill up with the right ID of the top managers.

# The heads of the business are '1007' and'1008'.

df = pd.DataFrame.from_dict(
       {
        'Employee ID': ['1001', '1002', '1003', '1004','1005','1006','1007','1008'],
        'Manager ID': ['1005', '1003', '1008', '1006','1007','1008','1009','1009'],
        'Boolean': ['False', 'False', 'True', 'False','True','True','True','True'],
        'Top Manager': ['NaN', 'NaN', 'NaN', 'NaN','NaN','NaN','NaN','NaN']
    }
)

print(df)

My expected output would be something like this

# There might be several managers in between the Employee ID and the Top Manager, this is just a small example.

Employee ID Manager ID Boolean Top Manager (expected output)
     1001       1005     False         1007
     1002       1003     False         1008       
     1003       1008     True          1008       
     1004       1006     False         1007
     1005       1007     True          1007
     1006       1008     True          1007
     1007       1009     True          1009 (1007) 
     1008       1009     True          1009 (1008)

I am aware it is a bit confusing and I have tried to explain it in the best possible way, but if there is any doubt about it, please feel free to ask!

Thank you very much again,

CodePudding user response:

Using df_top_managers and df shared in the question, if one wants to populate the column Top Manager of df with the Manager ID in the dataframe df_top_managers for a given Employee ID, one can use .map as follows

df['Top Manager'] = df['Employee ID'].map(df_top_managers.set_index('Employee ID')['Manager ID'])

[Out]:

  Employee ID Manager ID Boolean Top Manager
0        1001       1005   False         NaN
1        1002       1003   False         NaN
2        1003       1008   False         NaN
3        1004       1007   False         NaN
4        1005       1007   False         NaN
5        1006       1008   False         NaN
6        1007       1009    True        1009
7        1008       1009    True        1009
  • Related