Trying to replace a pd.DataFrame columns by index, item in enumerate. I get different length-CodePudding

I am kind of stuck in a loop situiation related with dictionaries, panda´s series and DF and I would like to hear some advice on this.

The problem Im dealing with is the following:

I have a pd.DataFrame with a few column named :

df["country"] saying whether the person is form Spain or UK
df["lvl"] saying which employee level the person has within the company

Note that this levels for any reason Im not aware of dont match with the one I need to use, so I thought of using dictionaries / tuples to replace them.

So this would be the dictionaries, I used them to avoid raising any error with the keys and then convert them into tupples since since some of the values have the same key.

lvls_uk - with the lvls of UK
lvls_sp - with Spanish levels

So what Im aiming for is to replace based on the condition of where is the employee settled in ( UK or SP ) replace his/her level by the one in the dictionary. The problem is that when I run the loop it throws a bunch more of ocurrencies than the len of my column which is 513 (0 included, which makes it 514)

My code is the following:

# I need to go from 25 different levels to 7, the 0 is to avoid errors since a few doesnt have any type of level.

lvls_sp = {      "25":"1", "24" : "1", "23":"1",
                 "22":"2" , "21":"2",
                 "20":"3", "19":"3", "18":"3", "17":"3", 
                 "16":"4", "15":"4",  
                 "14":"5", "13":"5" , "12":"5",
                 "11":"6", "10":"6", "9":"6", 
                 "8":"7", "7":"7", "6":"7", "0":"0"
                }


lvls_uk = {   "25": "s8", 
              "24" : "s7h","23":"s7h",
              "22":"s7" , "21":"s7",
              "20":"s6","19":"s6", "18":"s6", 
              "17":"s5", 
              "15":"s4", "16": "s4", 
              "14":"s3", "13":"s3" , 
              "12":"s2h",
              "11":"s2", "10":"s2", "9":"s2", 
              "8": "s2l" , "7": "s2l", 
              "6": "s1", "0":"0"
             }

# I create the tupples:

tupla_sp = list()
tupla_uk = list()

for k,v in lvls_sp.items():
    tupla_sp.append((v, k))
    
for k,v in lvls_uk.items():
    tupla_uk.append((v, k))





# I tried different options but i finlly decided to go with by index and item

# Create a list to append them and convert it into a pd.series after
r_levels = []


for index, item in enumerate(df["country"]):
    if item == "SP":
        for k, v in tupla_sp:
            if df["lvl"][index] in v:
                r_levels .append(k)
    if item == "UK":
        for k, v in tupla_uk:
            if df["lvl"][index] in v:
                r_levels .append(k)

But whenever I run this I get a list with a length of 623.

Do you spot any mistake in the code? Im open to try different approaches. This is the first time Ive got to do this and Ive been stuck for a couple days.

Thank you so much!

CodePudding user response：

Perhaps you could do something like

df_spain = df[df["country"] == "SP"]
df_uk = df[df["country"] == "UK"]

and since you have the dictionaries already set up

df_spain = df_spain.replace(to_replace=lvls_sp)
df_spain = df_uk.replace(to_replace=lvls_uk)

df = pd.concat([df_spain, df_uk])

CodePudding user response：

This should do the trick, what it does is to take all employees from one country at a time and replace their level:

df.loc[df['country']=='UK', 'lvl'] = df.loc[df['country']=='UK', 'lvl'].replace(lvls_uk)
df.loc[df['country']=='SP', 'lvl'] = df.loc[df['country']=='SP', 'lvl'].replace(lvls_sp)

Also generally speaking, you should avoid using loops when working with DataFrames as they become very slow when working with larger amounts of data.

The problem why your code is not working as intended is that you are not comparing if the level from the DataFrame is equal to the level from the tuple, instead you only compare if it is in the level. This is problematic because the level "0" as a string is also part of level "20" and "10". This should repair your code, but as stated above is not advised to be used:

# Create a list to append them and convert it into a pd.series after
r_levels = []


for index, item in enumerate(df["country"]):
    if item == "SP":
        for k, v in tupla_sp:
            if df["lvl"][index] == v:
                r_levels .append(k)
    if item == "UK":
        for k, v in tupla_uk:
            if df["lvl"][index] == v:
                r_levels .append(k)