Create new column and add value depending value in a string-CodePudding

I am sorry if this is a duplicate question, I did hunt around a bit before I felt like I had to post a question.

I am trying to assign a value in a new column based on a value of another column. My dataframe looks a bit like this;

devicename           make     devicevalue
switch1               cisco        0
switch1-web100        netgear      0  
switch10              cisco        0
switch23              cisco        1
switch31-web200       netgear      0
switch31              cisco        1
switch41-new          cisco        1
switch40e             cisco        1
switch31-web200-new   netgear      0
switch40e             cisco        1
switch11-data100e     netgear      0

I am trying to add a value depending on these criteria;

If make == netgear (set to 0)
If the value after switch is 20 or greater (set to 1, otherwise set to 0)

I originally had some help getting this together however some devices nows now have a -new and por a or e which breaks the code that looking at a number at the end of the string

The code I am using is essentially;

def get_number_suffix(devicename: str) -> int:
    i = 1
    while i < len(devicename) and devicename[-i:].isnumeric():
        i  = 1

    return int(devicename[-(i-1):])


def compute_devicevalue(row) -> int:
    if 'netgear' in row['make']:
        return 0
    if 20 <= get_number_suffix(row['devicename']):
        return 1
    else:
        return 0

df['devicevalue'] = df.apply(compute_devicevalue, axis=1)

this worked fine before the new additions to the end of some of the naming, now it obviously breaks. I have tried all sorts of ways but I cant find a decent way that ignores -new and por a or e

edit

Sorry all, I completely messed up what I was trying to ask, I'm trying to do the value based on the value after switch.

Essentially using the existing code when it converts the string to an integer and does len it falls over on any name that has a -new and por a or e following it

as an example saying

ValueError: invalid literal for int() with base 10: 'switch23-new'

CodePudding user response：

You can use .loc and str.extract(), as follows:

df['devicevalue'] = 0     # init value to 0

# Set to 0 if `make` == 'netgear'
df.loc[df['make'] == 'netgear', 'devicevalue'] = 0 

# Set to 1 if the value after 'switch' >= 20. 
# Otherwise part is set during init to 0 at the first statement
df.loc[df['devicename'].str.extract(r'switch(\d )', expand=False).astype(float) >= 20, 'devicevalue'] = 1

Regex r'switch(\d )' works together with str.extract() to extract the digits after 'switch' no matter they are at the end or in the middle. Therefore, it solves your problem of having the digits previously at the end now at the middle.

Result:

print(df)

             devicename     make  devicevalue
0               switch1    cisco            0
1        switch1-web100  netgear            0
2              switch10    cisco            0
3              switch23    cisco            0
4       switch31-web200  netgear            1
5              switch31    cisco            0
6              switch40    cisco            0
7              switch23    cisco            0
8   switch31-web200-new  netgear            1
9      switch31-web100a    cisco            0
10             switch40    cisco            0
11    switch11-data100e    cisco            0

CodePudding user response：

I tried with regex to extract number from string, here for example.

For my simplicity I converted your dataframe to list

a = [{"devicename" : "switch1","make": "cisco", "devicevalue" :0}, {"devicename" : "switch1-web100", "make" : "netgear", "devicevalue" :0}, {"devicename" : "switch10" , "make" : "cisco", "devicevalue" :0}.... ]

Then I used this function to do it:

import re

def clean_data(data):
    for i in range(len(data)): #remove this if using dataframe row
        row = data[i] #Dict
        if row["make"] == "netgear":
            row["devicevalue"] = 0
        
        tmp = -1
        if "web" in row["devicename"]:
            tmp = [int(s) for s in re.findall(r'\d ', row["devicename"].split("web")[1])][0]
        elif "data" in row["devicename"]:
            tmp = [int(s) for s in re.findall(r'\d ', row["devicename"].split("data")[1])][0]

        if tmp >= 200:
            row["devicevalue"] = 0
        elif tmp == -1:
            pass #Nothing to change

        data[i] = row 
    return data #remove this and return row

I get the following

[{'devicename': 'switch1', 'make': 'cisco', 'devicevalue': 0}, {'devicename': 'switch1-web100', 'make': 'netgear', 'devicevalue': 0}, {'devicename': 'switch10', 'make': 'cisco', 'devicevalue': 0}, {'devicename': 'switch23', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch31-web200', 'make': 'netgear', 'devicevalue': 0}, {'devicename': 'switch31', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch40', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch23', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch31-web200-new', 'make': 'netgear', 'devicevalue': 0}, {'devicename': 'switch31-web100a', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch40', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch11-data100e', 'make': 'cisco', 'devicevalue': 1}]

Since you are sending rows of dataframe, remove the outer loop and return row instead of data in your code