I am sorry if this is a duplicate question; I did hunt around a bit before I felt I had to post.
I am trying to assign a value to a new column, devicevalue, based on the values of two other columns. My dataframe looks a bit like this:
devicename make devicevalue
switch1 cisco 0
switch1-web100 netgear 0
switch10 cisco 0
switch23 cisco 1
switch31-web200 netgear 0
switch31 cisco 1
switch41-new cisco 1
switch40e cisco 1
switch31-web200-new netgear 0
switch40e cisco 1
switch11-data100e netgear 0
I am trying to set the value depending on these criteria:
- If make == netgear, set to 0.
- If the value after switch is 20 or greater, set to 1; otherwise set to 0.
(If both conditions are met, set to 0, i.e. the condition "make == netgear, set to 0" takes precedence. Note that this differs from the existing code, where the second condition overrides (and overwrites the result value) if both conditions are met.)
I originally had some help getting this together. However, some devices now have a -new, p, a or e suffix, which breaks the code that looks for a number at the end of the string.
The code I am using is essentially:
def get_number_suffix(devicename: str) -> int:
    # Grow the suffix one character at a time while it is still all digits.
    i = 1
    while i < len(devicename) and devicename[-i:].isnumeric():
        i += 1
    return int(devicename[-(i-1):])

def compute_devicevalue(row) -> int:
    if 'netgear' in row['make']:
        return 0
    if 20 <= get_number_suffix(row['devicename']):
        return 1
    else:
        return 0

df['devicevalue'] = df.apply(compute_devicevalue, axis=1)
This worked fine before the new suffixes were added to the end of some of the names; now it obviously breaks.
I have tried all sorts of approaches, but I can't find a decent way that ignores -new, p, a or e.
Edit:
Sorry all, I completely messed up what I was trying to ask. I'm trying to set the value based on the number after 'switch'.
Essentially, when the existing code walks the string using len and converts the result to an integer, it falls over on any name that has -new, p, a or e following the number, for example:
ValueError: invalid literal for int() with base 10: 'switch23-new'
CodePudding user response:
You can use .loc and str.extract(), as follows:
df['devicevalue'] = 0 # init value to 0
# Set to 1 if the number after 'switch' is >= 20.
# All other rows keep the 0 assigned in the first statement.
df.loc[df['devicename'].str.extract(r'switch(\d+)', expand=False).astype(float) >= 20, 'devicevalue'] = 1
# Set to 0 if `make` == 'netgear'
df.loc[df['make'] == 'netgear', 'devicevalue'] = 0
# If you have 2 or more values of `make` to match, use, e.g.:
#df.loc[df['make'].isin(['netgear', 'dell']), 'devicevalue'] = 0
The regex r'switch(\d+)' works together with str.extract() to extract the digits after 'switch', whether they are at the end or in the middle of the name. This solves your problem of digits that used to be at the end now appearing in the middle.
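To make the extraction step easier to follow on its own, here is a minimal sketch, assuming a handful of names taken from the question plus one hypothetical name ('router7') that does not start with 'switch'. It shows what str.extract() returns and why the result is cast to float rather than int.

```python
import pandas as pd

# Sample names from the question; 'router7' is a hypothetical non-matching name.
names = pd.Series(["switch1-web100", "switch23", "switch41-new", "switch40e", "router7"])

# Capture the run of digits that directly follows the literal 'switch'.
# expand=False returns a Series instead of a one-column DataFrame.
digits = names.str.extract(r"switch(\d+)", expand=False)

# Names that do not match yield NaN, which cannot be cast to int,
# so the digits are converted to float instead.
numbers = digits.astype(float)

# NaN compares as False, so non-matching names are simply not selected.
mask = numbers >= 20
print(mask.tolist())
```

Only the digits immediately after 'switch' are captured, so the 100 in 'switch1-web100' is ignored.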
Result:
devicename make devicevalue
0 switch1 cisco 0
1 switch1-web100 netgear 0
2 switch10 cisco 0
3 switch23 cisco 1
4 switch31-web200 netgear 0
5 switch31 cisco 1
6 switch41-new cisco 1
7 switch40e cisco 1
8 switch31-web200-new netgear 0
9 switch40e cisco 1
10 switch11-data100e netgear 0
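If you prefer a single assignment over two .loc updates, the same rules can probably be written as one boolean expression. This is only a sketch under the assumption that the criteria are exactly the two listed in the question; the small dataframe below is a subset of the sample data.

```python
import pandas as pd

# A subset of the rows from the question.
df = pd.DataFrame({
    "devicename": ["switch1", "switch1-web100", "switch23",
                   "switch31-web200", "switch41-new", "switch40e"],
    "make": ["cisco", "netgear", "cisco", "netgear", "cisco", "cisco"],
})

# Number that directly follows 'switch' (NaN when there is none).
number = df["devicename"].str.extract(r"switch(\d+)", expand=False).astype(float)

# 1 only when the number is >= 20 AND the make is not netgear;
# the netgear rule therefore takes precedence by construction.
df["devicevalue"] = ((number >= 20) & (df["make"] != "netgear")).astype(int)
print(df)
```

Because both conditions are combined with &, the result does not depend on the order of the statements, unlike the two-step .loc version.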
CodePudding user response:
I used a regex to extract the number from the string.
For simplicity, I converted your dataframe to a list of dicts:
a = [{"devicename" : "switch1","make": "cisco", "devicevalue" :0}, {"devicename" : "switch1-web100", "make" : "netgear", "devicevalue" :0}, {"devicename" : "switch10" , "make" : "cisco", "devicevalue" :0}.... ]
Then I used this function to do it:
import re

def clean_data(data):
    for i in range(len(data)):  # remove this loop if using a dataframe row
        row = data[i]  # dict
        if row["make"] == "netgear":
            row["devicevalue"] = 0
        tmp = -1  # -1 means no number was found
        if "web" in row["devicename"]:
            # First run of digits after "web"
            tmp = [int(s) for s in re.findall(r'\d+', row["devicename"].split("web")[1])][0]
        elif "data" in row["devicename"]:
            # First run of digits after "data"
            tmp = [int(s) for s in re.findall(r'\d+', row["devicename"].split("data")[1])][0]
        if tmp >= 200:
            row["devicevalue"] = 0
        elif tmp == -1:
            pass  # nothing to change
        data[i] = row
    return data  # remove this and return row
I get the following
[{'devicename': 'switch1', 'make': 'cisco', 'devicevalue': 0}, {'devicename': 'switch1-web100', 'make': 'netgear', 'devicevalue': 0}, {'devicename': 'switch10', 'make': 'cisco', 'devicevalue': 0}, {'devicename': 'switch23', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch31-web200', 'make': 'netgear', 'devicevalue': 0}, {'devicename': 'switch31', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch40', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch23', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch31-web200-new', 'make': 'netgear', 'devicevalue': 0}, {'devicename': 'switch31-web100a', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch40', 'make': 'cisco', 'devicevalue': 1}, {'devicename': 'switch11-data100e', 'make': 'cisco', 'devicevalue': 1}]
Since you are passing dataframe rows, remove the outer loop and return row instead of data in your code.
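As a rough illustration of the row-wise form, here is a sketch of a compute_devicevalue replacement that uses re.search. Note that it follows the criteria from the edited question (the number right after 'switch'), not the web/data logic above, and that returning 0 for names without such a number is an assumption on my part.

```python
import re

def compute_devicevalue(row) -> int:
    # The netgear rule takes precedence over the number check.
    if row["make"] == "netgear":
        return 0
    # Digits directly after 'switch'; trailing '-new', 'p', 'a' or 'e' is ignored.
    match = re.search(r"switch(\d+)", row["devicename"])
    if match is None:
        return 0  # assumption: no number after 'switch' means 0
    return 1 if int(match.group(1)) >= 20 else 0

# Plain dicts stand in for dataframe rows here.
rows = [
    {"devicename": "switch23", "make": "cisco"},
    {"devicename": "switch41-new", "make": "cisco"},
    {"devicename": "switch31-web200-new", "make": "netgear"},
    {"devicename": "switch10", "make": "cisco"},
]
values = [compute_devicevalue(r) for r in rows]
print(values)
```

Since the function only indexes row by column name, it should also work with the df.apply(compute_devicevalue, axis=1) call from the question.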