I'm new to pandas and need help setting values per row, rather than for the whole column, based on a condition.
I have a DataFrame that contains these columns:
Repository | Age |
---|---|
DMZ Linux | 65 days |
Linux | 3 days |
Windows | 95 days |
The conditions are:
if 'DMZ' in Repository and Age > 60: True
if 'DMZ' in Repository and Age < 60: False
if 'DMZ' not in Repository and Age > 90: True
if 'DMZ' not in Repository and Age < 90: False
I need an additional column named Outstanding that holds the string 'True' or 'False' depending on the conditions above.
My only problem is that, whichever index the for loop is on, the assignment applies to every row (the whole column) instead of setting a different value per row.
It should look like this:
Repository | Age | Outstanding |
---|---|---|
DMZ Linux | 65 days | True |
Linux | 3 days | False |
Windows | 95 days | True |
But instead it looks like this:
Repository | Age | Outstanding |
---|---|---|
DMZ Linux | 65 days | True |
Linux | 3 days | True |
Windows | 95 days | True |
Since the last index is True, it applied True to the whole column.
I have this code snippet:
for i in range(len(report_data)):
    line = report_data.loc[i]
    if str(line['Age'] != ''):
        new_val = str(line['Age']).replace('days', '')
        no_space = new_val.replace('', '')
        int_val = int(no_space)
        if int_val > 60 and 'DMZ' in line['Repository']:
            report_data['Outstanding']: line['Outstanding'] = 'True'
        elif int_val > 90 and 'DMZ' not in line['Repository']:
            report_data['Outstanding']: line['Outstanding'] = 'True'
        else:
            report_data['Outstanding']: line['Outstanding'] = 'False'
I tried a lambda function but got stuck because I have two if conditions. Any clue how I should assign a value per row instead of to the whole column?
Thank you for helping!
CodePudding user response:
You could use pandas.DataFrame.loc and pandas.Series.str.contains, together with the invert operator ~:
import pandas as pd
data = {'Repository': ['DMZ Linux', 'Linux', 'Windows'],
        'Age': ['65 days', '3 days', '95 days']}
df = pd.DataFrame(data)
# Create a new column with the age as an int
df['int_Age']=df['Age'].str.split(' ').str[0].astype(int)
# Create an empty column
df['Outstanding'] = ""
# Implement your conditions to fill the Outstanding column
# (an age of exactly 60 or 90 counts as not outstanding, matching the else branch of the loop in the question)
df.loc[(df['Repository'].str.contains('DMZ')) & (df['int_Age']>60), 'Outstanding'] = 'True'
df.loc[(df['Repository'].str.contains('DMZ')) & (df['int_Age']<=60), 'Outstanding'] = 'False'
df.loc[(~df['Repository'].str.contains('DMZ')) & (df['int_Age']>90), 'Outstanding'] = 'True'
df.loc[(~df['Repository'].str.contains('DMZ')) & (df['int_Age']<=90), 'Outstanding'] = 'False'
df
gives the expected output:
  Repository      Age  int_Age Outstanding
0  DMZ Linux  65 days       65        True
1      Linux   3 days        3       False
2    Windows  95 days       95        True
You can drop the two 'False' conditions if you create the 'Outstanding' column with 'False' as the default value.
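A minimal sketch of that default-value idea, reusing the sample data from this answer. The names age_days and is_dmz are only illustrative helpers, not part of the original code:

```python
import pandas as pd

df = pd.DataFrame({
    'Repository': ['DMZ Linux', 'Linux', 'Windows'],
    'Age': ['65 days', '3 days', '95 days'],
})

# Illustrative helpers: the numeric age and a boolean mask for DMZ repositories
age_days = df['Age'].str.split(' ').str[0].astype(int)
is_dmz = df['Repository'].str.contains('DMZ')

# Start with 'False' everywhere, then overwrite only the rows that qualify
df['Outstanding'] = 'False'
df.loc[is_dmz & (age_days > 60), 'Outstanding'] = 'True'
df.loc[~is_dmz & (age_days > 90), 'Outstanding'] = 'True'

print(df)
```

With this layout, rows that match neither condition (including ages of exactly 60 or 90) simply keep the default 'False'.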
CodePudding user response:
Here is another way to achieve your goal.
- use the str.extract() function to extract the integer from the Age column
- use simplified conditions for the Outstanding column
import pandas as pd
# Create dataframe
df2 = pd.DataFrame({'Repository': ['DMZ Linux', 'Linux', 'Windows'],
                    'Age': ['65 days', '3 days', '95 days']})
# Convert age column to integer for calculation
# (\d+ requires at least one digit; expand=False returns a Series)
df2['Age_val'] = df2['Age'].str.extract(r'(\d+)', expand=False).astype(int)
# Identify outstanding conditions (same case-sensitive 'DMZ' match in both branches)
is_dmz = df2['Repository'].str.contains('DMZ')
df2['Outstanding'] = (is_dmz & (df2['Age_val'] > 60)) | (~is_dmz & (df2['Age_val'] > 90))
# Remove Age_val column - not required anymore
df2 = df2.drop('Age_val', axis=1)
print(df2)
OUTPUT:
  Repository      Age  Outstanding
0  DMZ Linux  65 days         True
1      Linux   3 days        False
2    Windows  95 days         True
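Coming back to the loop in the question: as far as I can tell, a line such as report_data['Outstanding']: line['Outstanding'] = 'True' is read by Python as an annotated assignment, so it effectively runs report_data['Outstanding'] = 'True' and overwrites the whole column on every iteration. If you do want to keep a loop, here is a rough sketch that writes one cell at a time with .at. The sample frame and the names age_text, age_days and limit are assumptions for illustration only:

```python
import pandas as pd

# Sample data standing in for the real report_data
report_data = pd.DataFrame({
    'Repository': ['DMZ Linux', 'Linux', 'Windows'],
    'Age': ['65 days', '3 days', '95 days'],
})

# Create the column once, with the default value
report_data['Outstanding'] = 'False'

for i in report_data.index:
    age_text = str(report_data.at[i, 'Age']).replace('days', '').strip()
    if age_text == '':
        continue  # no age given: keep the default 'False'
    age_days = int(age_text)
    # DMZ repositories use the 60-day limit, everything else the 90-day limit
    limit = 60 if 'DMZ' in report_data.at[i, 'Repository'] else 90
    if age_days > limit:
        # .at[row, column] sets a single cell, not the whole column
        report_data.at[i, 'Outstanding'] = 'True'

print(report_data)
```

The vectorized answers above are usually the better choice on large frames; the loop is only meant to show where the original assignment went wrong.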