Manipulate Data per row pandas

Time:11-05

I'm new to pandas and need help manipulating data per row and not the whole column based on a condition.

I have a DF that contains these columns:

Repository   Age
DMZ Linux    65 days
Linux        3 days
Windows      95 days

The conditions are:

  • if 'DMZ' is in Repository and Age > 60: True

  • if 'DMZ' is in Repository and Age < 60: False

  • if 'DMZ' is not in Repository and Age > 90: True

  • if 'DMZ' is not in Repository and Age < 90: False

I need an additional column named Outstanding that holds the string 'True' or 'False' depending on the conditions above.

My only problem is that whatever value the for loop computes for the current index gets applied to every row (the whole column) instead of a different value per row.

It should look like this:

Repository   Age       Outstanding
DMZ Linux    65 days   True
Linux        3 days    False
Windows      95 days   True

But instead it looks like this:

Repository   Age       Outstanding
DMZ Linux    65 days   True
Linux        3 days    True
Windows      95 days   True

Since the last index is True, it applied True to the whole column.

I have this code snippet:

for i in range(len(report_data)):
    line = report_data.loc[i]
    if str(line['Age'] != ''):
        new_val = str(line['Age']).replace('days', '')
        no_space = new_val.replace('', '')
        int_val = int(no_space)
        
        if int_val > 60 and 'DMZ' in line['Repository']:
            report_data['Outstanding']: line['Outstanding'] = 'True'
        elif int_val > 90 and 'DMZ' not in line['Repository']:
            report_data['Outstanding']: line['Outstanding'] = 'True'
        else:
            report_data['Outstanding']: line['Outstanding'] = 'False'

I tried the lambda function but I can't proceed since I have 2 IFs. Any clue on how I should properly assign per row and not the whole column?

Thank you for helping!

CodePudding user response:

You could use DataFrame.loc together with pandas.Series.str.contains, and the ~ operator to invert the condition:

import pandas as pd
data = {'Repository':['DMZ Linux','Linux','Windows'],
        'Age':['65 days','3 days','95 days']
    }
df = pd.DataFrame(data)

# Create a new column with the age as an int
df['int_Age']=df['Age'].str.split(' ').str[0].astype(int)

# Create an empty column
df['Outstanding'] = ""

# Implement your conditions to fill the Outstanding column
df.loc[(df['Repository'].str.contains('DMZ')) & (df['int_Age']>60), 'Outstanding'] = 'True'
df.loc[(df['Repository'].str.contains('DMZ')) & (df['int_Age']<60), 'Outstanding'] = 'False'
df.loc[(~df['Repository'].str.contains('DMZ')) & (df['int_Age']>90), 'Outstanding'] = 'True'
df.loc[(~df['Repository'].str.contains('DMZ')) & (df['int_Age']<90), 'Outstanding'] = 'False'
df

gives the expected output:

  Repository      Age  int_Age Outstanding
0  DMZ Linux  65 days       65        True
1      Linux   3 days        3       False
2    Windows  95 days       95        True

You need fewer conditions if you create the 'Outstanding' column with 'False' by default, since only the 'True' rows then have to be assigned.
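A minimal sketch of that default-value variant, reusing the sample data from this answer. The `is_dmz` name is introduced here only for illustration. One side effect worth noting: an age of exactly 60 or 90 is not covered by the four conditions above and would stay empty, whereas here it ends up as 'False'.

```python
import pandas as pd

df = pd.DataFrame({
    'Repository': ['DMZ Linux', 'Linux', 'Windows'],
    'Age': ['65 days', '3 days', '95 days'],
})

# Parse the leading number out of strings such as '65 days'
df['int_Age'] = df['Age'].str.split(' ').str[0].astype(int)

# Start with 'False' everywhere, then overwrite only the rows that qualify
df['Outstanding'] = 'False'
is_dmz = df['Repository'].str.contains('DMZ')  # hypothetical helper name
df.loc[is_dmz & (df['int_Age'] > 60), 'Outstanding'] = 'True'
df.loc[~is_dmz & (df['int_Age'] > 90), 'Outstanding'] = 'True'

print(df)
```

Computing the mask once also avoids repeating the str.contains call in every condition.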

CodePudding user response:

Here is another way to achieve your goal.

  • use str.extract() function to extract the integer from the Age column
  • use simplified conditions for Outstanding column

import pandas as pd

# Create dataframe
df2 = pd.DataFrame({'Repository':['DMZ Linux','Linux','Windows'],
                    'Age':['65 days','3 days','95 days']})

# Convert age column to integer for calculation
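# \d+ requires at least one digit; expand=False returns a Series rather than a DataFrame
df2['Age_val'] = df2['Age'].str.extract(r'(\d+)', expand=False).astype(int)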

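# Identify outstanding conditions (same case-sensitive 'DMZ' test for both branches)
# The result is a boolean column; append .astype(str) if the strings 'True'/'False' are needed
is_dmz = df2['Repository'].str.contains('DMZ')
df2['Outstanding'] = (is_dmz & (df2['Age_val'] > 60)) | (~is_dmz & (df2['Age_val'] > 90))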

# Remove Age_val column - not required anymore
df2 = df2.drop('Age_val', axis=1)

print(df2)

OUTPUT:

  Repository      Age  Outstanding
0  DMZ Linux  65 days         True
1      Linux   3 days        False
2    Windows  95 days         True
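
As for why the loop in the question fills the whole column: a statement of the form `report_data['Outstanding']: line['Outstanding'] = 'True'` appears to be parsed by Python as an annotated assignment, so it is effectively `report_data['Outstanding'] = 'True'`, which sets every row on each iteration and leaves the last row's result in place. If a loop is still preferred, the following is a rough sketch of a per-row version using DataFrame.at. It assumes every Age value looks like 'N days' or is empty; the `limit` and `age_text` names are introduced here for illustration.

```python
import pandas as pd

report_data = pd.DataFrame({
    'Repository': ['DMZ Linux', 'Linux', 'Windows'],
    'Age': ['65 days', '3 days', '95 days'],
})

# Default every row to 'False', then flip individual rows inside the loop
report_data['Outstanding'] = 'False'

for i in report_data.index:
    # Strip the unit and surrounding whitespace, e.g. '65 days' -> '65'
    age_text = str(report_data.at[i, 'Age']).replace('days', '').strip()
    if not age_text:
        continue  # leave rows without an age as 'False'
    age = int(age_text)

    # DMZ repositories use the 60-day threshold, all others 90 days
    limit = 60 if 'DMZ' in report_data.at[i, 'Repository'] else 90
    if age > limit:
        # .at[row, column] writes a single cell, not the whole column
        report_data.at[i, 'Outstanding'] = 'True'

print(report_data)
```

The vectorized versions above are usually the better choice on larger frames; the loop is shown only to illustrate single-cell assignment.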