Home > database >  How to modify the Dataframe
How to modify the Dataframe

Time:10-19

I have dataframe like this

Keyword                   devops       aws       ec2   python     java    dotnet
optional/mandatory        optional mandatory mandatory optional optional mandatory
Candidate Name                                                         
User1                       1          1          1        1        1         1
User2                       1          0          0        1        1         1
...                        ...        ...        ...      ...      ...       ...

User67                      1          1          1        0        0         0  

I want to calculate percentage match for each Candidate Name and add new Percentage match column also the row optional/mandatory play important role in calculating percentage. Like the keyword aws ec2 dotnet are mandatory if there count is 0 then set the percentage match for that Candidate to 0%. Means all mandatory keywords must have count 1 then only calculate the percentage of that cancidate.

Expected Output:

Keyword                   devops       aws       ec2   python     java    dotnet    Percentage
optional/mandatory        optional mandatory mandatory optional optional mandatory    Match
Candidate Name                                                         
User1                       1          1          1        1        1         1        100%
User2                       1          1          1        0        0         1        66.66%
...                        ...        ...        ...      ...      ...       ...

User67                      1          1          1        0        0         0         0%

for calculate percentage:

User1 is 100% as all mandatory have count 1

User2 is 66.66% as two optional have 0 count then Total Number of ons's/Total Keywords*100 (4/6)100

User3 is 0% as out of 3 mandatory one have zero count to directly the percentage is set to 0% irrespective of the optional keyword count

Keyword row and Candidate Name column keeps changing there can be more keywords and Candidate names in future. How to Implement the desire output.

My Code:

import numpy as np
import pandas as pd
read_csv = pd.read_csv('Sample.csv')
df = pd.DataFrame(read_csv)
df["Count"] = np.where(df["Count"] > 0, 1, 0)
print(df)
df = df.pivot_table(index="Candidate Name", columns=["optional/mandatory","Keyword"], values="Count")
print(df)

Output1:

Candidate Name    optional/mandatory Keyword  Count
      User1       optional           devops      1
      User1      mandatory              aws      0
      User1      mandatory              ec2      1
      User1       optional           python      1
      User1       optional             java      1
      User1      mandatory           dotnet      0
      User2       optional           devops      1
      User2      mandatory              aws      1
      User2      mandatory              ec2      0
      User3       optional           devops      1
      User3      mandatory              ec2      1
      User3      mandatory              aws      0
      User3       optional             java      1

optional/mandatory        mandatory             optional            
Keyword              aws dotnet  ec2   devops java python
Candidate Name                                           
    User1            0.0    0.0  1.0      1.0  1.0    1.0
    User2            1.0    NaN  0.0      1.0  NaN    NaN
    User3            0.0    NaN  1.0      1.0  1.0    NaN

Got the desired output by trying :

import pandas as pd

cols = pd.MultiIndex.from_arrays([
    ['optional', 'mandatory', 'mandatory', 'mandatory', 'optional', 'optional'],
    ['devops', 'aws', 'ec2', 'dotnet', 'python', 'java']
])

data = [
    ['User 1', 1,1,1,1,1,1],
    ['User 2',1,0,0,1,1,1],
    # ...,
    ['User N',0,1,1,1,0,0]
]

df = pd.DataFrame(
    columns=cols,
    data=[applicant[1:] for applicant in data],
    index=pd.Index([applicant[0] for applicant in data], name='Candidate Name')
)

df.sort_index(axis='columns', inplace=True)
And then, we can use apply to run a function against each of the rows:


def calculate_skill_coverage_percent(r):
    if sum(r['mandatory']) < len(r['mandatory']):
        return 0
    else:
        return sum(r) / len(r)
df['Percent_Match'] = df.apply(calculate_skill_coverage_percent, axis='columns')

print(df)

Output:

                 mandatory            optional             Percent_Match
                 aws dotnet ec2   devops java python              
Candidate Name                                                        
User 1                 1      1   1        1    1      1           1.0
User 2                 0      1   0        1    1      1           0.0
...
User N                 1      1   1        0    0      0           0.5

I am little confused because, I have hardcoded values inside pd.MultiIndex.from_arrays and data. How to get these values using the above Dataframe in Output1 or any other suggestion would work

cols = pd.MultiIndex.from_arrays([
    ['optional', 'mandatory', 'mandatory', 'mandatory', 'optional', 'optional'],
    ['devops', 'aws', 'ec2', 'dotnet', 'python', 'java']
])

data = [
    ['User 1', 1,1,1,1,1,1],
    ['User 2',1,0,0,1,1,1],
    # ...,
    ['User N',0,1,1,1,0,0]
]

CodePudding user response:

Instead calculate_skill_coverage_percent use mean by all rows - it is sum divided by length with set 0 if mean of mandatory is less like 1 by Series.mask:

s = df.xs('mandatory', level=0, axis=1).mean(axis=1)
df['Percent_Match'] = df.mean(axis=1).mask(s.lt(1), 0)

print(df)
               mandatory            optional             Percent_Match
                     aws dotnet ec2   devops java python              
Candidate Name                                                        
User 1                 1      1   1        1    1      1           1.0
User 2                 0      1   0        1    1      1           0.0
User N                 1      1   1        0    0      0           0.5

EDIT:

print (df)
   Candidate Name optional/mandatory Keyword  Count
0           User1           optional  devops      1
1           User1          mandatory     aws      0
2           User1          mandatory     ec2      1
3           User1           optional  python      1
4           User1           optional    java      1
5           User1          mandatory  dotnet      0
6           User2           optional  devops      1
7           User2          mandatory     aws      1
8           User2          mandatory     ec2      0
9           User3           optional  devops      1
10          User3          mandatory     ec2      1
11          User3          mandatory     aws      0
12          User3           optional    java      1

Use DataFrame.pivot for reshape:

df = (df.pivot('Candidate Name', ['optional/mandatory','Keyword'], 'Count')
        .fillna(0)
        .astype(int)
        .sort_index(axis='columns'))
print (df)
optional/mandatory mandatory            optional            
Keyword                  aws dotnet ec2   devops java python
Candidate Name                                              
User1                      0      0   1        1    1      1
User2                      1      0   0        1    0      0
User3                      0      0   1        1    1      0
  • Related