I have dataframe like this
Keyword devops aws ec2 python java dotnet
optional/mandatory optional mandatory mandatory optional optional mandatory
Candidate Name
User1 1 1 1 1 1 1
User2 1 0 0 1 1 1
... ... ... ... ... ... ...
User67 1 1 1 0 0 0
I want to calculate percentage match for each Candidate Name and add new Percentage match column also the row optional/mandatory play important role in calculating percentage. Like the keyword aws ec2 dotnet are mandatory if there count is 0 then set the percentage match for that Candidate to 0%. Means all mandatory keywords must have count 1 then only calculate the percentage of that cancidate.
Expected Output:
Keyword devops aws ec2 python java dotnet Percentage
optional/mandatory optional mandatory mandatory optional optional mandatory Match
Candidate Name
User1 1 1 1 1 1 1 100%
User2 1 1 1 0 0 1 66.66%
... ... ... ... ... ... ...
User67 1 1 1 0 0 0 0%
for calculate percentage:
User1 is 100% as all mandatory have count 1
User2 is 66.66% as two optional have 0 count then Total Number of ons's/Total Keywords*100
(4/6)100
User3 is 0% as out of 3 mandatory one have zero count to directly the percentage is set to 0% irrespective of the optional keyword count
Keyword row and Candidate Name column keeps changing there can be more keywords and Candidate names in future. How to Implement the desire output.
My Code:
import numpy as np
import pandas as pd
read_csv = pd.read_csv('Sample.csv')
df = pd.DataFrame(read_csv)
df["Count"] = np.where(df["Count"] > 0, 1, 0)
print(df)
df = df.pivot_table(index="Candidate Name", columns=["optional/mandatory","Keyword"], values="Count")
print(df)
Output1:
Candidate Name optional/mandatory Keyword Count
User1 optional devops 1
User1 mandatory aws 0
User1 mandatory ec2 1
User1 optional python 1
User1 optional java 1
User1 mandatory dotnet 0
User2 optional devops 1
User2 mandatory aws 1
User2 mandatory ec2 0
User3 optional devops 1
User3 mandatory ec2 1
User3 mandatory aws 0
User3 optional java 1
optional/mandatory mandatory optional
Keyword aws dotnet ec2 devops java python
Candidate Name
User1 0.0 0.0 1.0 1.0 1.0 1.0
User2 1.0 NaN 0.0 1.0 NaN NaN
User3 0.0 NaN 1.0 1.0 1.0 NaN
Got the desired output by trying :
import pandas as pd
cols = pd.MultiIndex.from_arrays([
['optional', 'mandatory', 'mandatory', 'mandatory', 'optional', 'optional'],
['devops', 'aws', 'ec2', 'dotnet', 'python', 'java']
])
data = [
['User 1', 1,1,1,1,1,1],
['User 2',1,0,0,1,1,1],
# ...,
['User N',0,1,1,1,0,0]
]
df = pd.DataFrame(
columns=cols,
data=[applicant[1:] for applicant in data],
index=pd.Index([applicant[0] for applicant in data], name='Candidate Name')
)
df.sort_index(axis='columns', inplace=True)
And then, we can use apply to run a function against each of the rows:
def calculate_skill_coverage_percent(r):
if sum(r['mandatory']) < len(r['mandatory']):
return 0
else:
return sum(r) / len(r)
df['Percent_Match'] = df.apply(calculate_skill_coverage_percent, axis='columns')
print(df)
Output:
mandatory optional Percent_Match
aws dotnet ec2 devops java python
Candidate Name
User 1 1 1 1 1 1 1 1.0
User 2 0 1 0 1 1 1 0.0
...
User N 1 1 1 0 0 0 0.5
I am little confused because, I have hardcoded values inside pd.MultiIndex.from_arrays and data. How to get these values using the above Dataframe in Output1 or any other suggestion would work
cols = pd.MultiIndex.from_arrays([
['optional', 'mandatory', 'mandatory', 'mandatory', 'optional', 'optional'],
['devops', 'aws', 'ec2', 'dotnet', 'python', 'java']
])
data = [
['User 1', 1,1,1,1,1,1],
['User 2',1,0,0,1,1,1],
# ...,
['User N',0,1,1,1,0,0]
]
CodePudding user response:
Instead calculate_skill_coverage_percent
use mean
by all rows - it is sum divided by length with set 0
if mean of mandatory
is less like 1
by Series.mask
:
s = df.xs('mandatory', level=0, axis=1).mean(axis=1)
df['Percent_Match'] = df.mean(axis=1).mask(s.lt(1), 0)
print(df)
mandatory optional Percent_Match
aws dotnet ec2 devops java python
Candidate Name
User 1 1 1 1 1 1 1 1.0
User 2 0 1 0 1 1 1 0.0
User N 1 1 1 0 0 0 0.5
EDIT:
print (df)
Candidate Name optional/mandatory Keyword Count
0 User1 optional devops 1
1 User1 mandatory aws 0
2 User1 mandatory ec2 1
3 User1 optional python 1
4 User1 optional java 1
5 User1 mandatory dotnet 0
6 User2 optional devops 1
7 User2 mandatory aws 1
8 User2 mandatory ec2 0
9 User3 optional devops 1
10 User3 mandatory ec2 1
11 User3 mandatory aws 0
12 User3 optional java 1
Use DataFrame.pivot
for reshape:
df = (df.pivot('Candidate Name', ['optional/mandatory','Keyword'], 'Count')
.fillna(0)
.astype(int)
.sort_index(axis='columns'))
print (df)
optional/mandatory mandatory optional
Keyword aws dotnet ec2 devops java python
Candidate Name
User1 0 0 1 1 1 1
User2 1 0 0 1 0 0
User3 0 0 1 1 1 0