I have dataframe like this
Keyword devops aws ec2 python java dotnet
optional/mandatory optional mandatory mandatory optional optional mandatory
Candidate Name
User1 1 1 1 1 1 1
User2 1 0 0 1 1 1
... ... ... ... ... ... ...
User67 1 1 1 0 0 0
I want to calculate percentage match for each Candidate Name and add new Percentage match column also the row optional/mandatory play important role in calculating percentage. Like the keyword aws ec2 dotnet are mandatory if there count is 0 then set the percentage match for that Candidate to 0%. Means all mandatory keywords must have count 1 then only calculate the percentage of that cancidate.
Expected Output:
Keyword devops aws ec2 python java dotnet Percentage
optional/mandatory optional mandatory mandatory optional optional mandatory Match
Candidate Name
User1 1 1 1 1 1 1 100%
User2 1 1 1 0 0 1 66.66%
... ... ... ... ... ... ...
User67 1 1 1 0 0 0 0%
for calculate percentage:
User1 is 100% as all mandatory have count 1
User2 is 66.66% as two optional have 0 count then Total Number of ons's/Total Keywords*100
(4/6)100
User3 is 0% as out of 3 mandatory one have zero count to directly the percentage is set to 0% irrespective of the optional keyword count
Keyword row and Candidate Name column keeps changing there can be more keywords and Candidate names in future. How to Implement the desire output.
My Code:
import numpy as np
import pandas as pd
read_csv = pd.read_csv('Sample.csv')
df = pd.DataFrame(read_csv)
df["Count"] = np.where(df["Count"] > 0, 1, 0)
print(df)
df = df.pivot_table(index="Candidate Name", columns=["optional/mandatory","Keyword"], values="Count")
print(df)
Output:
Candidate Name optional/mandatory Keyword Count
User1 optional devops 1
User1 mandatory aws 0
User1 mandatory ec2 1
User1 optional python 1
User1 optional java 1
User1 mandatory dotnet 0
User2 optional devops 1
User2 mandatory aws 1
User2 mandatory ec2 0
User3 optional devops 1
User3 mandatory ec2 1
User3 mandatory aws 0
User3 optional java 1
optional/mandatory mandatory optional
Keyword aws dotnet ec2 devops java python
Candidate Name
User1 0.0 0.0 1.0 1.0 1.0 1.0
User2 1.0 NaN 0.0 1.0 NaN NaN
User3 0.0 NaN 1.0 1.0 1.0 NaN
CodePudding user response:
I took the liberty of reformatting your question a bit. Here's the setup:
import pandas as pd
# Set up your indices
cols = pd.MultiIndex.from_arrays([
['optional', 'mandatory', 'mandatory', 'mandatory', 'optional', 'optional'],
['devops', 'aws', 'ec2', 'dotnet', 'python', 'java']
])
# Set up the data however make sense,
# it's nice to keep name and values together
data = [
['User 1', 1,1,1,1,1,1],
['User 2',1,0,0,1,1,1],
# ...,
['User N',0,1,1,1,0,0]
]
# We can un-pack the name (index) and values with
# slicing and list comprehensions
df = pd.DataFrame(
columns=cols,
data=[applicant[1:] for applicant in data],
index=pd.Index([applicant[0] for applicant in data], name='Candidate Name')
)
# This sorts by index hirecery so you'll get all the
# mandatory items then all the optional items
df.sort_index(axis='columns', inplace=True)
And then, we can use apply
to run a function against each of the rows:
# You could use a lambda here but for readability
# we can also spell it out with a named function
# first we'll check if we have all mandatory items
# satisfied and then decide if it's a 0-coverage
# or percentage-coverage
def calculate_skill_coverage_percent(r):
if sum(r['mandatory']) < len(r['mandatory']):
return 0
else:
return sum(r) / len(r)
df['Percent_Match'] = df.apply(calculate_skill_coverage_percent, axis='columns')
print(df)
produces
mandatory optional Percent_Match
aws dotnet ec2 devops java python
Candidate Name
User 1 1 1 1 1 1 1 1.0
User 2 0 1 0 1 1 1 0.0
...
User N 1 1 1 0 0 0 0.5
CodePudding user response:
Creation of dataframe df
data={'Names': [ '','User1', 'User2','User67','User9080'],
'devops': ['optional',1,1,1,1],
'aws' : ['mandatory',1,0,1,1],
'ec2' : ['mandatory',1,0,1,1],
'python': ['optional',1,1,0,0],
'java' : ['optional',1,1,0,0],
'dotnet' : ['mandatory',1,1,0,1]}
df=pd.DataFrame(data)
df
# **Out:**
Names devops aws ec2 python java dotnet
0 optional mandatory mandatory optional optional mandatory
1 User1 1 1 1 1 1 1
2 User2 1 0 0 1 1 1
3 User67 1 1 1 0 0 0
4 User90 1 1 1 0 0 1
Data preparation:
# get as row col names
import numpy as np
df=pd.DataFrame(np.vstack([df.columns, df]))
# get as column mandatorys row
df.columns= df.iloc[1]
#elimination as row (duplicated)
df=df.drop(1)
# set "users" column as index for rows (disturbs later if not)
df.set_index('', inplace=True)
# saving and wiping old column names (python,aws,etc) disturbs later
names=df.iloc[0].copy()
df=df.drop('Names')
df
# ***Out***
1 optional mandatory mandatory optional optional mandatory
User1 1 1 1 1 1 1
User2 1 0 0 1 1 1
User67 1 1 1 0 0 0
User90 1 1 1 0 0 1
Next
# lenght of items (to calculate %)
items=len(df.columns)
# Function
def condiciones(df):
if df['mandatory'].all() == 1 :
return (df.sum(axis=0) / items) *100
else:
return 0
#apply
df['ya']=df.apply(condiciones,axis=1)
df
# *** Out:
1 optional mandatory mandatory optional optional mandatory ya
User1 1 1 1 1 1 1 100.000000
User2 1 0 0 1 1 1 0.000000
User67 1 1 1 0 0 0 0.000000
User90 1 1 1 0 0 1 66.666667
Reinserting
# rescuing users from index
df['Users'] = df.index
# Rescuing wiped row (aws,python...)
names.to_list() # add 'SCORE' at the end, is one more row
# add 'Users' at the end, as index is now row
df = (df.T.reset_index().T.reset_index(drop=True)
.set_axis(['devops', 'aws', 'ec2', 'python', 'java',
'dotnet',"SCORE","Users"], axis=1))
# Moving Users to first
cols = list(df.columns)
cols = [cols[-1]] cols[:-1]
df = df[cols]
Result
Out:
Users devops aws ec2 python java dotnet SCORE
0 Users optional mandatory mandatory optional optional mandatory ya
1 User1 1 1 1 1 1 1 100.0
2 User2 1 0 0 1 1 1 0.0
3 User67 1 1 1 0 0 0 0.0
4 User90 1 1 1 0 0 1 66.67