I'm trying to turn every skill in job_skills into its own attribute (column), encoded as zero or one. How can I do that?
Note: I'm trying to create a data frame, but filling it in manually is not practical (the code is below). I'm looking for a way to extract the lists from the column, because I need to apply ML algorithms to this data.
import pandas as pd

data = [['a', ['Python', 'UI',' Information Technology (IT)','Software Development','GTK','English',' Software Engineering']],
['b', ['Python', 'Relational Databases',' Celery',' VMWare','Django','Continous Integration',' Test Driven Development',' HTTP']],
['c', ['Flask', 'Python',' Celery',' Software Development',' Computer Science','Information Technology (IT)']],
['c', ['Flask', 'Python',' Celery',' Software Development',' Computer Science','Information Technology (IT)']]
]
df1 = pd.DataFrame(data, columns=['col1', 'col2'])
# One indicator column per skill; strip stray spaces so ' Celery' and 'Celery' share a column
pd.get_dummies(df1['col2'].explode().str.strip()).groupby(level=0).sum()
CodePudding user response:
I can't think of anything in pandas that does this out of the box. If I understand correctly, you want one-hot variables for each skill for each person (row). Do you have a unique identifier for each job? If not, you need one. In the example below I use the row index.
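skills = []
rows = []
# Build one (row index, skill) pair per skill in each job's list
for index, row in df.iterrows():
    for item in row['job_skills']:
        rows.append(index)
        skills.append(item)
# Long format: one line per (row, skill) pair
df = pd.DataFrame({'row': rows, 'skills': skills})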
Once you have df you can follow the same logic here:
# Input used:
print(df.to_string())
job_title company location job_skills
0 Python Or ItsTime Oakville, ['Python', 'UI', 'Computer Science', '. Information Technology (IT)', 'Software Development']
1 Senior Pyt CLOUDSIG Sofia, Bul ['Python3', 'Relational Databases', '. Celery', 'VMWare', '. Django',' Continous Integration']
2 Flask Pyth Cyber sec Cairo, Egy ['Flask', 'Python', '. Software Development', '. Computer Science', '. Information Technology (IT)']
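The remaining step, going from one line per (row, skill) pair to one 0/1 column per skill, could be sketched roughly as follows. This is only an illustration under assumptions: the sample frame is made up to mirror the layout above, and `pd.crosstab` is one possible way to do the pivot, not necessarily the approach the answer had in mind.

```python
import pandas as pd

# Hypothetical sample mirroring the question's layout (values are illustrative).
df = pd.DataFrame({
    "job_title": ["a", "b", "c"],
    "job_skills": [
        ["Python", "UI", " Software Development"],
        ["Python", " Celery", "Django"],
        ["Flask", "Python", " Celery", "Software Development"],
    ],
})

# One line per (row, skill) pair; strip stray spaces so duplicates collapse.
long_df = df["job_skills"].explode().str.strip().rename("skills").reset_index()
long_df = long_df.rename(columns={"index": "row"})

# Count each skill per row, then cap at 1 to get a 0/1 indicator.
one_hot = pd.crosstab(long_df["row"], long_df["skills"]).clip(upper=1)

# Attach the indicator columns back to the original frame by row index.
encoded = df.drop(columns="job_skills").join(one_hot)
print(encoded)
```

Because the row index is used as the identifier, the join lines up even when job titles repeat.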