Turning keywords into lists in Python dataframe columns

I've extracted key words from a different column to make a new column (hard skills) that looks like this: (https://i.stack.imgur.com/XNOIK.png)

But I want the key words in each row of the "hard skills" column to be stored as a list. For example, for the first row of the "hard skills" column, my desired outcome would be:

['Python Programming', 'Machine Learning', 'Data Analysis'...]

instead of

Python Programming,Machine Learning,Data Analysis...

This is how I filtered the key words out into the new "hard skills" column.

# Filter and make new column for hard skills
hard_skills = ['Python Programming', 'Statistics', 'Statistical Hypothesis Testing',
               'Data Cleansing', 'Tensorflow', 'Machine Learning', 'Data Analysis',
               'Data Visualization', 'Cloud Computing', 'R Programming', 'Data Science',
               'Computer Programming', 'Deep Learning', 'Data Analysis', 'SQL',
               'Regression Analysis', 'Algorithms', 'JavaScript', 'Python']

def get_hard_skills(skills):
    return_hard_skills = ""
    for hard_skill in hard_skills:
        if skills.find(hard_skill) >= 0:
            if return_hard_skills == "":
                return_hard_skills = hard_skill
            else:
                return_hard_skills = return_hard_skills + "," + hard_skill
    if return_hard_skills == "":
        return ('not found')
    else:
        return return_hard_skills

course_name_skills['hard skills'] = ""

#loop through data frame to get hard skills

for i in range(0, len(course_name_skills)):  # every single line of the data science courses
    skills = course_name_skills.loc[i, "skills"]
    if not pd.isna(skills):  # if not empty
        course_name_skills.loc[i, "hard skills"] = get_hard_skills(skills)

course_name_skills = course_name_skills.replace('not found', np.nan)
only_hardskills = course_name_skills.dropna(subset=['hard skills'])

Is there a way I could change the code that filtered the data frame for key words? Or is there a more efficient way?

I tried .strip() and even tried my luck with

return_hard_skills = "[" + return_hard_skills + "," + hard_skill + "]"

But it didn't work.

[Dataframe with original column]

random, but needed, true or false table generated

CodePudding user response：

IIUC, there is no need for functions and/or loops here, since you can use pandas.Series.str.join to get your expected column/output:

course_name_skills["hard skills"]= course_name_skills["skills"].str.join(",")

NB: The line above assumes that the column skills holds lists; otherwise (if it holds strings), use this:

course_name_skills["hard skills"] = (
    course_name_skills["skills"]
    .str.strip("[]")
    .replace({"'": "", r"\s+": ""}, regex=True)
)
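
To make the direction of that conversion concrete, here is a minimal sketch on made-up data (the two sample rows are assumptions, not the asker's real frame). It shows that str.join goes from a column of lists to comma-separated strings, so it only applies when "skills" already holds lists; going from strings to lists is a job for str.split instead.

```python
import pandas as pd

# Hypothetical sample data: the "skills" column holds Python lists.
course_name_skills = pd.DataFrame({
    "skills": [["Python Programming", "Machine Learning"], ["SQL"]],
})

# str.join concatenates the elements of each list with the separator.
course_name_skills["hard skills"] = course_name_skills["skills"].str.join(",")

print(course_name_skills["hard skills"].tolist())
# ['Python Programming,Machine Learning', 'SQL']
```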

CodePudding user response：

df['skills'] = df['hard skills'].str.split(',').apply(
    lambda skills: [skill.strip() for skill in skills]
)
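
As a rough illustration of the snippet above, here is the same split run on a small made-up frame (the course names and skill strings are assumptions). Each comma-separated string becomes a list, and strip() removes stray spaces around the commas.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's frame.
df = pd.DataFrame({
    "course_name": ["Course A", "Course B"],
    "hard skills": ["Python Programming, Machine Learning,Data Analysis", "SQL"],
})

# Split each string on commas, then trim whitespace from every piece.
df["skills"] = df["hard skills"].str.split(",").apply(
    lambda skills: [skill.strip() for skill in skills]
)

print(df["skills"].tolist())
# [['Python Programming', 'Machine Learning', 'Data Analysis'], ['SQL']]
```

Note that this assumes every row holds a string. A missing value stays NaN after str.split, and the lambda would then fail because NaN is not iterable, so rows without hard skills should be dropped first (as only_hardskills does in the question).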

And if you want to add filtering:

skills = list(set(df['skills'].sum()))
for skill in skills:
  df[skill] = df['skills'].apply(lambda x: skill in x)

df.loc[df['Data Analysis'], 'course_name']
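
Here is a minimal, self-contained sketch of that filtering step on made-up data (the three courses are assumptions). It collects the distinct skills with Series.explode rather than summing the lists, which is just one possible way to flatten them, and then builds one boolean column per skill as in the loop above.

```python
import pandas as pd

# Hypothetical sample data; "skills" already holds lists of strings.
df = pd.DataFrame({
    "course_name": ["Course A", "Course B", "Course C"],
    "skills": [
        ["Python Programming", "Data Analysis"],
        ["SQL"],
        ["Data Analysis", "SQL"],
    ],
})

# Flatten the lists to one skill per row and keep the distinct values.
all_skills = sorted(set(df["skills"].explode()))

# One True/False column per skill: does this course list it?
for skill in all_skills:
    df[skill] = df["skills"].apply(lambda x: skill in x)

# Use the boolean column as a row mask and pick the course names.
matches = df.loc[df["Data Analysis"], "course_name"].tolist()
print(matches)
# ['Course A', 'Course C']
```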