This is the DataFrame:
import pandas as pd
data = {"Company" : [["ConsenSys"] , ["Cognizant"], ["IBM"], ["IBM"], ["Reddit, Inc"], ["Reddit, Inc"], ["IBM"]],
"skills" : [['services', 'scientist technical expertise', 'databases'], ['datacomputing tools experience', 'deep learning models', 'cloud services'], ['quantitative analytical projects', 'financial services', 'field experience'],
['filesystems server architectures', 'systems', 'statistical analysis', 'data analytics', 'workflows', 'aws cloud services'], ['aws services'], ['data mining statistics', 'statistical analysis', 'aws cloud', 'services', 'data discovery', 'visualization'], ['communication skills experience', 'services', 'manufacturing environment', 'sox compliance']]}
dff = pd.DataFrame(data)
dff
- I need to create a new column that holds the specific words found in the skills column.
- Rows that do not include any of those specific words should then be deleted.
- Specific words: 'services', 'statistical analysis'
Expected Output:
| | Company | skills | new_col |
|---|---|---|---|
| 0 | [ConsenSys] | [services, scientist technical expertise, databases] | [services] |
| 1 | [IBM] | [filesystems server architectures, systems, statistical analysis, data analytics, workflows, aws cloud services] | [services, statistical analysis] |
| 2 | [Reddit, Inc] | [data mining statistics, statistical analysis, aws cloud, services, data discovery, visualization] | [statistical analysis] |
| 3 | [IBM] | [communication skills experience, services, manufacturing environment, sox compliance] | [services] |
I tried quite a lot of code from Stack Overflow for extracting specific words, but none of it worked.
CodePudding user response:
You can use a lambda with a list comprehension:
words = ["services", "statistical analysis"]
dff["found"] = dff["skills"].apply(lambda x: [w for w in words if w in x])
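The question also asks to drop the rows that contain none of the words. A minimal sketch of that follow-up step, assuming exact matching against the list elements and using a shortened three-row stand-in for the question's data (the name `filtered` is only illustrative):

```python
import pandas as pd

# Shortened stand-in for the question's data (three rows only).
dff = pd.DataFrame({
    "Company": [["ConsenSys"], ["Cognizant"], ["Reddit, Inc"]],
    "skills": [
        ["services", "scientist technical expertise", "databases"],
        ["datacomputing tools experience", "deep learning models", "cloud services"],
        ["data mining statistics", "statistical analysis", "aws cloud", "services"],
    ],
})

words = ["services", "statistical analysis"]

# Keep only exact matches, in the order given by `words`.
dff["found"] = dff["skills"].apply(lambda skills: [w for w in words if w in skills])

# Drop rows whose match list is empty; .str.len() also works on list-valued cells.
filtered = dff[dff["found"].str.len() > 0]
print(filtered[["Company", "found"]])
```

Because the comparison is exact, 'cloud services' in the Cognizant row does not count as 'services', so that row is dropped.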
CodePudding user response:
word = ['services', 'statistical analysis']
s1 = dff['skills'].apply(lambda x: [i for i in word if i in x])
Output (s1):
0 [services]
1 []
2 []
3 [statistical analysis]
4 []
5 [services, statistical analysis]
6 [services]
Name: skills, dtype: object
Assign s1 as new_col, then filter out the empty lists with boolean indexing:
dff.assign(new_col=s1)[lambda x: x['new_col'].astype('bool')]
Result:
Company skills new_col
0 [ConsenSys] [services, scientist technical expertise, data... [services]
3 [IBM] [filesystems server architectures, systems, st... [statistical analysis]
5 [Reddit, Inc] [data mining statistics, statistical analysis,... [services, statistical analysis]
6 [IBM] [communication skills experience, services, ma... [services]
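The result above keeps the original row labels (0, 3, 5, 6), while the expected output in the question is numbered 0 to 3. One possible way to get there, sketched on the question's full data, is to append `reset_index(drop=True)`; the name `result` and the use of `.map(bool)` in place of `.astype('bool')` are choices made for this sketch:

```python
import pandas as pd

data = {
    "Company": [["ConsenSys"], ["Cognizant"], ["IBM"], ["IBM"],
                ["Reddit, Inc"], ["Reddit, Inc"], ["IBM"]],
    "skills": [
        ["services", "scientist technical expertise", "databases"],
        ["datacomputing tools experience", "deep learning models", "cloud services"],
        ["quantitative analytical projects", "financial services", "field experience"],
        ["filesystems server architectures", "systems", "statistical analysis",
         "data analytics", "workflows", "aws cloud services"],
        ["aws services"],
        ["data mining statistics", "statistical analysis", "aws cloud",
         "services", "data discovery", "visualization"],
        ["communication skills experience", "services",
         "manufacturing environment", "sox compliance"],
    ],
}
dff = pd.DataFrame(data)
word = ["services", "statistical analysis"]

# For each row, collect the target words that appear as exact list elements.
s1 = dff["skills"].apply(lambda x: [i for i in word if i in x])

result = (
    dff.assign(new_col=s1)
       .loc[lambda x: x["new_col"].map(bool)]  # empty lists are falsy
       .reset_index(drop=True)                 # renumber rows from 0
)
print(result[["Company", "new_col"]])
```

Note that the match is exact, so 'aws cloud services' is not treated as 'services'; this is why the first IBM row that survives gets only [statistical analysis].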
I think you should make a simpler example.