I have a pandas DataFrame:
index | DevType | Count |
---|---|---|
1 | Developer, back-end | 3086 |
2 | Developer, back-end;Developer, front-end;Devel... | 2227 |
3 | Developer, back-end;Developer, full-stack | 1476 |
4 | Developer, front-end | 1401 |
5 | Developer, back-end;Developer, desktop or ente... | 605 |
6 | Developer, embedded applications or devices | 433 |
This was produced by applying .value_counts() to a column. As you can see, "Developer" is repeated because it is combined with other answers. From this DataFrame I want to build a possiblewords list so that I can later count how many times each word occurs.
I tried the code below to find the unique values first:
unqlist=list(df_new['DevType'].unique())
Using 'unqlist', I then tried to separate the distinct words with the code below:
possiblewords=[]
for word in unqlist:
    print(word.split(','))
    possiblewords.append(word)
It's not working: possiblewords still contains the full combined strings.
CodePudding user response:
Here is an example:
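# devtype_list is a list of DevType strings; ';' is treated like ',' before splitting
list({p.strip() for s in devtype_list if isinstance(s, str) for p in s.replace(';', ',').split(',')})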
CodePudding user response:
You can split each string using , and ; as delimiters to separate the unique words.
def split_words(x):
    return sum(list(map(lambda y: y.split(";"), x.split(','))), [])
devtype_list = ['Developer, desktop or enterprise applications;Developer, full-stack', 'Developer, full-stack;Developer, mobile', 'nan', 'Designer;Developer, front-end;Developer, mobile', 'Developer, back-end;Developer, front-end;Developer, QA or test;DevOps specialist', 'Developer, back-end;Developer, desktop or enterprise applications;Developer, game or graphics', 'Developer, full-stack', 'Database administrator;']
# Split every entry and strip whitespace first, then deduplicate and drop empty pieces
newlist = list(map(lambda x: x.strip(), sum(list(map(split_words, devtype_list)), [])))
newlist = list(set(filter(None, newlist)))
for unique_word in newlist:
    print(unique_word)
Result:
Developer
front-end
Designer
desktop or enterprise applications
game or graphics
mobile
Database administrator
QA or test
DevOps specialist
nan
back-end
full-stack
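The same idea can probably be written more directly with re.split, which handles both delimiters in one pass. This is only a sketch: the unique_words helper and the shortened devtype_list are made up for illustration, and it assumes every entry is a string.

```python
import re

# Hypothetical sample in the same shape as the DevType answers.
devtype_list = [
    'Developer, full-stack;Developer, mobile',
    'Designer;Developer, front-end',
    'Database administrator;',
]

def unique_words(values):
    """Split each string on ',' or ';', strip whitespace, and drop empty pieces."""
    words = set()
    for value in values:
        for part in re.split(r'[;,]', value):
            part = part.strip()
            if part:  # skip the empty piece left by a trailing delimiter
                words.add(part)
    return sorted(words)

possiblewords = unique_words(devtype_list)
print(possiblewords)
```

Stripping before adding to the set means ' mobile' and 'mobile' cannot end up as two separate entries, and sorting keeps the output order stable between runs.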
CodePudding user response:
You can use pandas' .str.split() to split on commas and semicolons and put the result in a NumPy array. Then flatten the nested lists into a single list and use np.unique to get the unique words, as follows:
import numpy as np
list_all = df_new['DevType'].str.split(r'(?:,|;)\s*').dropna().to_numpy()
list_unique = np.unique(sum(list_all, []))
Result:
print(list_unique)
['Devel...' 'Developer' 'back-end' 'desktop or ente...'
'embedded applications or devices' 'front-end' 'full-stack']
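Since the question also mentions counting each word later, here is a rough sketch of how that might look with .explode() and a groupby. The small df_new below is a made-up stand-in for the real table, and the Word column name is an assumption of this example.

```python
import pandas as pd

# Hypothetical small frame in the same shape as the question's table.
df_new = pd.DataFrame({
    'DevType': [
        'Developer, back-end',
        'Developer, back-end;Developer, full-stack',
        'Developer, front-end',
    ],
    'Count': [3086, 1476, 1401],
})

# One row per word: split on ',' or ';' (with optional spaces), then explode the lists.
words = (
    df_new.assign(Word=df_new['DevType'].str.split(r'\s*[;,]\s*'))
          .explode('Word')
          .reset_index(drop=True)
)
words = words[words['Word'].str.len() > 0]  # drop empties from trailing delimiters

# Sum the original Count over every occurrence of each word.
word_counts = words.groupby('Word')['Count'].sum().sort_values(ascending=False)
print(word_counts)
```

Note that a word appearing twice in one answer (like "Developer" in the second row) contributes that row's Count twice. If each answer should count only once per word, dropping duplicate (row, word) pairs before the groupby would be one way to handle it.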