Split the distinct values in a list separated by a comma

I have a pandas dataframe:

index	DevType	Count
1	Developer, back-end	3086
2	Developer, back-end;Developer, front-end;Devel...	2227
3	Developer, back-end;Developer, full-stack	1476
4	Developer, front-end	1401
5	Developer, back-end;Developer, desktop or ente...	605
6	Developer, embedded applications or devices	433

This is the result of applying .value_counts() to a column. As you can see, Developer is repeated because it is combined with other answers. From this dataframe I want to create a possiblewords list of the distinct words, so that I can later count how many times each of them occurs.

I tried the code below to find the unique values first:

unqlist=list(df_new['DevType'].unique())

Using unqlist, I tried to separate the distinct words with the code below:

possiblewords=[]
for word in unqlist:
    print(word.split(','))
   possiblewords.append(word)

It's not working.

CodePudding user response：

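Here is an example, assuming devtype_list holds the DevType values (unqlist in the question). It skips non-string entries such as NaN, splits on both delimiters, and strips the surrounding whitespace:

import re

list({w.strip() for w in re.split(r'[,;]', ';'.join(filter(lambda x: isinstance(x, str), devtype_list))) if w.strip()})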

CodePudding user response：

You can split each string in the list on both , and ; to separate the unique words.

def split_words(x):
    # Split on commas first, then split each piece on semicolons, and flatten
    return sum([part.split(';') for part in x.split(',')], [])

devtype_list = ['Developer, desktop or enterprise applications;Developer, full-stack', 'Developer, full-stack;Developer, mobile', 'nan', 'Designer;Developer, front-end;Developer, mobile', 'Developer, back-end;Developer, front-end;Developer, QA or test;DevOps specialist', 'Developer, back-end;Developer, desktop or enterprise applications;Developer, game or graphics', 'Developer, full-stack', 'Database administrator;']
# Flatten all the split pieces, strip whitespace, then deduplicate
words = sum(map(split_words, devtype_list), [])
newlist = list(set(word.strip() for word in words))

for unique_word in newlist:
    print(unique_word)

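Result (the order varies because a set is unordered; the blank line is the empty string left by the trailing ';' in the last entry):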

Developer
front-end

Designer
desktop or enterprise applications
game or graphics
mobile
Database administrator
QA or test
DevOps specialist
nan
back-end
full-stack
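
The output above still contains an empty string and the 'nan' placeholder. If those are not wanted, they can be dropped while splitting. The following is a minimal sketch, not part of the original answer: it uses re.split with a character class for both delimiters, and the helper name unique_words and the choice to treat the literal string 'nan' as missing are assumptions made for illustration.

```python
import re

def unique_words(values, missing=("", "nan")):
    """Return the sorted distinct words found in comma/semicolon-separated strings."""
    words = set()
    for value in values:
        if not isinstance(value, str):
            continue  # skip real NaN (float) or None entries
        for piece in re.split(r"[,;]", value):
            piece = piece.strip()
            if piece not in missing:
                words.add(piece)
    return sorted(words)

# Small illustrative sample, not the full dataset
sample = [
    "Developer, full-stack;Developer, mobile",
    "nan",
    "Database administrator;",
]
print(unique_words(sample))
# ['Database administrator', 'Developer', 'full-stack', 'mobile']
```

Stripping before adding to the set means that 'front-end' and ' front-end' cannot end up as two separate entries.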

CodePudding user response：

You can use pandas .str.split() with a regular expression to split on commas and semicolons, then convert the result to a NumPy array of lists. Concatenate those lists into one flat list with sum(..., []) and pass it to np.unique to get the unique words, as follows:

import numpy as np

list_all = df_new['DevType'].str.split(r'(?:,|;)\s*').dropna().to_numpy()

list_unique = np.unique(sum(list_all, []))

Result:

print(list_unique)

['Devel...' 'Developer' 'back-end' 'desktop or ente...'
 'embedded applications or devices' 'front-end' 'full-stack']
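
The question's end goal is to count how often each word occurs. Below is a minimal sketch of that step using only the standard library. It assumes that each row's Count (from .value_counts()) should be credited once to every distinct word in that row; the helper name count_words is hypothetical, and the sample rows are the untruncated rows from the table in the question.

```python
import re
from collections import Counter

def count_words(rows):
    """Sum the row counts for every distinct word appearing in each row."""
    totals = Counter()
    for text, count in rows:
        # Split on ',' or ';', strip whitespace, drop empty pieces;
        # the set ensures a word repeated within one row is counted once.
        words = {w.strip() for w in re.split(r"[,;]", text) if w.strip()}
        for word in words:
            totals[word] += count
    return totals

# (DevType, Count) pairs copied from the untruncated rows in the question
rows = [
    ("Developer, back-end", 3086),
    ("Developer, back-end;Developer, full-stack", 1476),
    ("Developer, front-end", 1401),
    ("Developer, embedded applications or devices", 433),
]
totals = count_words(rows)
for word, total in totals.most_common():
    print(word, total)
```

With a dataframe that has the DevType and Count columns shown in the question, the rows could be supplied as zip(df['DevType'], df['Count']) after dropping missing values, where df is a placeholder name for that frame.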