List of unique characters of a dataset

I have a dataset in a dataframe and I want to see the total number of characters and the list of unique characters.

For the total number of characters, I have implemented the following code, which seems to work well:

df["Preprocessed_Text"].str.len().sum()

Could you please let me know how to get a list with the unique characters (not including the space)?

CodePudding user response：

Try this, which keeps only the ASCII letters that occur in the column:

from string import ascii_letters

chars = set(''.join(df["Preprocessed_Text"])).intersection(ascii_letters)

If you need to work with a different alphabet, then simply replace ascii_letters with whatever you need.

If you want every character except the space, then:

chars = set(''.join(df["Preprocessed_Text"]).replace(' ', ''))
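One caveat with both snippets above: ''.join raises a TypeError if the column contains missing values (None or NaN). Below is a minimal sketch of a helper that skips non-string entries. The name unique_chars and the sample list are hypothetical and only for illustration; the list stands in for the column.

```python
def unique_chars(texts, exclude=" "):
    """Return the sorted unique characters found in an iterable of strings."""
    seen = set()
    for t in texts:
        if isinstance(t, str):  # skip None / NaN entries instead of failing
            seen.update(t)      # adds each character of t to the set
    return sorted(seen - set(exclude))

# Hypothetical sample standing in for df["Preprocessed_Text"]
sample = ["hello world", "data set", None, float("nan")]
chars = unique_chars(sample)
print(chars)
```

With a dataframe, this could presumably be called as unique_chars(df["Preprocessed_Text"]), since iterating over a Series yields its values.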

CodePudding user response：

unichars = list(''.join(df["Preprocessed_Text"]).replace(' ', ''))
# unique characters (space excluded), in order of first appearance
print(sorted(set(unichars), key=unichars.index))
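Sorting by unichars.index returns the characters in order of first appearance, but each index lookup scans the list again, which can get slow on a large dataset. A possible alternative, sketched here under the assumption of Python 3.7+ (where dicts keep insertion order), is dict.fromkeys, which de-duplicates in a single pass. The name unique_in_order and the sample data are hypothetical.

```python
def unique_in_order(texts, exclude=" "):
    """Unique characters in order of first appearance, skipping `exclude`."""
    joined = "".join(t for t in texts if isinstance(t, str))
    # dict.fromkeys keeps the first occurrence of each key, in order
    return [ch for ch in dict.fromkeys(joined) if ch not in exclude]

# Hypothetical sample standing in for df["Preprocessed_Text"]
sample = ["hello world", "data set"]
ordered = unique_in_order(sample)
print(ordered)
```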

CodePudding user response：

unique = list({letter for letter in ''.join(df["Preprocessed_Text"]) if letter != " "})