I have a pandas DataFrame with an object-type column that holds profiling values per user, like this:
print(df["profile_values"])
1 [\n "ab",\n "abc",\n "abcd"\n]
1 NaN
3 [\n "ab",\n "abcd"\n]
4 NaN
5 [\n "ab"\n]
...
I need to trim the values or change the datatype so that I can count the words and find their distribution, like so:
word count
ab 3
abc 1
abcd 2
...
Any advice?
EDIT: The values are strings.
CodePudding user response:
If your DataFrame is something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    ['[\n "ab",\n "abc",\n "abcd"\n]', np.nan,
     '[\n "ab",\n "abcd"\n]', np.nan, '[\n "ab"\n]'],
    columns=["profile_values"],
)
There are a few steps: ast.literal_eval converts each string literal into a list of strings, explode turns the series of lists into a single flat series, and value_counts counts the repeated values:
from ast import literal_eval
counts = df.dropna()["profile_values"].apply(literal_eval).explode().value_counts()
output:
ab 3
abcd 2
abc 1
Name: profile_values, dtype: int64
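If you want the result in the two-column word/count layout shown in the question, the counts Series can be turned into a DataFrame. A small sketch of the whole pipeline (the column names "word" and "count" are taken from the desired output, not from any existing API default):

```python
import numpy as np
import pandas as pd
from ast import literal_eval

df = pd.DataFrame(
    ['[\n "ab",\n "abc",\n "abcd"\n]', np.nan,
     '[\n "ab",\n "abcd"\n]', np.nan, '[\n "ab"\n]'],
    columns=["profile_values"],
)

# Parse the string literals, flatten the lists, and count occurrences
counts = (
    df["profile_values"]
    .dropna()
    .apply(literal_eval)
    .explode()
    .value_counts()
)

# Reshape the Series into a word/count table
result = counts.rename_axis("word").reset_index(name="count")
print(result)
```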
CodePudding user response:
I would do it in three steps.
Create a new column by transforming the strings in the "profile_values" column into actual lists, using an empty list for the NaN rows (eval cannot parse a float NaN):
df['new_column'] = [eval(x) if isinstance(x, str) else [] for x in df['profile_values']]
Flatten the new column into a single list outside the DataFrame, containing all the words:
single_list = [x for y in df['new_column'] for x in y]
Count the frequencies with collections.Counter:
import collections
counter = collections.Counter(single_list)
Then you have the frequency of every word in the list.
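The three steps above can be sketched end to end like this (a minimal runnable example, assuming the sample data from the question; the NaN guard is my addition, since eval fails on non-string values):

```python
import collections

import numpy as np
import pandas as pd

df = pd.DataFrame(
    ['[\n "ab",\n "abc",\n "abcd"\n]', np.nan, '[\n "ab"\n]'],
    columns=["profile_values"],
)

# Step 1: parse each string into a list; NaN rows become empty lists
df["new_column"] = [
    eval(x) if isinstance(x, str) else [] for x in df["profile_values"]
]

# Step 2: flatten all the per-row lists into one list of words
single_list = [word for row in df["new_column"] for word in row]

# Step 3: count how often each word appears
counter = collections.Counter(single_list)
print(counter)  # Counter({'ab': 2, 'abc': 1, 'abcd': 1})
```

Note that ast.literal_eval (as in the other answer) is safer than eval for parsing untrusted strings, since it only accepts Python literals.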