How to clean or convert these data values to gain word count distribution?

Time:09-22

I have a pandas DataFrame with an object-type column that holds profiling values per user, like this:

print(df["profile_values"])
   
1    [\n "ab",\n "abc",\n "abcd"\n]
2    NaN
3    [\n "ab",\n "abcd"\n]
4    NaN
5    [\n "ab"\n]
...

I need to trim the values or change the datatype so that I can count the words and find their distribution, like so:

word    count
ab      3
abc     1
abcd    2
...

Any advice?

EDIT: The values are strings.

CodePudding user response:

If your DataFrame is something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(['[\n "ab",\n "abc",\n "abcd"\n]', np.nan, '[\n "ab",\n"abcd"\n]', np.nan, '[\n "ab"\n]'], columns=["profile_values"])

There are a few steps: drop the NaN rows, convert each string literal to a list with literal_eval, flatten the series of lists into a single series with explode, and then count the repeated values:

from ast import literal_eval
counts = df.dropna()["profile_values"].apply(literal_eval).explode().value_counts()

output:

ab      3
abcd    2
abc     1
Name: profile_values, dtype: int64
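If you want the exact word/count table shown in the question rather than a raw Series, one way (a sketch, assuming the `counts` Series from above) is to name the index and reset it into columns:

```python
import pandas as pd

# stand-in for the value_counts() result from the snippet above
counts = pd.Series([3, 2, 1], index=["ab", "abcd", "abc"], name="profile_values")

# rename_axis names the index "word"; reset_index turns it into a column
table = counts.rename_axis("word").reset_index(name="count")
print(table)
```

This yields a two-column DataFrame with `word` and `count` columns.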

CodePudding user response:

I would do three operations.

  1. Drop the NaN rows, then create a new column that transforms the string values of the "profile_values" column into actual lists:

    df = df.dropna(subset=['profile_values'])
    df['new_column'] = [eval(x) for x in df['profile_values']]
    
  2. Create a list outside the dataframe, containing all the words in a single list:

    single_list = [x for y in df['new_column'] for x in y]
    
  3. Use a function to count frequencies of lists, like the following:

    import collections
    counter = collections.Counter(single_list)
    

Then you have the frequencies for every word in the list.
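Putting the three steps together, here is a runnable sketch using the sample data from the first answer (note that `ast.literal_eval` would be a safer alternative to `eval` for untrusted input):

```python
import collections

import numpy as np
import pandas as pd

df = pd.DataFrame(
    ['[\n "ab",\n "abc",\n "abcd"\n]', np.nan, '[\n "ab",\n "abcd"\n]',
     np.nan, '[\n "ab"\n]'],
    columns=["profile_values"],
)

# 1. parse each string into an actual list, skipping the NaN rows
df = df.dropna(subset=["profile_values"]).copy()
df["new_column"] = [eval(x) for x in df["profile_values"]]

# 2. flatten the lists into one list of words
single_list = [x for y in df["new_column"] for x in y]

# 3. count the frequency of each word
counter = collections.Counter(single_list)
print(counter)  # Counter({'ab': 3, 'abcd': 2, 'abc': 1})
```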
