I have a pandas DataFrame with an object-type column that holds profiling values per user, like this:
print(df["profile_values"])
1 [\n "ab",\n "abc",\n "abcd"\n]
1 NaN
3 [\n "ab",\n "abcd"\n]
4 NaN
5 [\n "ab"\n]
...
I need to trim the values or change the datatype so that I can count the words and find their distribution, like so:
word count
ab 3
abc 1
abcd 2
...
Any advice?
EDIT: The values are strings.
CodePudding user response:
If your DataFrame is something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    ['[\n "ab",\n "abc",\n "abcd"\n]', np.nan,
     '[\n "ab",\n "abcd"\n]', np.nan, '[\n "ab"\n]'],
    columns=["profile_values"],
)
There are a few steps: ast.literal_eval converts each string literal into a list of strings, explode turns the series of lists into a single flat series, and value_counts counts the repeated values:
from ast import literal_eval
counts = df.dropna()["profile_values"].apply(literal_eval).explode().value_counts()
output:
ab 3
abcd 2
abc 1
Name: profile_values, dtype: int64
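If you want the result in the two-column word/count layout shown in the question, the counts Series can be turned into a DataFrame. A small sketch of the whole pipeline (the column names "word" and "count" are taken from the desired output, not from any existing API default):

```python
import numpy as np
import pandas as pd
from ast import literal_eval

df = pd.DataFrame(
    ['[\n "ab",\n "abc",\n "abcd"\n]', np.nan,
     '[\n "ab",\n "abcd"\n]', np.nan, '[\n "ab"\n]'],
    columns=["profile_values"],
)

# Parse the string literals, flatten the lists, and count occurrences
counts = (
    df["profile_values"]
    .dropna()
    .apply(literal_eval)
    .explode()
    .value_counts()
)

# Reshape the Series into a word/count table
result = counts.rename_axis("word").reset_index(name="count")
print(result)
```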
CodePudding user response:
I would do it in three steps.
Create a new column by transforming the strings in the "profile_values" column into actual lists, using an empty list for the NaN rows (eval cannot parse a float NaN):
df['new_column'] = [eval(x) if isinstance(x, str) else [] for x in df['profile_values']]
Flatten the new column into a single list outside the DataFrame, containing all the words:
single_list = [x for y in df['new_column'] for x in y]
Count the frequencies with collections.Counter:
import collections
counter = collections.Counter(single_list)
Then you have the frequency of every word in the list.
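The three steps above can be sketched end to end like this (a minimal runnable example, assuming the sample data from the question; the NaN guard is my addition, since eval fails on non-string values):

```python
import collections

import numpy as np
import pandas as pd

df = pd.DataFrame(
    ['[\n "ab",\n "abc",\n "abcd"\n]', np.nan, '[\n "ab"\n]'],
    columns=["profile_values"],
)

# Step 1: parse each string into a list; NaN rows become empty lists
df["new_column"] = [
    eval(x) if isinstance(x, str) else [] for x in df["profile_values"]
]

# Step 2: flatten all the per-row lists into one list of words
single_list = [word for row in df["new_column"] for word in row]

# Step 3: count how often each word appears
counter = collections.Counter(single_list)
print(counter)  # Counter({'ab': 2, 'abc': 1, 'abcd': 1})
```

Note that ast.literal_eval (as in the other answer) is safer than eval for parsing untrusted strings, since it only accepts Python literals.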