Counting text elements in pandas DataFrame where items are not strictly unique

Time:07-29

I am trying to count specific words or phrases per row in a DataFrame, but the strings overlap and are not strictly unique. Consider this set of strings, each of which I want counted per row:

Set of Strings

'burgers','burgers legacy','burgers V2'

Now consider these lines of data as a DataFrame column with "Text" as the column name:

Text
burgers V2 satisfies these requirment 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for  requirement 3; burgers answers requements 4
burgers  satisifes requirement 1; burgers is on the roadmap for requirement 2; burgers V2  satisfies requirement 3; burgers legacy satisfies requirement 4
burgers legacy satisfies requirements 1; burgers V2 satisifies requirements 2; burgers V2 satisifes requiremetns 3

The problem becomes pretty obvious: 'burgers' is a substring of the other two items, so doing something like this over-counts 'burgers':

df['burgers'] = df['Text'].str.count('burgers')
df['burgers legacy'] = df['Text'].str.count('burgers legacy')
df['burgers V2'] = df['Text'].str.count('burgers V2')
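
A minimal reproduction of the over-count, assuming the three rows above are typed in by hand as a DataFrame (the construction and the name `naive` are illustrative only):

```python
import pandas as pd

# The three sample rows from the question, copied verbatim (typos included).
df = pd.DataFrame({"Text": [
    "burgers V2 satisfies these requirment 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for  requirement 3; burgers answers requements 4",
    "burgers  satisifes requirement 1; burgers is on the roadmap for requirement 2; burgers V2  satisfies requirement 3; burgers legacy satisfies requirement 4",
    "burgers legacy satisfies requirements 1; burgers V2 satisifies requirements 2; burgers V2 satisifes requiremetns 3",
]})

# A plain substring count also matches the 'burgers' inside
# 'burgers V2' and 'burgers legacy'.
naive = df["Text"].str.count("burgers")
print(naive.tolist())  # [4, 4, 3]
```

The third row, for example, contains no standalone 'burgers' at all, yet it is reported as 3.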

A few points that are likely important. When 'burgers' is not immediately followed by ' V2' or ' legacy' (leading white space included), it is a separate item that should be counted on its own.

However, I cannot rely on a consistent trailing character or word after 'burgers' to mark it as the standalone item; I can only identify it by the immediate absence of ' V2' or ' legacy'.

How do I get an accurate count for the set of strings per row?

Answer:

You can use a regular expression that matches "burgers" together with an optional "legacy" or "V2" suffix, then tabulate the matches per row:

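import pandas as pd

# The optional group captures "legacy", "V2", or "" (bare "burgers"); explode() yields one row per match
tmp = df["Text"].str.findall(r"burgers\s*(legacy|V2)?").explode()
# Cross-tabulate the row index against the captured suffix and prefix each column with "burgers "
df = pd.concat([df, pd.crosstab(tmp.index, tmp).add_prefix("burgers ")], axis=1)
df.columns = df.columns.str.strip()  # "burgers " -> "burgers"
print(df)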

Prints:

                                                                                                                                                                    Text  burgers  burgers V2  burgers legacy
0  burgers V2 satisfies these requirment 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for  requirement 3; burgers answers requements 4        2           1               1
1             burgers  satisifes requirement 1; burgers is on the roadmap for requirement 2; burgers V2  satisfies requirement 3; burgers legacy satisfies requirement 4        2           1               1
2                                                     burgers legacy satisfies requirements 1; burgers V2 satisifies requirements 2; burgers V2 satisifes requiremetns 3        0           2               1
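
An alternative sketch, not part of the answer above, applies the rule from the question directly: count 'burgers' only when it is not immediately followed by ' V2' or ' legacy', using a negative lookahead with `str.count`. It assumes the suffix is always separated from 'burgers' by exactly one space, as in the sample rows, which are typed in by hand here:

```python
import re
import pandas as pd

# The three sample rows from the question, copied verbatim (typos included).
df = pd.DataFrame({"Text": [
    "burgers V2 satisfies these requirment 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for  requirement 3; burgers answers requements 4",
    "burgers  satisifes requirement 1; burgers is on the roadmap for requirement 2; burgers V2  satisfies requirement 3; burgers legacy satisfies requirement 4",
    "burgers legacy satisfies requirements 1; burgers V2 satisifies requirements 2; burgers V2 satisifes requiremetns 3",
]})

# The longer phrases are unambiguous, so a literal count is enough.
# re.escape keeps the phrase literal even if it contained regex metacharacters.
df["burgers legacy"] = df["Text"].str.count(re.escape("burgers legacy"))
df["burgers V2"] = df["Text"].str.count(re.escape("burgers V2"))

# Bare 'burgers': the negative lookahead rejects any match that is
# immediately followed by ' V2' or ' legacy'.
df["burgers"] = df["Text"].str.count(r"burgers(?! V2| legacy)")

print(df[["burgers", "burgers V2", "burgers legacy"]])
```

This keeps one `str.count` call per phrase, as in the question. If the spacing before the suffix can vary, the lookahead could be loosened to something like `(?!\s+(?:V2|legacy))`, which is likewise an assumption about the data rather than something the question states.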