I am attempting to count specific words or phrases per row in a DataFrame, but the strings overlap and are not strictly unique. Consider this set of strings, each of which I want counted per row:
Set of Strings
'burgers','burgers legacy','burgers V2'
Now consider these lines of data as a DataFrame column with "Text" as the column name:
Text
burgers V2 satisfies these requirment 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requements 4
burgers satisifes requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4
burgers legacy satisfies requirements 1; burgers V2 satisifies requirements 2; burgers V2 satisifes requiremetns 3
The problem becomes pretty obvious: 'burgers' is a substring of the other two items. So doing something like this results in over-counting 'burgers':
df['burgers'] = df['Text'].str.count('burgers')
df['burgers legacy'] = df['Text'].str.count('burgers legacy')
df['burgers V2'] = df['Text'].str.count('burgers V2')
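To make the over-counting concrete, here is a minimal sketch that runs the naive count on the first row of the sample data only. The variable names `row` and `naive` are just for illustration.

```python
import pandas as pd

# First row of the sample data, copied verbatim (typos included).
row = (
    "burgers V2 satisfies these requirment 1; "
    "burgers is sufficient for requirements 2; "
    "burgers legacy is on the road map for requirement 3; "
    "burgers answers requements 4"
)
df = pd.DataFrame({"Text": [row]})

# str.count matches 'burgers' inside 'burgers V2' and 'burgers legacy' too.
naive = df["Text"].str.count("burgers")
print(naive.iloc[0])  # 4, although only 2 occurrences are plain 'burgers'
```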
A few things of note, which are likely important. If 'burgers' appears without the immediately following characters (including the white space) ' V2' or ' legacy', then it represents something to be counted on its own. However, I cannot rely on a consistent trailing character or word after 'burgers' to delineate its uniqueness; I can only rely on the immediate absence of either ' V2' or ' legacy'.
How do I get an accurate count for the set of strings per row?
CodePudding user response:
You can try using a regular expression to capture the optional suffix that follows "burgers" and then tabulate the matches per row:
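import pandas as pd

# findall returns the captured suffix for each match: "", "legacy", or "V2"
tmp = df["Text"].str.findall(r"burgers\s*(legacy|V2)?").explode()
# count suffixes per row, then prefix them to rebuild the phrase names
df = pd.concat([df, pd.crosstab(tmp.index, tmp).add_prefix("burgers ")], axis=1)
df.columns = df.columns.str.strip()
print(df)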
Prints:
Text burgers burgers V2 burgers legacy
0 burgers V2 satisfies these requirment 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requements 4 2 1 1
1 burgers satisifes requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4 2 1 1
2 burgers legacy satisfies requirements 1; burgers V2 satisifies requirements 2; burgers V2 satisifes requiremetns 3 0 2 1
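An alternative that stays closer to the original `str.count` attempt is a negative lookahead, which encodes the rule from the question directly: count 'burgers' only when it is not immediately followed by ' V2' or ' legacy'. This is a sketch under the assumption that the suffixes always follow after exactly one space, as in the sample data. The `patterns` dict is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "burgers V2 satisfies these requirment 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requements 4",
    "burgers satisifes requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4",
    "burgers legacy satisfies requirements 1; burgers V2 satisifies requirements 2; burgers V2 satisifes requiremetns 3",
]})

# Column name -> regex. The lookahead (?! V2| legacy) rejects a match of
# 'burgers' that is immediately followed by one of the longer phrases' suffixes.
patterns = {
    "burgers": r"burgers(?! V2| legacy)",
    "burgers legacy": r"burgers legacy",
    "burgers V2": r"burgers V2",
}

for name, pattern in patterns.items():
    # Series.str.count treats its argument as a regular expression.
    df[name] = df["Text"].str.count(pattern)

print(df[list(patterns)])
```

This gives the same counts as the crosstab approach on the sample rows, and it keeps one explicit pattern per phrase, which may be easier to extend if more suffixes appear later.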