Following on from this question, I want a per-row count of specific words or phrases in a DataFrame, where the strings are extremely similar and one of them is a substring of the others that must still be counted as its own unique item. When I expanded and diversified the set of strings to search for, I ran into trouble and could not generalize the earlier solution. The set of strings I'm looking for is as follows:
Set of Strings
'burgers',
'burgers.extreme',
'burgers legacy',
'burgers 2',
'burgers V2',
'greasy burgers',
'ultimate burgers',
'chewy burgers',
'rotten burgers'
Expected Output
Additional columns, one per search string, appended to the DataFrame with the strings as the new, respective column names. Each column holds the count of that string per row.
Data: DataFrame column with "Text" as the column name:
Text
burgers V2 satisfies these requirements 1; burgers is sufficient for requirements 2; burgers legacy is on the road map for requirement 3; burgers answers requirements 4
burgers satisfies requirement 1; burgers is on the roadmap for requirement 2; burgers V2 satisfies requirement 3; burgers legacy satisfies requirement 4
burgers legacy satisfies requirements 1; burgers V2 satisfies requirements 2; burgers V2 satisfies requirements 3
greasy burgers do not meet requirements 1; burgers satisfies requirements 2; burgers.extreme meets requirement 3; ultimate burgers satisfies requirements 4
chewy burgers satisfies requirements 1; rotten burgers satisfies requirements 2; burgers legacy satisfies requirement 3; burgers legacy satisfies requirements 4
burgers 2 satisfies requirements 1; chewy burgers satisfies requirements 2; greasy burgers satisfies requirements 3; greasy burgers satisfies requirements 4
Issues to Note
- As in the other question linked above, the obvious problem is that the string 'burgers' is a subset of the other strings. So doing something like
df['burgers'] = df['Text'].str.count('burgers')
results in over-counting 'burgers'.
- To delineate and count 'burgers' as its own entity, I believe I have to rely on what is (or is not) immediately surrounding that string. That is, if 'burgers' is not immediately preceded by 'greasy ', 'ultimate ', 'chewy ', or 'rotten ', and not immediately followed by '.extreme', ' legacy', ' 2', or ' V2', then it should be considered and counted as a unique item in that row.
- All the above prefixes and suffixes indicating unique items contain a space (whitespace) except '.extreme', indicating that the item 'burgers.extreme' is its own unique item and does not have a space like the others.
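As an aside, the "not immediately preceded / not immediately followed" rule above can be written directly as a regex with negative lookbehinds and lookaheads. This is only a minimal sketch of that idea (not the approach taken in the answer below), counting just the bare 'burgers':

```python
import pandas as pd

# Negative lookbehinds reject prefixed matches ('greasy burgers', ...);
# negative lookaheads reject suffixed ones ('burgers legacy', 'burgers.extreme', ...).
bare_pattern = (
    r"(?<!greasy )(?<!ultimate )(?<!chewy )(?<!rotten )"
    r"burgers(?!\.extreme)(?! legacy)(?! 2)(?! V2)"
)

df = pd.DataFrame({"Text": [
    "greasy burgers do not meet requirements 1; "
    "burgers satisfies requirements 2; "
    "burgers.extreme meets requirement 3; "
    "ultimate burgers satisfies requirements 4",
]})

# Only the second mention is a bare 'burgers'.
df["burgers"] = df["Text"].str.count(bare_pattern)
print(df["burgers"].iloc[0])  # 1
```

Note that Python's `re` requires fixed-width lookbehinds, which is why each prefix gets its own `(?<!...)` group rather than one alternation.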
What I attempted
Using Andrej's wonderful answer as a starting point, I attempted to modify it to a broader category of strings, but have been running into trouble with the regex and generalizing it. Such as trying:
tmp = df["Text"].str.findall(r"(greasy|ultimate|chewy|rotten)?\s*burgers\s*(legacy|V2|2)?").explode()
df = pd.concat([df, pd.crosstab(tmp.index, tmp).add_prefix("burgers ")], axis=1)
df.columns = map(str.strip, df.columns)
but adding (greasy|ultimate|chewy|rotten)?\s* returns TypeError: Cannot broadcast np.ndarray with operand of type <class 'list'> (presumably because, with two capture groups, str.findall returns tuples rather than plain strings). Nor does it make sense to label the columns with .add_prefix("burgers "), since there is now text being matched on either side of 'burgers'. It also omits 'burgers.extreme' entirely.
I thought that perhaps doing it piece-wise might be the way to go, something like
tmp1 = df["Text"].str.findall(r"burgers\s*(legacy|V2|2)?").explode()
tmp2 = df["Text"].str.findall(r"(greasy|ultimate|chewy|rotten)?\s*burgers").explode()
tmp3 = df["Text"].str.findall(r"burgers.extreme").explode()
but I'm not able to merge or concatenate the three in a way that lets the crosstab work; it returns a dimensionality error. It also seems like a very inefficient approach.
What can I do to resolve this? Or, more generally, what approach allows searching for and counting strings per row when one of the search strings is a substring of the others?
CodePudding user response:
Regular expression "or" alternations will short-circuit: the first alternative that matches wins. Thus, as long as you put burgers last, you will not experience double matching.
We can do this as follows:
targets = [
"burgers.extreme",
"burgers legacy",
"burgers 2",
"burgers V2",
"greasy burgers",
"ultimate burgers",
"chewy burgers",
"rotten burgers",
"burgers",
]
pattern = "|".join(f"({item})" for item in targets)
# (burgers.extreme)|(burgers legacy)|(burgers 2)|(burgers V2)|(greasy burgers)|(ultimate burgers)|(chewy burgers)|(rotten burgers)|(burgers)
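One caveat worth noting (my observation, not part of the answer itself): the '.' in burgers.extreme is a regex metacharacter, so the unescaped pattern would also classify a token like 'burgersXextreme' as burgers.extreme. If that matters for your data, escaping each target avoids it:

```python
import re

targets = ["burgers.extreme", "burgers legacy", "burgers"]

unescaped = "|".join(f"({item})" for item in targets)
escaped = "|".join(f"({re.escape(item)})" for item in targets)

# With the raw '.', the first alternative swallows the whole token:
print(re.search(unescaped, "burgersXextreme").group(0))  # burgersXextreme
# With a literal '\.', it falls through to the bare 'burgers' alternative:
print(re.search(escaped, "burgersXextreme").group(0))  # burgers
```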
and we can build up our answer as follows:
df[targets] = (
df["Text"]
.str.extractall(pattern)
.droplevel(axis=0, level=1)
.reset_index()
.groupby("index")
.count()
)
Which gives:
Text burgers.extreme burgers legacy burgers 2 burgers V2 greasy burgers ultimate burgers chewy burgers rotten burgers burgers
0 burgers V2 satisfies these requirements 1; bur... 0 1 0 1 0 0 0 0 2
1 burgers satisfies requirement 1; burgers is o... 0 1 0 1 0 0 0 0 2
2 burgers legacy satisfies requirements 1; burge... 0 1 0 2 0 0 0 0 0
3 greasy burgers do not meet requirements 1; bur... 1 0 0 0 1 1 0 0 1
4 chewy burgers satisfies requirements 1; rotten... 0 2 0 0 0 0 1 1 0
5 burgers 2 satisfies requirements 1; chewy burg... 0 0 1 0 2 0 1 0 0
So what's going on here:
- extractall adds a new row for each match (notably adding a multi-index where the first level represents the row index in the original df, and the second represents the index of the match). Each such row is all NaN except for the matching value.
- As mentioned, extractall adds an index level for the match number. We don't need this; we only care about which row of the original df the match was in, so we drop the match level using .droplevel(axis=0, level=1).
- To run our groupby, we reset the index.
- Now we group by index, which is the index from the original dataframe.
- Finally we count the number of occurrences of each match for each index in the original dataframe.
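Put together, the steps above can be run end-to-end on a small sample. A minimal sketch of the same pipeline (note it assumes every row has at least one match; a row with none would get NaN rather than 0 on assignment):

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "burgers legacy satisfies requirement 1; burgers answers requirement 2",
    "greasy burgers fail requirement 1; burgers.extreme meets requirement 2",
]})

targets = ["burgers.extreme", "burgers legacy", "greasy burgers", "burgers"]
pattern = "|".join(f"({item})" for item in targets)

counts = (
    df["Text"]
    .str.extractall(pattern)     # one row per match, MultiIndex (row, match number)
    .droplevel(axis=0, level=1)  # keep only the original row index
    .reset_index()
    .groupby("index")            # regroup matches by original row
    .count()                     # non-NaN count per column = matches per target
)
counts.columns = targets         # rename 0..n group columns to the target strings
df[targets] = counts
```

Renaming the count columns before assignment keeps the column alignment explicit rather than relying on positional order.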
And that gives you your result.
Important caveat
Regex will also find the earliest match. This means that, regardless of the ordering of your targets, if your text contains greasy burgers V2, the regular expression will find greasy burgers, not burgers and not burgers V2.
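A quick sketch of that behavior:

```python
import re

targets = ["burgers.extreme", "burgers legacy", "burgers 2",
           "burgers V2", "greasy burgers", "ultimate burgers",
           "chewy burgers", "rotten burgers", "burgers"]
pattern = "|".join(f"({item})" for item in targets)

# The left-to-right scan reaches 'greasy' first, so the 'greasy burgers'
# alternative wins; the trailing ' V2' is left unmatched.
match = re.search(pattern, "greasy burgers V2")
print(match.group(0))  # greasy burgers
```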