I have an Azure Storage Table set up that contains lots of values with hyphens, apostrophes, and other bits of punctuation that the Azure indexers don't like. `Hyphenated-Word` gets broken into two tokens, `Hyphenated` and `Word`, upon indexing. This means that searching for `HyphenatedWord` will not yield any results, regardless of any wildcard or fuzzy-matching characters. That said, Azure Cognitive Search does support Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query `homework` should match both `homework` and `home-work`.
I know that if I were trying to do the opposite (match unhyphenated words even when a hyphen is provided in the query), I would use something like `/home(-)?work/`. However, I'm not sure what the inverse looks like, if such a thing even exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works, but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant), is to break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match between every character in the query. Using the `homework` example, the pattern would look something like `[-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?` ...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
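For reference, a minimal sketch of how that pattern could be generated dynamically (plain Python, no Azure SDK involved; the separator set is an assumption and should match whatever characters actually break tokens in your data):

```python
import re

# Characters that cause token breaks in the index (assumption: extend as needed).
SEPARATORS = "-'.! "

def separator_agnostic_pattern(query: str) -> str:
    """Build a regex that matches the query with an optional separator
    character allowed between (and around) every remaining character."""
    sep = "[" + re.escape(SEPARATORS) + "]?"
    chars = [re.escape(ch) for ch in query if ch not in SEPARATORS]
    return sep + sep.join(chars) + sep

print(separator_agnostic_pattern("homework"))
# roughly: [-'.! ]?h[-'.! ]?o[-'.! ]?m[-'.! ]?e[-'.! ]?w[-'.! ]?o[-'.! ]?r[-'.! ]?k[-'.! ]?
```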
CodePudding user response:
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same thing by creating a shadow copy of your table where the content is manipulated for indexing purposes: your original table stays intact, and you maintain a duplicate whose text has been processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
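In its simplest form that could be a small sanitizing step like the sketch below (plain Python; which characters you escape or strip is entirely a requirements question, so treat the character set as an assumption):

```python
# Characters with special meaning in Lucene query syntax (assumption: trim the
# set to whatever your chosen query syntax actually treats as special).
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')

def escape_specials(user_input: str) -> str:
    """Escape special characters so they are treated as literal text."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in user_input)

def strip_specials(user_input: str) -> str:
    """Alternative: drop special characters from the input entirely."""
    return "".join(ch for ch in user_input if ch not in SPECIAL_CHARS)
```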
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
- ABC-123-DE_F-4.56G
- ABC123-DE_F-4.56G
- ABC_123_DE_F_4_56G
- ABC.123.DE.F.4.56G
- ABC 123 DEF 56 G
- ABC123DEF56G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderingCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G, I would effectively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF56G
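A minimal sketch of what that content and query processing could look like (plain Python; the field name OrderingCodeVariation comes from the example above, while the normalization regex and the function names are assumptions):

```python
import re

def normalize(code: str) -> str:
    """Content processing: strip everything that isn't a letter or digit.
    The result is what gets indexed into the OrderingCodeVariation field."""
    return re.sub(r"[^A-Za-z0-9]", "", code)

def build_query(user_input: str) -> str:
    """Query processing: combine the verbatim (cleaned) input with the
    normalized variation in an OR query."""
    cleaned = user_input.strip()
    return f"{cleaned} OR OrderingCodeVariation:{normalize(cleaned)}"

# e.g. build_query("ABC.123.DE.F.4.56G")
#   -> "ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF456G"
```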
CodePudding user response:
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://docs.microsoft.com/azure/search/index-add-custom-analyzers
CodePudding user response:
To add to Jennifer's answer, you could consider using a custom analyzer that includes either of these token filters:
- pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
- pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with a character of your choosing, or with an empty string to remove them entirely.
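For illustration, here is a sketch of what such a custom analyzer could look like as a fragment of the index definition, expressed as a Python dict that mirrors the REST API's JSON schema (the analyzer and filter names are made up, and the exact schema should be verified against the docs linked above):

```python
# Hypothetical index-definition fragment: a custom analyzer that keeps the whole
# value as one token (keyword_v2) and then strips separator characters with a
# pattern_replace token filter, so "home-work" and "homework" index identically.
index_definition_fragment = {
    "tokenFilters": [
        {
            "name": "strip_separators",  # made-up name
            "@odata.type": "#Microsoft.Azure.Search.PatternReplaceTokenFilter",
            "pattern": "[-'.! ]",
            "replacement": "",
        }
    ],
    "analyzers": [
        {
            "name": "separator_insensitive",  # made-up name
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "tokenizer": "keyword_v2",  # emit the field value as a single token
            "tokenFilters": ["lowercase", "strip_separators"],
        }
    ],
}

# A field would then opt in with:  "analyzer": "separator_insensitive"
```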