How can I clean the text for search using RegEx

Time:08-30

I can use the code below to check whether the text str contains either or both of the keys, i.e. whether it contains "MS", "dynamics", or both:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    keys := []string{"MS", "dynamics"}
    keysReg := fmt.Sprintf("(%s %s)|%s|%s", keys[0], keys[1], keys[0], keys[1]) // => "(MS dynamics)|MS|dynamics"
    fmt.Println(keysReg)
    str := "What is MS dynamics, is it a product from MS?"
    re := regexp.MustCompile(`(?i)` + keysReg)
    matches := re.FindAllString(str, -1)
    fmt.Println("We found", len(matches), "matches, that are:", matches)
}

I want the user to enter a phrase, from which I strip unwanted words and characters before searching as above. Say the user input is This,is,a,delimited,string: I need to build the keys pattern dynamically as (delimited string)|delimited|string so that I can search str for all the matches. So I wrote the following:

    s := "This,is,a,delimited,string"
    t := regexp.MustCompile(`(?i),|\.|this|is|a`) // backticks delimit the raw-string expression; (?i) makes it case-insensitive
    v := t.Split(s, -1)
    fmt.Println(len(v))
    fmt.Println(v)

But I got the output as:

8
[      delimited string]

What is wrong with the way I clean the input text? I'm expecting the output to be:

2
[delimited string]

Here is my playground

CodePudding user response:

To quote the famous quip from Jamie Zawinski,

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Two things:

  • Rather than trying to weed garbage out of the string ("cleaning" it), extract the complete words from it.
  • Unicode is a complicated matter, so even after you have extracted the words, you have to make sure they are properly escaped and contain no characters that might be interpreted as RE syntax before you build a regexp from them.

package main

import (
    "errors"
    "fmt"
    "regexp"
    "strings"
)

// build returns a regexp of the form "(w1 w2 ...)|w1|w2|...", with every word escaped.
func build(words ...string) (*regexp.Regexp, error) {
    var sb strings.Builder

    switch len(words) {
    case 0:
        return nil, errors.New("empty input")
    case 1:
        return regexp.Compile(regexp.QuoteMeta(words[0]))
    }

    quoted := make([]string, len(words))
    for i, w := range words {
        quoted[i] = regexp.QuoteMeta(w)
    }

    sb.WriteByte('(')
    for i, w := range quoted {
        if i > 0 {
            sb.WriteByte('\x20')
        }
        sb.WriteString(w)
    }
    sb.WriteString(`)|`)
    for i, w := range quoted {
        if i > 0 {
            sb.WriteByte('|')
        }
        sb.WriteString(w)
    }

    return regexp.Compile(sb.String())
}

var words = regexp.MustCompile(`\pL+`) // one or more Unicode letters

func main() {
    allWords := words.FindAllString("\tThis\v\x20\x20,\t\tis\t\t,?a!,¿delimited?,string‽", -1)

    re, err := build(allWords...)
    if err != nil {
        panic(err)
    }

    fmt.Println(re)
}
