Go - regex inside loop

Time:11-06

I have a file with a list of 600 regex patterns that must be run in order to find a specific ID for a website.

Example:

regex/www\.effectiveperformanceformat\.com/5
regex/bam-cell\.nr-data\.net/5
regex/advgoogle\.com/5
regex/googleapi\.club/5
regex/doubleclickbygoogle\.com/5
regex/googlerank\.info/5
regex/google-pr7\.de/5
regex/usemarketings\.com/5
regex/google-rank\.org/5
regex/googleanalytcs\.com/5
regex/xml\.trafficmoose\.com/5
regex/265\.com/5
regex/app-measurement\.com/5
regex/loftsbaacad\.com/5
regex/toldmeflex\.com/5
regex/r\.baresi\.xyz/5
regex/molodgytot\.biz/5
regex/ec\.walkme\.com/5
regex/px\.ads\.linkedin\.com/5
regex/hinisanex\.biz/5
regex/buysellads\.com/5
regex/buysellads\.net/5
regex/servedby-buysellads\.com/5
regex/carbonads\.(net|com)/5
regex/oulddev\.biz/5
regex/click\.hoolig\.app/5
regex/engine\.blacraft\.com/5
regex/mc\.yandex\.ru/5
regex/ads\.gaming1\.com/5
regex/adform\.net/5
regex/luzulabeguile\.com/5
regex/ficanportio\.biz/5
regex/hidelen\.com/5
regex/earchmess\.fun/5
regex/acrvclk\.com/5
regex/track\.wg-aff\.com/5
regex/thumb\.tapecontent\.net/5
regex/betgorebysson\.club/5
regex/in-page-push\.com/5
regex/itphanpytor\.club/5
regex/mktoresp\.com/5
regex/xid\.i-mobile\.co\.jp/5
regex/ads\.tremorhub\.com/5

So far, I'm using something like this:

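for _, line := range file {
    data := strings.Split(line, "/")
    // Only handle well-formed "regex/<pattern>/<id>" lines.
    if len(data) == 3 && data[0] == "regex" {
        match, _ := regexp.MatchString(data[1], website)
        if match {
            id, _ = strconv.Atoi(data[2])
            break // stop at the first matching pattern
        }
    }
}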

This works, but I wonder if there is a more optimized way to do it. If the website matches one of the regexes at the top, great; if not, I have to keep iterating through the loop until I find a match.

Can anyone help me improve this?

Best regards

CodePudding user response:

To reduce the time, you can cache the compiled regexps.

package main

import (
    "bufio"
    "bytes"
    "fmt"
    csvutils "github.com/alessiosavi/GoGPUtils/csv"
    "log"
    "os"
    "regexp"
    "strconv"
    "strings"
    "time"
)

func main() {
    now := time.Now()
    Precomputed("www.google.it")
    fmt.Println(time.Since(now))
    now = time.Now()
    NonPrecomputed("www.google.it")
    fmt.Println(time.Since(now))
}
func NonPrecomputed(website string) int {
    for _, line := range cachedLines {
        l := line
        data := strings.Split(l, "/")
        if data[0] == "regex" {
            match, _ := regexp.MatchString(data[1], website)
            if match {
                id, _ := strconv.Atoi(data[2])
                return id
            }
        }
    }

    return -1
}
func Precomputed(site string) int {
    for regex, id := range rawRegex {
        if ok := regex.MatchString(site); ok {
            return id
        }
    }
    return -1
}

var rawRegex map[*regexp.Regexp]int = make(map[*regexp.Regexp]int)
var cachedLines []string
var sites []string

func init() {
    now := time.Now()
    file, err := os.ReadFile("regex.txt")
    if err != nil {
        panic(err)
    }

    scanner := bufio.NewScanner(bytes.NewReader(file))

    for scanner.Scan() {
        txt := scanner.Text()
        cachedLines = append(cachedLines, txt)
        split := strings.Split(txt, "/")
        if len(split) == 3 {
            compile, err := regexp.Compile(split[1])
            if err != nil {
                panic(err)
            }
            if rawRegex[compile], err = strconv.Atoi(split[2]); err != nil {
                panic(err)
            }
        }
    }
    file, err = os.ReadFile("top500Domains.csv")
    if err != nil {
        panic(err)
    }
    _, csvData, err := csvutils.ReadCSV(file, ',')
    if err != nil {
        panic(err)
    }
    for _, line := range csvData {
        sites = append(sites, line[1])
    }
    log.Println("Init took:", time.Since(now))
}

The init function takes care of caching the regexps. It compiles every pattern once and stores it in a map, with the compiled regexp as the key and its ID as the value (it also loads the test data, which is only needed for the benchmark).

Then you have two functions:

  • Precomputed: uses the map of cached regexps
  • NonPrecomputed: a copy-paste of your snippet
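One caveat: Go does not guarantee any iteration order for maps, so Precomputed may not try the patterns in file order. If the order matters, as the question suggests, a slice keeps it. The following is only a minimal sketch of that variant; the names rule, parseRules and lookup are hypothetical, the ID 6 is made up for illustration, and it assumes patterns never contain a "/".

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// rule pairs a compiled pattern with its ID (hypothetical type).
type rule struct {
	re *regexp.Regexp
	id int
}

// parseRules compiles every "regex/<pattern>/<id>" line once and keeps
// the rules in the order they appear. Other lines are skipped.
func parseRules(text string) ([]rule, error) {
	var rules []rule
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		parts := strings.Split(scanner.Text(), "/")
		if len(parts) != 3 || parts[0] != "regex" {
			continue
		}
		re, err := regexp.Compile(parts[1])
		if err != nil {
			return nil, err
		}
		id, err := strconv.Atoi(parts[2])
		if err != nil {
			return nil, err
		}
		rules = append(rules, rule{re: re, id: id})
	}
	return rules, scanner.Err()
}

// lookup returns the ID of the first rule that matches, or -1.
func lookup(rules []rule, website string) int {
	for _, r := range rules {
		if r.re.MatchString(website) {
			return r.id
		}
	}
	return -1
}

// sample holds two overlapping patterns; the ID 6 is invented for the demo.
const sample = `regex/buysellads\.com/5
regex/servedby-buysellads\.com/6
`

func main() {
	rules, err := parseRules(sample)
	if err != nil {
		panic(err)
	}
	// Both patterns match this host; the slice guarantees the first one wins.
	fmt.Println(lookup(rules, "servedby-buysellads.com")) // prints 5
}
```

With a map, the same lookup could return either 5 or 6 from one run to the next, because both patterns match the host.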

In the benchmark run below (-benchtime=10s), NonPrecomputed completes 63 iterations at about 298 ms each, while Precomputed completes 10000 at about 1.1 ms each. NonPrecomputed also allocates about 66 MB (65,782,238 B) per iteration, whereas Precomputed allocates nothing, because the regexps are compiled once up front.

C:\opt\SP\Workspace\Go\Temp>go test -bench=. -benchmem -benchtime=10s
2022/11/03 00:45:35 Init took: 10.8397ms
goos: windows
goarch: amd64
pkg: Temp
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Benchmark_Precomputed-8            10000           1113887 ns/op               0 B/op          0 allocs/op
Benchmark_NonPrecomputed-8            63         298434740 ns/op        65782238 B/op     484595 allocs/op
PASS
ok      Temp    41.548s
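The benchmark functions themselves are not shown above. As a rough idea of how such a comparison can be set up, here is a self-contained sketch that uses testing.Benchmark from a normal program instead of go test. It is not the code that produced the numbers above: the two-pattern list and the functions matchCompiled and matchUncompiled are hypothetical stand-ins for the 600-line file, Precomputed and NonPrecomputed, and they return the pattern index rather than an ID.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

// patterns is a tiny stand-in for the real file (two entries from the question).
var patterns = []string{`carbonads\.(net|com)`, `mc\.yandex\.ru`}

// compiled holds every pattern compiled once, in the same order.
var compiled = compileAll(patterns)

func compileAll(ps []string) []*regexp.Regexp {
	out := make([]*regexp.Regexp, 0, len(ps))
	for _, p := range ps {
		out = append(out, regexp.MustCompile(p))
	}
	return out
}

// matchUncompiled recompiles each pattern on every call, like NonPrecomputed.
func matchUncompiled(site string) int {
	for i, p := range patterns {
		if ok, _ := regexp.MatchString(p, site); ok {
			return i
		}
	}
	return -1
}

// matchCompiled reuses the compiled patterns, like Precomputed.
func matchCompiled(site string) int {
	for i, re := range compiled {
		if re.MatchString(site) {
			return i
		}
	}
	return -1
}

func main() {
	// Register the testing flags, since this runs outside "go test".
	testing.Init()

	cases := []struct {
		name string
		fn   func(string) int
	}{
		{"compiled", matchCompiled},
		{"uncompiled", matchUncompiled},
	}
	for _, c := range cases {
		res := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				c.fn("www.google.it") // worst case: no pattern matches
			}
		})
		fmt.Printf("%-11s %s %s\n", c.name, res.String(), res.MemString())
	}
}
```

The absolute numbers will differ from the run above, since this uses two patterns instead of 600, but the gap between the two variants should point in the same direction.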