I have a file with a list of 600 regex patterns that must be checked in order to find a specific id for a website.
Example:
regex/www\.effectiveperformanceformat\.com/5
regex/bam-cell\.nr-data\.net/5
regex/advgoogle\.com/5
regex/googleapi\.club/5
regex/doubleclickbygoogle\.com/5
regex/googlerank\.info/5
regex/google-pr7\.de/5
regex/usemarketings\.com/5
regex/google-rank\.org/5
regex/googleanalytcs\.com/5
regex/xml\.trafficmoose\.com/5
regex/265\.com/5
regex/app-measurement\.com/5
regex/loftsbaacad\.com/5
regex/toldmeflex\.com/5
regex/r\.baresi\.xyz/5
regex/molodgytot\.biz/5
regex/ec\.walkme\.com/5
regex/px\.ads\.linkedin\.com/5
regex/hinisanex\.biz/5
regex/buysellads\.com/5
regex/buysellads\.net/5
regex/servedby-buysellads\.com/5
regex/carbonads\.(net|com)/5
regex/oulddev\.biz/5
regex/click\.hoolig\.app/5
regex/engine\.blacraft\.com/5
regex/mc\.yandex\.ru/5
regex/ads\.gaming1\.com/5
regex/adform\.net/5
regex/luzulabeguile\.com/5
regex/ficanportio\.biz/5
regex/hidelen\.com/5
regex/earchmess\.fun/5
regex/acrvclk\.com/5
regex/track\.wg-aff\.com/5
regex/thumb\.tapecontent\.net/5
regex/betgorebysson\.club/5
regex/in-page-push\.com/5
regex/itphanpytor\.club/5
regex/mktoresp\.com/5
regex/xid\.i-mobile\.co\.jp/5
regex/ads\.tremorhub\.com/5
So far, what I'm using is something like this:
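for _, line := range file {
	// Each line looks like "regex/<pattern>/<id>".
	data := strings.Split(line, "/")
	if data[0] == "regex" {
		// The pattern is recompiled on every call to regexp.MatchString.
		match, _ := regexp.MatchString(data[1], website)
		if match {
			id, _ = strconv.Atoi(data[2])
		}
	}
}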
This works, but I wonder if there is a more optimized way to do it. If the website matches a regex near the top, great, but if not, I have to keep iterating through the loop until I find it.
Can anyone help me improve this?
Best regards
CodePudding user response:
To reduce the time, you can compile each regexp once and cache the compiled result.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	csvutils "github.com/alessiosavi/GoGPUtils/csv"
	"log"
	"os"
	"regexp"
	"strconv"
	"strings"
	"time"
)

func main() {
	// Time a single lookup with the cached (precompiled) patterns.
	now := time.Now()
	Precomputed("www.google.it")
	fmt.Println(time.Since(now))

	// Time a single lookup that recompiles every pattern.
	now = time.Now()
	NonPrecomputed("www.google.it")
	fmt.Println(time.Since(now))
}

// NonPrecomputed mirrors the snippet from the question: every pattern is
// recompiled on each call through regexp.MatchString.
func NonPrecomputed(website string) int {
	for _, line := range cachedLines {
		data := strings.Split(line, "/")
		if data[0] == "regex" {
			match, _ := regexp.MatchString(data[1], website)
			if match {
				id, _ := strconv.Atoi(data[2])
				return id
			}
		}
	}
	return -1
}
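
// Precomputed uses the patterns compiled once in init.
// Note: Go does not specify map iteration order, so the patterns are not
// necessarily tried in the order they appear in the file.
func Precomputed(site string) int {
	for regex, id := range rawRegex {
		if ok := regex.MatchString(site); ok {
			return id
		}
	}
	return -1
}

// rawRegex maps each compiled pattern to its id.
var rawRegex = make(map[*regexp.Regexp]int)

// cachedLines holds the raw lines of regex.txt for NonPrecomputed.
var cachedLines []string

// sites holds the test domains used by the benchmark.
var sites []string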

func init() {
	now := time.Now()

	// Load the pattern file and compile every pattern once.
	file, err := os.ReadFile("regex.txt")
	if err != nil {
		panic(err)
	}
	scanner := bufio.NewScanner(bytes.NewReader(file))
	for scanner.Scan() {
		txt := scanner.Text()
		cachedLines = append(cachedLines, txt)
		split := strings.Split(txt, "/")
		if len(split) == 3 {
			compile, err := regexp.Compile(split[1])
			if err != nil {
				panic(err)
			}
			if rawRegex[compile], err = strconv.Atoi(split[2]); err != nil {
				panic(err)
			}
		}
	}

	// Load the test domains (only needed for the benchmark).
	file, err = os.ReadFile("top500Domains.csv")
	if err != nil {
		panic(err)
	}
	_, csvData, err := csvutils.ReadCSV(file, ',')
	if err != nil {
		panic(err)
	}
	for _, line := range csvData {
		sites = append(sites, line[1])
	}
	log.Println("Init took:", time.Since(now))
}
The init function takes care of the regexp cache. It compiles every regexp and stores it in a map together with its id (it also loads the test data, but only for the benchmark).
Then there are two functions:
Precomputed: uses the map of cached regexps
NonPrecomputed: a copy-paste of your snippet
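One caveat: Go does not specify the iteration order of a map, so a map-based lookup may not try the patterns in file order. If the order matters, a slice of compiled patterns keeps it. The following is only a minimal sketch under that assumption; the names rule, compileRules and lookup, the sample lines, and the id 7 are made up for illustration and are not part of the code above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// rule pairs a compiled pattern with its id (hypothetical type).
type rule struct {
	re *regexp.Regexp
	id int
}

// compileRules parses lines of the form "regex/<pattern>/<id>" and compiles
// each pattern once, keeping the rules in the order of the input lines.
func compileRules(lines []string) ([]rule, error) {
	rules := make([]rule, 0, len(lines))
	for _, line := range lines {
		parts := strings.Split(line, "/")
		if len(parts) != 3 || parts[0] != "regex" {
			continue // skip lines that are not in the expected format
		}
		re, err := regexp.Compile(parts[1])
		if err != nil {
			return nil, fmt.Errorf("compile %q: %w", parts[1], err)
		}
		id, err := strconv.Atoi(parts[2])
		if err != nil {
			return nil, fmt.Errorf("parse id %q: %w", parts[2], err)
		}
		rules = append(rules, rule{re: re, id: id})
	}
	return rules, nil
}

// lookup returns the id of the first rule that matches website, or -1.
func lookup(rules []rule, website string) int {
	for _, r := range rules {
		if r.re.MatchString(website) {
			return r.id
		}
	}
	return -1
}

func main() {
	// Sample lines for illustration only.
	rules, err := compileRules([]string{
		`regex/carbonads\.(net|com)/5`,
		`regex/mc\.yandex\.ru/7`,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(lookup(rules, "cdn.carbonads.com")) // prints 5
	fmt.Println(lookup(rules, "www.google.it"))     // prints -1
}
```

Because the slice is walked from the first element, the first matching line in the file wins, which is the behaviour the question describes.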
In the benchmark below, run with a benchmark time of 10 s, NonPrecomputed completes only 63 iterations (about 298 ms each), while Precomputed completes 10000 iterations (about 1.1 ms each).
NonPrecomputed also allocates about 66 MB per operation, whereas Precomputed performs no allocations, because the patterns were compiled once up front.
C:\opt\SP\Workspace\Go\Temp>go test -bench=. -benchmem -benchtime=10s
2022/11/03 00:45:35 Init took: 10.8397ms
goos: windows
goarch: amd64
pkg: Temp
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Benchmark_Precomputed-8 10000 1113887 ns/op 0 B/op 0 allocs/op
Benchmark_NonPrecomputed-8 63 298434740 ns/op 65782238 B/op 484595 allocs/op
PASS
ok Temp 41.548s
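The benchmark functions behind this output are not shown in the answer. As a rough, self-contained way to reproduce the comparison without a _test.go file, something like the following could be used. It is only a sketch: the pattern list, the site value, and the names matchUncached and matchCached are assumptions for illustration, and the numbers it prints will differ from the ones above.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

// patterns and site are made-up sample data for illustration.
var patterns = []string{`advgoogle\.com`, `carbonads\.(net|com)`, `mc\.yandex\.ru`}

const site = "www.google.it"

// matchUncached recompiles every pattern on each call, because
// regexp.MatchString compiles its pattern argument every time.
func matchUncached(s string) bool {
	for _, p := range patterns {
		if ok, _ := regexp.MatchString(p, s); ok {
			return true
		}
	}
	return false
}

// compiled holds the patterns compiled once at program start.
var compiled = func() []*regexp.Regexp {
	out := make([]*regexp.Regexp, len(patterns))
	for i, p := range patterns {
		out[i] = regexp.MustCompile(p)
	}
	return out
}()

// matchCached reuses the precompiled patterns.
func matchCached(s string) bool {
	for _, re := range compiled {
		if re.MatchString(s) {
			return true
		}
	}
	return false
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	uncached := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			matchUncached(site)
		}
	})
	cached := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			matchCached(site)
		}
	})
	fmt.Println("uncached:", uncached, uncached.MemString())
	fmt.Println("cached:  ", cached, cached.MemString())
}
```

With only three patterns the gap is smaller than with 600, but the uncached variant should still show allocations on every operation, since compilation happens inside the loop.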