Match and extract href info using regex

Time:10-08

I am trying to write a regex in Swift that matches an href link and extracts its information in more than one case: with double quotation marks, with single quotation marks, and with no quotation marks at all.

A regex to match href and extract info <a href=https://www.google.com>Google</a>.
<a href="https://www.google.com">Google</a> 
<a href='https://www.google.com'>Google</a>

I have found this regex, but it only works with double quotation marks:

<a href="([^"]+)">([^<]+)<\/a>

Result:

Match 1: <a href="https://www.google.com">Google</a>
Group 1: https://www.google.com
Group 2: Google

What I want is to detect all three of the forms shown in the sample text.

Note: I know that regex shouldn't be used for parsing HTML, but I am using it for a very small use case so it's fine.

CodePudding user response:

The answer is already in the comments, but I am posting this because the approach is a bit different.

In Swift 5.7 and iOS 16 you can use RegexBuilder for this.

import RegexBuilder


// The three sample inputs: unquoted, double-quoted, and single-quoted href values.
let link1 = "A regex to match href and extract info <a href=https://www.google.com>Google</a>."
let link2 = "<a href=\"https://www.google.com\">Google</a>"
let link3 = "<a href='https://www.google.com'>Google</a>"

let regex = Regex {
    Capture {
        "https://www."
        ZeroOrMore(.word)
        "."
        ZeroOrMore(.word)
    }

}

if let result1 = try? regex.firstMatch(in: link1) {
    print("link: \(result1.output.1)")
}

if let result2 = try? regex.firstMatch(in: link2) {
    print("link: \(result2.output.1)")
}

if let result3 = try? regex.firstMatch(in: link3) {
    print("link: \(result3.output.1)")
}

This works well for the three strings above. Note that the pattern only matches URLs that begin with https://www., so depending on your scenario you might need to change the implementation.
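As one possible way to adapt the RegexBuilder approach, here is a minimal sketch that captures whatever sits in the href value along with the link text, instead of hard-coding https://www. The name `anchor` and the character sets are assumptions made for illustration, not part of the original answer. This version accepts an optional quote on each side but does not check that the two quotes are the same.

```swift
import RegexBuilder

// Hypothetical pattern: optional quote, URL, optional quote, ">", link text, "</a>".
let anchor = Regex {
    "<a href="
    Optionally(CharacterClass.anyOf("'\""))
    Capture {
        // URL: one or more characters that are not a quote or ">"
        OneOrMore(CharacterClass.anyOf("'\">").inverted)
    }
    Optionally(CharacterClass.anyOf("'\""))
    ">"
    Capture {
        // Link text: everything up to the next "<"
        OneOrMore(CharacterClass.anyOf("<").inverted)
    }
    "</a>"
}

let samples = [
    "A regex to match href and extract info <a href=https://www.google.com>Google</a>.",
    "<a href=\"https://www.google.com\">Google</a>",
    "<a href='https://www.google.com'>Google</a>",
]

for sample in samples {
    if let match = sample.firstMatch(of: anchor) {
        print("link: \(match.output.1), text: \(match.output.2)")
    }
}
```

Like the original, this requires Swift 5.7 and iOS 16 (or macOS 13) or later.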

CodePudding user response:

Assuming the anchor tags in the file you wish to parse have no other attributes, you can use the following regex: /<a href=('|"|)([^'">]+)\1>([^<]+)<\/a>/$2 $3/gm.

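It first captures either a single quote, a double quote, or nothing, and then \1 refers back to that capture group, so the closing quote has to be the same as the opening one. Group 2 is the URL and group 3 is the link text. You can watch it live on regex101.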
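For use from Swift, the same backreference idea can be sketched with NSRegularExpression from Foundation. The helper name `extractLinks` is hypothetical, and the pattern writes the optional quote as (['"]?), which should behave the same as ('|"|) here. Treat it as an illustration under those assumptions rather than a definitive implementation.

```swift
import Foundation

// Hypothetical helper: returns the URL and link text of every simple anchor tag.
func extractLinks(from html: String) -> [(url: String, text: String)] {
    // Group 1: optional opening quote, group 2: URL, group 3: link text.
    // \1 requires the closing quote to equal whatever group 1 captured.
    let pattern = #"<a href=(['"]?)([^'">]+)\1>([^<]+)</a>"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }

    let fullRange = NSRange(html.startIndex..., in: html)
    return regex.matches(in: html, range: fullRange).compactMap { match in
        guard let urlRange = Range(match.range(at: 2), in: html),
              let textRange = Range(match.range(at: 3), in: html) else { return nil }
        return (url: String(html[urlRange]), text: String(html[textRange]))
    }
}

let sampleText = """
A regex to match href and extract info <a href=https://www.google.com>Google</a>.
<a href="https://www.google.com">Google</a>
<a href='https://www.google.com'>Google</a>
"""

for link in extractLinks(from: sampleText) {
    print("Group 1: \(link.url), Group 2: \(link.text)")
}
```

As with the regex itself, this only handles anchor tags whose sole attribute is href.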
