Regex: accurately match bold (**) and italic (*) items in the input

I am trying to parse Markdown content using regex. To grab bold and italic items from the input, I'm currently using this regex:

/(\*\*)(?<bold>[^**]+)(\*\*)|(?<normal>[^`*[~]+)|\*(?<italic>[^*]+)\*/g

Regex101 Link: https://regex101.com/r/2zOMid/1

The problems with this regex are:

if there is a single * inside bold text, the match breaks
if there is a long run like ******* anywhere in between, the match breaks

What I tried: I removed the [^**] part in the bold group, but that broke the bold match, which then ran to the last ** occurrence and included every ** in between.

What I want to have:

accurate bold
* allowed inside bold
accurate italics

Language: Javascript

Assumptions:

Bold text is wrapped inside **
Italic text is wrapped inside *

CodePudding user response：

[^**] will not avoid two consecutive *. It is a character class that is no different from [^*]. The repeated asterisk has no effect.
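
A quick check of that claim, as a minimal sketch (the sample string is made up for illustration): both classes produce the same matches.

```javascript
// A character class is a set of characters, so listing "*" twice changes nothing.
const single = /[^*]+/g;
const doubled = /[^**]+/g;

const sample = "ab*cd**ef";

const singleMatches = sample.match(single);
const doubledMatches = sample.match(doubled);

console.log(singleMatches);  // ["ab", "cd", "ef"]
console.log(doubledMatches); // ["ab", "cd", "ef"]
```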

The pattern for italic should come before the normal part, and the normal part should capture anything that remains. That could even be a lone asterisk, for example, so the pattern for normal text should allow it.

It will be easier to use split and use the bold/italic pattern for matching the "delimiter" of the split, while still capturing it. All the rest will then be "normal". The downside of split is that you cannot benefit from named capture groups, but they will just be represented by separate entries in the returned array.
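
To illustrate how split behaves when the delimiter pattern has a capture group, here is a minimal sketch. The italic-only delimiter and the sample string are simplifications assumed for the example, not the full pattern.

```javascript
// Simplified, italic-only delimiter (an assumption for illustration).
const delimiter = /\*([^*]+)\*/;

const parts = "one *two* three".split(delimiter);

// Even indexes hold the text between matches; odd indexes hold the captured group.
const labelled = parts.map((text, i) => [i % 2 ? "italic" : "normal", text]);

console.log(labelled);
// [["normal", "one "], ["italic", "two"], ["normal", " three"]]
```

Because the captures come back positionally, the type of each entry is derived from its index rather than from a group name.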

I will ignore the other syntax that markdown can have (like you seem to hint at with [ and ~ in your regex). On the other hand, it is important to deal well with backslash, as it is used to escape an asterisk.
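
As a small sketch of the escaping rule: once a piece of text has been classified, a backslash followed by a backslash or an asterisk can be reduced to the escaped character. The helper name `unescapeMarkdown` is hypothetical; the replacement pattern is the one used in this answer's snippet.

```javascript
// Hypothetical helper: drop the backslash in front of an escaped "\" or "*".
function unescapeMarkdown(text) {
    return text.replace(/\\([\\*])/g, "$1");
}

// String.raw keeps the backslashes exactly as typed.
const raw = String.raw`2 \* 3 \\ 4`;
const unescaped = unescapeMarkdown(raw);

console.log(unescaped); // 2 * 3 \ 4
```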

Here is the regular expression:

(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1

Here is a snippet with two functions:

a function that first splits the input into tokens, where each token is a pair, like ["normal", " this is normal text "] and ["i", "text in italics"]
another function that uses these tokens to generate HTML

The snippet is interactive. Just type the input, and the output will be rendered in HTML using the above sequence.

function tokeniseMarkdown(s) {
    const regex = /(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1/gs;
    const styles = ["i", "b"];
    // Matches follow this cyclic order: 
    //   normal text, mark (= "*" or "**"), formatted text, normal text, ...
    const types = ["normal", "mark", ""];
    return s.split(regex).map((match, i, matches) =>
        types[i%3] !== "mark" && match &&
            [types[i%3] || styles[matches[i-1].length-1], 
             match.replace(/\\([\\*])/g, "$1")]
    ).filter(Boolean); // Exclude empty matches and marks
}

function tokensToHtml(tokens) {
    const container = document.createElement("span");
    for (const [style, text] of tokens) {
        let node = style === "normal" ? document.createTextNode(text) 
                                      : document.createElement(style);
        node.textContent = text;
        container.appendChild(node);
    }
    return container.innerHTML;
}


// I/O management
document.addEventListener("input", refresh);

function refresh() {
    const s = document.querySelector("textarea").value;
    const tokens = tokeniseMarkdown(s);
    document.querySelector("div").innerHTML = tokensToHtml(tokens);
}
refresh();

textarea { width: 100%; height: 6em }
div { font: 22px "Times New Roman" }

<textarea>**fi*rst b** some normal text here **second b**  *first i* normal *second i* normal again</textarea><br>

<div></div>
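
To inspect the tokens without a browser, the tokeniser can be run on its own, for instance in Node. The sketch below repeats the function only so that the block runs standalone, and it assumes the `+?` quantifier in the pattern. The sample input is made up.

```javascript
function tokeniseMarkdown(s) {
    const regex = /(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1/gs;
    const styles = ["i", "b"];
    // split() yields: normal text, mark ("*" or "**"), formatted text, normal text, ...
    const types = ["normal", "mark", ""];
    return s.split(regex).map((match, i, matches) =>
        types[i % 3] !== "mark" && match &&
            [types[i % 3] || styles[matches[i - 1].length - 1],
             match.replace(/\\([\\*])/g, "$1")]
    ).filter(Boolean); // Exclude empty matches and marks
}

const tokens = tokeniseMarkdown("**fi*rst b** and *it*");
console.log(tokens);
// [["b", "fi*rst b"], ["normal", " and "], ["i", "it"]]
```

The lone asterisk inside the bold span stays part of the bold text, which is the behaviour asked for in the question.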

CodePudding user response：

After reading more about negative lookaheads, I came up with this regex:

/\*\*(?<bold>(?:(?!\*\*).)+)\*\*|`(?<code>[^`]+)`|~~(?<strike>(?:(?!~~).)+)~~|\[(?<linkTitle>[^\]]+)]\((?<linkHref>.*)\)|(?<normal>[^`[*~]+)|\*(?<italic>[^*]+)\*|(?<tara>[*~]{3,})|(?<sitara>[`[]+)/g

Regex101

This pretty much works for my input scenarios. If someone has a more optimized regex, please comment.
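
For reference, here is a minimal sketch of how the negative-lookahead idea can be consumed with named groups and `matchAll`. It uses a reduced pattern covering only bold, italic and normal text (an assumption to keep the example short), and the function name `classify` is hypothetical.

```javascript
// Reduced pattern: bold uses (?!\*\*). so a single "*" is allowed inside bold text.
const pattern = /\*\*(?<bold>(?:(?!\*\*).)+)\*\*|\*(?<italic>[^*]+)\*|(?<normal>[^*]+)/g;

function classify(input) {
    const result = [];
    for (const m of input.matchAll(pattern)) {
        // Only the named group of the alternative that matched is defined.
        const [type, text] = Object.entries(m.groups).find(([, value]) => value !== undefined);
        result.push([type, text]);
    }
    return result;
}

const classified = classify("**a*b** and *c*");
console.log(classified);
// [["bold", "a*b"], ["normal", " and "], ["italic", "c"]]
```

Without the `s` flag, `.` does not match line breaks, so a bold span in this sketch cannot cross a newline.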