How to exclude occurrences after a positive lookbehind?

Suppose I have the following markdown list items:

- [x] Example of a completed task.
- [x] ! Example of a completed task.
- [x] ? Example of a completed task.

I would like to parse each item with a regex and extract the following capture groups:

$1: the left [ and the right ] brackets when the symbol x is in-between
$2: the symbol x in between the brackets [ and ]
$3: the modifier ! that follows after [x]
$4: the modifier ? that follows after [x]
$5: the text that follows [x] without a modifier, e.g., [x] This is targeted.
$6: the text that follows [x] !
$7: the text that follows [x] ?

After a lot of trial-and-error using online parsers, I came up with the following:

((?<=x)\]|\[(?=x]))|((?<=\[)x(?=\]))|((?<=\[x\]\s)!(?=\s))|((?<=\[x\]\s)\?(?=\s))|((?<=\[x\]\s)[^!?].*)|((?<=\[x\]\s!\s).*)|((?<=\[x\]\s\?\s).*)

To make the regex above more readable, these are the capture groups listed one by one:

$1: ((?<=x)\]|\[(?=x]))
$2: ((?<=\[)x(?=\]))
$3: ((?<=\[x\]\s)!(?=\s))
$4: ((?<=\[x\]\s)\?(?=\s))
$5: ((?<=\[x\]\s)[^!?].*)
$6: ((?<=\[x\]\s!\s).*)
$7: ((?<=\[x\]\s\?\s).*)

This is most likely not the best way to do it, but at least it seems to capture what I want.
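One way to check which group fires on each list item is a short Ruby script (Ruby's Onigmo engine is a fork of Oniguruma, so the syntax should carry over). This is only a sketch: the constant LIST_RE and the helper groups_for are names made up for illustration, and the pattern is the one above written in extended mode with every ] escaped.

```ruby
# The question's pattern, one alternative per line (extended mode ignores
# the layout whitespace, so \s is used wherever a space must match).
LIST_RE = /
  ((?<=x)\]|\[(?=x\]))
  |((?<=\[)x(?=\]))
  |((?<=\[x\]\s)!(?=\s))
  |((?<=\[x\]\s)\?(?=\s))
  |((?<=\[x\]\s)[^!?].*)
  |((?<=\[x\]\s!\s).*)
  |((?<=\[x\]\s\?\s).*)
/x

# Hypothetical helper: maps each group number (1..7) to the list of
# strings that group captured anywhere in the line.
def groups_for(line)
  found = {}
  line.scan(LIST_RE) do
    match = Regexp.last_match
    (1..7).each do |i|
      (found[i] ||= []) << match[i] if match[i]
    end
  end
  found
end

[
  "- [x] Example of a completed task.",
  "- [x] ! Example of a completed task.",
  "- [x] ? Example of a completed task."
].each { |line| p groups_for(line) }
```

For the second item, groups 1, 2, 3 and 6 should be the only ones populated: both brackets, the x, the ! modifier, and the text after it.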

I would like to extend that regex to capture lines in a markdown table that looks like this:

|       | Task name                               |    Plan     |   Actual    |      File      |
| :---- | :-------------------------------------- | :---------: | :---------: | :------------: |
| [x]   | Task one with a reasonably long name.   | 08:00-08:45 | 08:00-09:00 |  [[task-one]]  |
| [x] ! | Task two with a reasonably long name.   | 09:00-09:30 |             |  [[task-two]]  |
| [x] ? | Task three with a reasonably long name. | 11:00-13:00 |             | [[task-three]] |

More specifically, I am interested in having the same group captures as above, but I would like to exclude the table grid (i.e., the |). So, groups $1 to $4 should stay the same, but groups $5 to $7 should capture only the cell text, excluding the |.

Do you have any ideas on how I can adjust, for example, the regex for group $5 to exclude the |? I have tried all sorts of negations (e.g., [^\|]) without success. I am using Oniguruma regular expressions.

CodePudding user response：

If I understood correctly, for a line like this

| [x] ! | Task two with a reasonably long name. | 09:00-09:30 | | [[task-two]] |

you want the text that follows each pipe sign (4 parts):

[x] !
Task two with a reasonably long name.
09:00-09:30
[[task-two]]

Then maybe you can try the following

((?<=\|\s)([^|]+))

There are 2 parts to the expression.

(?<=\|\s) Positive lookbehind: the match must be preceded by a pipe and a whitespace character.
([^|]+) One or more characters other than the pipe, so it greedily grabs the cell content (text, time, etc.) up to the next pipe.
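To see what this pattern returns on the sample row, here is a minimal Ruby sketch. The constant CELL_RE and the helper cells are illustrative names, and the quantifier is assumed to be +, which fits the "greedily" description above.

```ruby
# A lookbehind for a pipe plus one whitespace, then one or more non-pipe chars.
CELL_RE = /((?<=\|\s)([^|]+))/

# Hypothetical helper: returns the trimmed text of every cell the pattern matches.
def cells(row)
  # scan yields one [outer, inner] pair per match; both groups hold the same text.
  row.scan(CELL_RE).map { |outer, _inner| outer.strip }
end

row = "| [x] ! | Task two with a reasonably long name. | 09:00-09:30 | | [[task-two]] |"
puts cells(row)
# [x] !
# Task two with a reasonably long name.
# 09:00-09:30
# [[task-two]]
```

Note that the empty cell produces no match here, because [^|]+ needs at least one character, and that the checkbox and its modifier come back together as one cell rather than as the separate groups $1 to $4.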

CodePudding user response：

You can use

((?<=x)]|\[(?=x]))|((?<=\[)x(?=]))|((?<=\[x]\s)!(?=\s))|(?<=\[x]\s)(\?)(?=\s)|(?:\G(?!\A)\||(?<=\[x]\s[?!\s]\s\|))\K([^|\n]*)(?=\|)

See the regex101 PCRE demo and the Ruby (Onigmo/Oniguruma) demo.

What is added? The (?:\G(?!\A)\||(?<=\[x]\s[?!\s]\s\|))\K([^|\n]*)(?=\|) part:

(?: - start of a non-capturing group that acts as a custom boundary and matches either
- \G(?!\A)\| - the end of the previous match followed by a | char (i.e., | must immediately follow the previous match),
- |(?<=\[x]\s[?!\s]\s\|) - or a location that is immediately preceded by [x], a whitespace, then a ?, ! or whitespace, then another whitespace and a | char
) - end of the group
\K - match reset operator that removes the text matched so far from the overall match memory buffer
([^|\n]*) - zero or more chars other than | and a line feed char
(?=\|) - a | char must appear immediately to the right of the current location.
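The pieces above can be tried out with a short Ruby sketch. TABLE_RE and parse_row are hypothetical names, the ] characters are escaped for readability, and the cell text is trimmed after matching because the pattern itself keeps the padding spaces. With this pattern all cell text lands in group 5.

```ruby
# The answer's pattern in extended mode, one alternative per line.
TABLE_RE = /
  ((?<=x)\]|\[(?=x\]))
  |((?<=\[)x(?=\]))
  |((?<=\[x\]\s)!(?=\s))
  |(?<=\[x\]\s)(\?)(?=\s)
  |(?:\G(?!\A)\||(?<=\[x\]\s[?!\s]\s\|))\K([^|\n]*)(?=\|)
/x

# Hypothetical helper: maps each group number (1..5) to the trimmed
# strings that group captured in one table row.
def parse_row(row)
  found = {}
  row.scan(TABLE_RE) do
    match = Regexp.last_match
    (1..5).each do |i|
      (found[i] ||= []) << match[i].strip if match[i]
    end
  end
  found
end

row = "| [x] ! | Task two with a reasonably long name.   | 09:00-09:30 |             |  [[task-two]]  |"
parse_row(row).each { |group, values| puts "group #{group}: #{values.inspect}" }
```

Since \G ties every later cell to the end of the previous match, rows without [x] (the header and the separator row) should yield no cell matches at all. A blank cell such as the empty Actual column still matches and comes back as an empty string once trimmed.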