Stuck in regex/markdown replacement issue (double matches and replace using data from the matches)

Time:08-24

So here is my problem

Sample input (file.txt)

all chars possible

[[example.org]][1500]

all chars possible

[[example.org]][1318]

all chars possible

[1318]: https://web.example.org/web/https://example.com
[1500]: https://web.example.org/web/https://example.net

all chars possible

Sample desired output (file.txt)

all chars possible

[[example.org]](https://web.example.org/web/https://example.net)

all chars possible

[[example.org]](https://web.example.org/web/https://example.com)

all chars possible

(basically again every special char)

For the first match, the brackets are constant and the number is variable, 1 to 4 digits long: [[example.org]][variable number]

For the second match, the brackets and the colon are constant. The number is variable (1 to 4 digits) and matches the number from the first match (a Markdown reference). It is followed by a space and then a URL that may contain any valid URL character BUT always starts with https://web.example.org/web/
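As a rough sketch of the two patterns just described, the following uses `grep -E` to count each kind of line. The temporary file and the variable names are assumptions for illustration, and the link text is matched loosely as "anything without a closing bracket" rather than the literal example.org.

```shell
# Build a small sample file (contents taken from the sample input above).
sample=$(mktemp)
cat > "$sample" <<'EOF'
all chars possible

[[example.org]][1500]

all chars possible

[[example.org]][1318]

[1318]: https://web.example.org/web/https://example.com
[1500]: https://web.example.org/web/https://example.net
EOF

# First pattern: [[link text]][N], where N is 1 to 4 digits.
# (A "]" outside a bracket expression is a literal character in ERE.)
link_re='\[\[[^]]*]]\[[0-9]{1,4}]'
# Second pattern: [N]: URL, where the URL starts with the fixed prefix.
def_re='^\[[0-9]{1,4}]: https://web\.example\.org/web/'

# grep -c counts matching lines, not individual matches.
links=$(grep -cE "$link_re" "$sample")
defs=$(grep -cE "$def_re" "$sample")
echo "reference links: $links, definitions: $defs"
```

If the two counts differ on the real file, some references probably have no definition (or the reverse), which is worth knowing before any replacement is attempted.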

So I want to replace things as follows

[[example.org]][1500]

With (for every matching number, without being tripped up by the many special chars in the file)

[[example.org]](URL)

In this case

[[example.org]](https://web.example.org/web/https://example.net)

And then remove the [1500]: https://web.example.org/web/https://example.net after modification

Ending with:

(tons of special chars/text/url)

[[example.org]](https://web.example.org/web/https://example.net)

(tons of special chars/text/url)

The first regex I made is:

\[\[example.org\]\]\[([0-9]{1,4})\]

Capture group 1 gets the number, 1500 (the digits inside the brackets)

The second regex I made is:

\[1500\]: (.*)

Capture group getting the URL in a dirty way

But it should use capture group 1 from the first regex, so I ended up with this

\1: (.*)

And I want to end up with only the first matched regex as follows instead of a numbered reference. I'm stuck here.

[[example.org]](captured URL)

Followed by removal of the entire line (no space left)

[1500]: URL
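To make the intended end state concrete, here is a minimal sketch that resolves a single reference by hand with sed. The number and URL are hard-coded (hypothetical variables `num` and `url`), so this is not a general solution, only an illustration of the two steps: rewrite the link, then delete the definition line.

```shell
# Hypothetical one-reference demo: the number and URL are hard-coded.
num=1500
url='https://web.example.org/web/https://example.net'

input='before
[[example.org]][1500]
after
[1500]: https://web.example.org/web/https://example.net'

# 1) Rewrite the reference link. "|" is used as the s-command delimiter so
#    the slashes in the URL need no escaping (a URL containing "|" or "&"
#    would still need escaping first).
# 2) Delete the matching definition line entirely, leaving no blank line.
result=$(printf '%s\n' "$input" |
  sed -E -e "s|(\[\[[^]]*]])\[$num]|\1($url)|" \
         -e "/^\[$num]: /d")
printf '%s\n' "$result"
```

Doing this for every number means first collecting all number/URL pairs, which is why a two-pass approach is a natural fit for this problem.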

I tried some JS but got stuck because all the special chars in the document kept breaking things. Any help would be greatly appreciated. I don't know if or how this is possible with a simple search/replace using regexes, but I'm sure there must be an easy way to do it, maybe using sed. This is to help our open-source project, and I admit I'm not the best at this.


CodePudding user response:

Using GNU sed

$ sed -Ez ':a;s/(\[\[[^[]*)(\[[^\n]*)(.*)\2: ([^\n]*)(\n.*$)?/\1(\4)\3/;ta' file.txt
all chars possible

[[example.org]](https://web.example.org/web/https://example.net)

all chars possible

[[example.org]](https://web.example.org/web/https://example.com)

all chars possible

CodePudding user response:

An awk solution will be a better fit here:

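awk -F ': ' '
# First pass (FS is ": "): remember each "[N]: URL" definition.
FNR==NR {
   if (NF==2)
      map[$1] = $2
   next
}
# Second pass (FS is "]]" or ": "): $2 is "[N]" on reference-link lines.
# Assign $0 directly so no output separator is inserted before "]]".
$2 in map {
   $0 = $1 "]](" map[$2] ")"
}
# Print every line except the definitions themselves.
!($1 in map)
' file FS=']]|: ' file

all chars possible

[[example.org]](https://web.example.org/web/https://example.net)

all chars possible

[[example.org]](https://web.example.org/web/https://example.com)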

all chars possible
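
Whichever answer is used, a quick sanity check on the result can catch references that were missed. The sketch below is an assumption-laden helper (the function name `check_converted` is made up for illustration): it succeeds only if no `[[text]][N]` links and no `[N]: ` definition lines remain in the given file.

```shell
# Hypothetical helper: exit status 0 means nothing is left to convert.
check_converted() {
  # grep -q prints nothing and succeeds on the first match, so negate it.
  # Pattern: a leftover [[text]][N] link, or a leftover "[N]: " definition.
  ! grep -qE '\[\[[^]]*]]\[[0-9]{1,4}]|^\[[0-9]{1,4}]: ' "$1"
}

# Small demonstration on a temporary file that is already converted.
converted=$(mktemp)
printf '%s\n' \
  'all chars possible' \
  '[[example.org]](https://web.example.org/web/https://example.net)' \
  > "$converted"

if check_converted "$converted"; then
  echo "no numbered references left"
else
  echo "some numbered references remain"
fi
```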