Split blocks of data with the same title using regex

Time:06-09

I have a long string that is built like this:

[[title]]
a = "1"
b = "1"
c = "1"
d = "1"
e = [
 "1",
 "1",
]

[[title]]
a = "2"
b = "2"
c = "2"
d = "2"
e = [
 "2",
]

[[title]]
a = "a3"
b = "3"
c = "3"

[[title]]
a = "a4"
b = "4"
c = "4"
e = [
 "4",
]

My goal is to extract the text under each title (without the title itself) and put it into a slice. I've tried anchoring on the attribute keys (such as d and e), but they are sometimes missing.

Here is my current regex:

(?m)(((\[\[title]]\s*\n)(?:^.+$\n)+?)(d.*?$)(\s*e(.|\n)*?])?)

I want to find a way to extract the data between each title until \n or end of string

Edit:

I'm using Go, so I can't use lookaround (lookahead / lookbehind) syntax.

Thanks!

CodePudding user response:

You can use the following pattern that matches from [[title]] to an empty line.

`\[\[title]](.*?)^$`gms

Explanation

  • \[\[title]] Match [[title]]
  • ( Capturing group
    • .*? Non-greedy match till next match
  • ) Close group
  • ^$ Using m (multiline) flag this means an empty line

See the demo with the Golang regex engine
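
As a rough sketch of how this pattern could be used from Go: the g flag corresponds to calling one of the FindAll methods, and the m and s flags can be written inline as (?ms). The function name extractBlocks and the shortened sample input are assumptions made for illustration, not part of the original answer.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// titleBlockRe is the answer's pattern with the m and s flags written inline,
// because Go's regexp package takes flags inside the pattern itself.
var titleBlockRe = regexp.MustCompile(`(?ms)\[\[title]](.*?)^$`)

// extractBlocks (hypothetical helper) returns the text under each [[title]].
func extractBlocks(input string) []string {
	var blocks []string
	for _, m := range titleBlockRe.FindAllStringSubmatch(input, -1) {
		// m[1] is the captured body; trim the newlines around it.
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

func main() {
	// Shortened sample input, modelled on the question's data.
	input := "[[title]]\na = \"1\"\ne = [\n \"1\",\n]\n\n[[title]]\na = \"a3\"\nb = \"3\"\n"
	for i, b := range extractBlocks(input) {
		fmt.Printf("block %d:\n%s\n", i, b)
	}
}
```

Note that this relies on every block being followed by an empty line or by a final newline at the end of the string.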

CodePudding user response:

This seems to work. It's not as simple or elegant as @ArtyomVancyan's answer, although it has the little advantage that it doesn't need a newline at the end of the expression:

[Demo]

(?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+

Explanation:

  • (?m): multi-line modifier.
  • (?:\[\[title]]\n(<text until next closing square bracket or blank line>))+: find one or more blocks starting with [[title]]\n and followed by <text until next closing square bracket or blank line>, and capture those texts.
  • (?:.*\n)+?(?:\]|^$): two consecutive non-capturing subgroups; the first is a run of lines, (?:.*\n)+, made non-greedy with ?; the second is either a closing square bracket, ], or an empty line, ^$. That is, a run of lines ending at either the first line containing a closing square bracket or a blank line.
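
A minimal Go sketch of this approach, assuming the pattern reads (?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+ with its + quantifiers in place. The helper name blocksUntilBracket and the sample input are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bracketOrBlankRe captures the lines after [[title]] up to the first
// closing square bracket or the first empty line, whichever comes first.
var bracketOrBlankRe = regexp.MustCompile(`(?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+`)

// blocksUntilBracket (hypothetical helper) collects capture group 1 of every match.
func blocksUntilBracket(input string) []string {
	var blocks []string
	for _, m := range bracketOrBlankRe.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

func main() {
	// Shortened sample input, modelled on the question's data.
	input := "[[title]]\na = \"1\"\ne = [\n \"1\",\n]\n\n[[title]]\na = \"a3\"\nb = \"3\"\n"
	for i, b := range blocksUntilBracket(input) {
		fmt.Printf("block %d:\n%s\n", i, b)
	}
}
```

One consequence of stopping at the first ] is that any keys appearing after a list within the same block would be cut off, so this sketch only fits data where the list, if present, comes last.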

CodePudding user response:

Instead of writing a regex, which seems fraught with peril, you'll probably be better served by building a custom parser for your custom format, or you may find you can repurpose an existing INI config parser.

If the titles are always enclosed in a pair of [[ ]] and always sit at the start of a block, you could use a regex to find them, but only to separate the blocks from each other.

If you're not interested in the content (though surely that is the next step) and you're sure the structure is as simple as you show, you could also just split directly on these delimiters twice instead:

>>> long_string_config = """ """  # input data omitted for brevity
>>> for block in filter(None, (a.split("]]")[-1].strip() for a in long_string_config.split("[["))):
...    print("---")
...    print(block)
...
---
a = "1"
b = "1"
c = "1"
d = "1"
e = [
 "1",
 "1",
]
---
a = "2"
b = "2"
c = "2"
d = "2"
e = [
 "2",
]
---
a = "a3"
b = "3"
c = "3"
---
a = "a4"
b = "4"
c = "4"
e = [
 "4",
]
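
Since the question is about Go, the same idea might look roughly like the following there. This is a sketch that assumes the header is always the literal text [[title]], which is narrower than the Python version above; splitOnTitle is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOnTitle (hypothetical helper) splits the input on the literal header
// and keeps only the non-empty, trimmed pieces.
func splitOnTitle(input string) []string {
	var blocks []string
	for _, part := range strings.Split(input, "[[title]]") {
		if body := strings.TrimSpace(part); body != "" {
			blocks = append(blocks, body)
		}
	}
	return blocks
}

func main() {
	// Shortened sample input, modelled on the question's data.
	input := "[[title]]\na = \"1\"\ne = [\n \"1\",\n]\n\n[[title]]\na = \"a3\"\nb = \"3\"\n"
	for _, b := range splitOnTitle(input) {
		fmt.Println("---")
		fmt.Println(b)
	}
}
```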

CodePudding user response:

You might use a pattern to repeat the possible format of the lines under the title part.

The lines start with word characters followed by = and then either a part "..." or [...]

\[\[title]]((?:\r?\n\w+\s*=\s*(?:"[^"]*"|\[[^\]\[]*]))*)

Explanation

  • \[\[title]] Match [[title]]
  • ( Capture group 1
    • (?: Non capture group
      • \r?\n Match a newline
      • \w+\s*=\s* Match 1+ word chars and = between optional whitespace chars
      • (?: Non capture group for the alternatives
        • "[^"]*" Match from "..."
        • | Or
        • \[[^\]\[]*] match from [...]
      • ) Close non capture group
    • )* Close non capture group and optionally repeat
  • ) Close group 1
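
The breakdown above could be tried in Go along these lines. This sketch assumes the key part of the pattern is \w+ (one or more word characters); the helper name keyValueBlocks and the sample input are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// kvBlockRe matches [[title]] followed by any number of lines of the form
// key = "..." or key = [...]; group 1 holds all of those lines together.
var kvBlockRe = regexp.MustCompile(`\[\[title]]((?:\r?\n\w+\s*=\s*(?:"[^"]*"|\[[^\]\[]*]))*)`)

// keyValueBlocks (hypothetical helper) returns group 1 of every match.
func keyValueBlocks(input string) []string {
	var blocks []string
	for _, m := range kvBlockRe.FindAllStringSubmatch(input, -1) {
		// Group 1 starts with the newline that follows the title.
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

func main() {
	// Shortened sample input, modelled on the question's data.
	input := "[[title]]\na = \"1\"\ne = [\n \"1\",\n]\n\n[[title]]\na = \"a3\"\nb = \"3\"\n"
	for i, b := range keyValueBlocks(input) {
		fmt.Printf("block %d:\n%s\n", i, b)
	}
}
```

Because the pattern describes the shape of each line rather than a terminator, it does not depend on blank lines between blocks, but it stops at the first line that is not of the form key = value.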

Regex demo