Split string with a certain keyword outside a string but not inside a string

I have a question about how to use regex for this case (any other Python solution is fine too):

I want to split on the colon ':' when it appears outside a quoted string, but not when it is inside one, as in the example below:

Regex I use: (?!\B"[^"]*):(?![^"]*"\B)

string_to_split: str = '"A: String 1": "B: String 2": C: "D: String 4"'

Output > ["A: String 1", "B: String 2", 'C', "D: String 4"]

That gives what I expect, but it stops working when a quoted string begins with anything other than a letter or a number (the regex no longer splits if the string starts with a symbol, a space, etc.), as in this one:

string_to_split: str = '"A: String 1": "B: String 2": C: " D: String 4"' (space before letter 'D')

Output > ["A: String 1", "B: String 2": C: " D: String 4"]

I'm doing this because I want to get more comfortable with regex in Python (I barely use regex when coding). I think it might need a look-ahead or look-behind, but I don't know much about them. I would really appreciate any solution for this, thank you.

CodePudding user response：

You can also use a TTP template to parse your data without regex. I made an example with a space at the beginning, as in the "D" string from your question. In short, you can extract the data by writing your own TTP templates.

from ttp import ttp
import json

string_to_split = "  D: String 4"

ttp_template_0 ="""
  {{D:}} {{String}} {{4}}
"""
parser_0 = ttp(data=string_to_split, template=ttp_template_0)
parser_0.parse()

# Print the result in JSON format
results_0 = parser_0.result(format='json')[0]
print(results_0)

# Convert the JSON string to a list with json.loads
result_0 = json.loads(results_0)
print(result_0[0])

See the following parsed data:
[
    {
        "4": "4",
        "D:": "D:",
        "String": "String"
    }
]
{'4': '4', 'D:': 'D:', 'String': 'String'}

CodePudding user response：

Would you please try the following:

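import re

# A quoted run (colons allowed inside the quotes) or a run of non-colon characters
pat = r'(?:[^:]*"[^"]+"[^:]*)|[^:]+'
string_to_split = '"A: String 1": "B: String 2": C: " D: String 4"'

m = [x.strip() for x in re.findall(pat, string_to_split)]
#m = [x.strip('" ') for x in re.findall(pat, string_to_split)]      # removes double quotes too
print(m)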

Output:

['"A: String 1"', '"B: String 2"', 'C', '" D: String 4"']

The regex pat matches any sequence of characters other than a colon, while allowing colons inside double quotes.
The regex keeps the leading/trailing whitespace, which is then removed by strip().
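
To make the whitespace point concrete, here is a small sketch that prints the matches before and after strip(). The names text, raw and cleaned are only for illustration, and the pattern is written out with its + quantifiers.

```python
import re

# First alternative: a double-quoted run, where colons are allowed inside the quotes.
# Second alternative: a plain run of non-colon characters.
pat = r'(?:[^:]*"[^"]+"[^:]*)|[^:]+'
text = '"A: String 1": "B: String 2": C: " D: String 4"'

raw = re.findall(pat, text)          # each match keeps the space that followed the colon
cleaned = [x.strip() for x in raw]   # strip() drops that surrounding whitespace

print(raw)      # ['"A: String 1"', ' "B: String 2"', ' C', ' " D: String 4"']
print(cleaned)  # ['"A: String 1"', '"B: String 2"', 'C', '" D: String 4"']
```

Note that strip() only touches whitespace outside the quotes, so the space in front of D stays inside the quoted string.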

If you want to remove the surrounding double quotes as well, apply strip('" ') instead. Then the output will be:

['A: String 1', 'B: String 2', 'C', 'D: String 4']
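
As for why the pattern in the question stops splitting: as far as I can tell, its second look-ahead (?![^"]*"\B) inspects the next double quote after each colon and refuses to split when that quote is followed by a non-word character or the end of the string. That distinguishes a closing quote from an opening quote only while every quoted string starts with a letter, digit or underscore. With " D the opening quote is followed by a space, so it looks like a closing quote. The sketch below just reproduces both cases with re.split; the variable names are mine.

```python
import re

# Pattern copied from the question.
question_pat = r'(?!\B"[^"]*):(?![^"]*"\B)'

works = '"A: String 1": "B: String 2": C: "D: String 4"'
fails = '"A: String 1": "B: String 2": C: " D: String 4"'  # space before D

split_ok = re.split(question_pat, works)
split_bad = re.split(question_pat, fails)

print(split_ok)   # ['"A: String 1"', ' "B: String 2"', ' C', ' "D: String 4"']
print(split_bad)  # ['"A: String 1"', ' "B: String 2": C: " D: String 4"']
```

In the second call, both colons around C see the quote before " D" as the next quote, and because a space follows it, neither colon is used as a split point.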