How to extract data from nested parentheses?

I have a string:

test_string = 'I(30TCH(50EDFva_25VAP_25SNE)_20UDS(80EDFvd_10VAP_10SNE)_20EDU(SNE)_10UDS(80EDFva_10VAP_10SNE)_10EDU(50EDFva_50VAP)_10EDP(50EDFva_50SNE))'

I need to extract the data from the string, and the final result should look like this:

I,
30TCH:50EDFva, 25VAP, 25SNE,
20UDS:80EDFvd, 10VAP, 10SNE
....

and so on..

I thought about using regex, but it does not seem like a good solution here.

CodePudding user response：

This seems to get you (almost) there -

import re
[part.replace("(", ": ").replace("_", ", ") for part in re.split(r"\)_", test_string)]

Output

['I: 30TCH: 50EDFva, 25VAP, 25SNE',
 '20UDS: 80EDFvd, 10VAP, 10SNE',
 '20EDU: SNE',
 '10UDS: 80EDFva, 10VAP, 10SNE',
 '10EDU: 50EDFva, 50VAP',
 '10EDP: 50EDFva, 50SNE))']
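The "almost" above is the leading `I: ` on the first item and the trailing `))` on the last one. A minimal sketch of one way to tidy that up, assuming the input always has a single outer name wrapped around everything else (the names `outer`, `body` and `lines` are only for illustration):

```python
import re

test_string = 'I(30TCH(50EDFva_25VAP_25SNE)_20UDS(80EDFvd_10VAP_10SNE)_20EDU(SNE)_10UDS(80EDFva_10VAP_10SNE)_10EDU(50EDFva_50VAP)_10EDP(50EDFva_50SNE))'

# Separate the outer name from everything inside the outermost parentheses.
outer, _, body = test_string.partition("(")
body = body[:-1]  # drop the ")" that closes the outer "("

# Same split-and-replace idea, applied to the body only.
# rstrip(")") removes the closing parenthesis left on the last group.
lines = [
    part.rstrip(")").replace("(", ": ").replace("_", ", ")
    for part in re.split(r"\)_", body)
]

print(outer)
print("\n".join(lines))
```

This prints `I` on its own line, followed by one line per group.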

CodePudding user response：

I think we may need a little more clarification on the logic. It looks like ( should translate into a :, but not every time. Here is my crack at it using regexes. This might not be exactly what you are looking for, but should be pretty close:

import re

def main():
    test_string = 'I(30TCH(50EDFva_25VAP_25SNE)_20UDS(80EDFvd_10VAP_10SNE)_20EDU(SNE)_10UDS(80EDFva_10VAP_10SNE)_10EDU(50EDFva_50VAP)_10EDP(50EDFva_50SNE))'
    
    # Order matters: handle ")_" before the bare "_" and "(" replacements.
    test_string = re.sub(r"\)_", ",\n", test_string)
    test_string = re.sub(r"_", ",", test_string)
    test_string = re.sub(r"\(", ":", test_string)
    test_string = re.sub(r"\)\)", "", test_string)

    print(test_string)

if __name__ == "__main__":
    main()

results:

I:30TCH:50EDFva,25VAP,25SNE,
20UDS:80EDFvd,10VAP,10SNE,
20EDU:SNE,
10UDS:80EDFva,10VAP,10SNE,
10EDU:50EDFva,50VAP,
10EDP:50EDFva,50SNE

This is just a series of regex substitutions. Because the re.sub calls run in sequence, each one cleans up the string a little more. You could also tweak the beginning of the string to change the first : into ,\n, but I'm not sure whether your input always has the same shape.
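If the input does always start with one outer name followed by `(`, the tweak to the first `:` could look roughly like this. It is only a sketch: `format_groups` is a made-up helper name, and the last substitution strips any run of closing parentheses at the end rather than exactly two.

```python
import re

def format_groups(text):
    """Hypothetical helper: apply the substitutions in order, then fix the first ':'."""
    text = re.sub(r"\)_", ",\n", text)   # end of a group -> comma and newline
    text = re.sub(r"_", ",", text)       # remaining separators -> commas
    text = re.sub(r"\(", ":", text)      # opening parentheses -> colons
    text = re.sub(r"\)+$", "", text)     # drop the closing parentheses at the very end
    # Only the first ':' belongs to the outer name, so replace just that one.
    return text.replace(":", ",\n", 1)

test_string = 'I(30TCH(50EDFva_25VAP_25SNE)_20UDS(80EDFvd_10VAP_10SNE)_20EDU(SNE)_10UDS(80EDFva_10VAP_10SNE)_10EDU(50EDFva_50VAP)_10EDP(50EDFva_50SNE))'
result = format_groups(test_string)
print(result)
```

The first two lines of the output are then `I,` and `30TCH:50EDFva,25VAP,25SNE,`, which is close to the layout asked for in the question.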

CodePudding user response：

Regex will work fine. After you remove the outer I(), you are left with several sets of a "prefix" followed by a (group_of_data).

If you don't want trailing commas, try this:

import re

regex = r"[^(]+\([^)]+\)"  # a prefix, then one parenthesized group

s = 'I(30TCH(50EDFva_25VAP_25SNE)_20UDS(80EDFvd_10VAP_10SNE)_20EDU(SNE)_10UDS(80EDFva_10VAP_10SNE)_10EDU(50EDFva_50VAP)_10EDP(50EDFva_50SNE))'

# Everything before the first "(" is the outer name.
first_start = s.index('(')
print(s[:first_start])

# Search only inside the outer parentheses.
matches = re.finditer(regex, s[first_start + 1:-1])

for match in matches:
  g = match.group().lstrip('_')
  data_start = g.index('(')
  prefix = g[:data_start]
  data = ', '.join(g[data_start + 1:-1].split('_'))
  print(f'{prefix}:{data}')

Output

I
30TCH:50EDFva, 25VAP, 25SNE
20UDS:80EDFvd, 10VAP, 10SNE
20EDU:SNE
10UDS:80EDFva, 10VAP, 10SNE
10EDU:50EDFva, 50VAP
10EDP:50EDFva, 50SNE
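If the nesting can go deeper than in this example, a plain depth counter is one alternative to regex. The following is a sketch only: it assumes the parentheses are balanced, and `split_top_level` and `parse` are hypothetical helper names, not part of any library.

```python
def split_top_level(text, sep="_"):
    """Split text on sep, ignoring separators that sit inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts

def parse(text):
    """Return (name, children); children is a list of parsed (name, children) items."""
    name, paren, rest = text.partition("(")
    if not paren:
        return name, []  # a leaf such as '25VAP'
    body = rest[:-1]  # drop the ")" that matches the first "("
    return name, [parse(part) for part in split_top_level(body)]

s = 'I(30TCH(50EDFva_25VAP_25SNE)_20UDS(80EDFvd_10VAP_10SNE)_20EDU(SNE)_10UDS(80EDFva_10VAP_10SNE)_10EDU(50EDFva_50VAP)_10EDP(50EDFva_50SNE))'

name, groups = parse(s)
print(name)
for prefix, items in groups:
    print(f"{prefix}:{', '.join(item for item, _ in items)}")
```

For the sample string this prints the same lines as the output above. Because `parse` is recursive, it also returns a nested structure that can be walked further if a group ever contains its own sub-groups.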