Extract strings between brackets and nested brackets

Time:10-24

I have a file of text and titles (titles are indicated by a leading ";"):

;star/stellar_(class(ification))_(chart)

Hertzsprung-Russell classification of stars shows us . . .

What I want is to split each title by "_" into ['star/stellar','(class(ification))','(chart)'], then iterate through the parts and extract what's in the brackets, e.g. '(class(ification))' to {'class':'ification'} and '(chart)' to just ['chart']. All I've done so far is the splitting part:

for ln in open(file,"r").read().split("\n"):
    if ln.startswith(";"):
        keys=ln[1:].split("_")

I have ways to extract the bits in brackets, but I have had trouble finding one that handles nested brackets in order. I've tried things like re.findall('\(([^)]+)',ln), but that returns ['star/stellar', '(class', 'chart']. Any ideas?

CodePudding user response:

You can split (again) on the parentheses then do some cleaning:

x = ['star/stellar','(class(ification))','(chart)']

for v in x:
  y = v.split('(')
  y = [a.replace(')','') for a in y if a != '']
  if len(y) > 1:
    print(dict([y]))
  else:
    print(y)

Gives:

['star/stellar']
{'class': 'ification'}
['chart']

CodePudding user response:

If all of the title lines have the same format, that is, they all have the three parts ;some/title_(some(thing))_(something), then you can unpack the parts into separate variables:

first, second, third = ln.split("_")

From there, you know that:

  • for the first item you need to drop the ;:

    first = first[1:]
    
  • for the second item, you want to extract the stuff in the parentheses and then merge it into a dict:

    # needs "import re" at the top of the script
    k, v = filter(bool, re.split('[()]', second))
    second = {k:v}
    
  • for the third item, you want to drop the surrounding parentheses

    third = third[1:-1]
    

Then you just need to put them all together again:

[first, second, third]
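
Put together, the steps above can be sketched as a single function. This is only a minimal sketch: it assumes every title line has exactly three "_"-separated parts in the format shown, and the name parse_title is made up for illustration.

```python
import re

def parse_title(ln):
    """Parse ';a_(b(c))_(d)' into ['a', {'b': 'c'}, 'd'] (hypothetical helper)."""
    # Assumes exactly three parts separated by "_".
    first, second, third = ln.split("_")
    first = first[1:]                      # drop the leading ";"
    # Splitting on either parenthesis leaves empty strings; filter them out.
    k, v = filter(bool, re.split("[()]", second))
    second = {k: v}
    third = third[1:-1]                    # drop the surrounding parentheses
    return [first, second, third]

print(parse_title(";star/stellar_(class(ification))_(chart)"))
```

A line that does not split into exactly three parts, or whose second part has no nested group, raises ValueError at one of the unpacking steps, so this only suits files where every title follows the pattern.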

CodePudding user response:

You can do this with splits. If you separate the string using '_(' instead of only '_', every part from the second onward will be an enclosed keyword. You can strip the closing parentheses and split those parts on the '(' to get either one component (if there was no nested parenthesis) or two components. You then form either a one-element list or a dictionary, depending on the number of components.

line = ";star/stellar_(class(ification))_(chart)"

if line.startswith(";"):
    parts = [ part.rstrip(")") for part in line.split("_(")[1:]]
    parts = [ part.split("(",1) for part in parts ]
    parts = [ part if len(part)==1 else dict([part]) for part in parts ]
    print(parts)
       
Gives:

[{'class': 'ification'}, ['chart']]

Note that I assumed that the first part of the string is never included in the process and that there can only be one nested group at the end of the parts. If that is not the case, please update your question with relevant examples and expected output.
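
If deeper nesting does turn up, one possible approach is to recurse on the text that follows each opening parenthesis. The following is only a sketch under assumptions that are not in the question: parentheses are balanced, each group nests as a single chain such as (a(b(c))) with no sibling groups, no "_" appears inside brackets, and the first part is kept unchanged. The names nest, parse_part and parse_title_nested are hypothetical.

```python
def nest(text):
    """Turn 'a(b(c))' into {'a': {'b': 'c'}}; a plain 'a' is returned as-is."""
    head, sep, rest = text.partition("(")
    if not sep:                  # no nested group left
        return head
    # rest ends with the ")" that closes this group; drop it and recurse.
    return {head: nest(rest[:-1])}

def parse_part(part):
    """Parse one '_'-separated piece of a title line."""
    if not (part.startswith("(") and part.endswith(")")):
        return part              # unbracketed piece, e.g. 'star/stellar'
    value = nest(part[1:-1])     # strip the outer parentheses
    # A bracketed piece without nesting becomes a one-element list.
    return [value] if isinstance(value, str) else value

def parse_title_nested(line):
    # Drop the leading ";" and handle each piece independently.
    return [parse_part(p) for p in line[1:].split("_")]

print(parse_title_nested(";star/stellar_(class(ification))_(chart)"))
```

Under these assumptions '(a(b(c)))' comes out as {'a': {'b': 'c'}}. Sibling groups or unbalanced parentheses would need a real depth-tracking parser instead.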
