Extract info from a bullet list text in python

Time:09-05

I'm working in Python with the EU NACE classification system (see, for example, https://nacev2.com/en), and I would like to process the description text of leaf activities to get a finite list of what each one includes. For example, from this text:

This class includes:
- decaffeinating and roasting of coffee
- production of coffee products:
  . ground coffee
  . soluble coffee
  . extracts and concentrates of coffee
- manufacture of coffee substitutes
- blending of tea and maté
- manufacture of extracts and preparations based on tea or maté
- packing of tea including packing in tea-bags

I would like to get

- decaffeinating and roasting of coffee
- production of coffee products
- production of ground coffee
- production of soluble coffee
- production of extracts and concentrates of coffee
- manufacture of coffee substitutes
- blending of tea and maté
- manufacture of extracts and preparations based on tea or maté
- packing of tea including packing in tea-bags

That is, I want to transform each second-level bullet point by prepending the activity named in its parent first-level bullet point.

Do you have any suggestions for a library or tool that could help? I thought about regex, but I could not get a good result :(

CodePudding user response:

Here is a slightly hacky solution:

data = """
This class includes:
- decaffeinating and roasting of coffee
- production of coffee products:
  . ground coffee
  . soluble coffee
  . extracts and concentrates of coffee
- manufacture of coffee substitutes
- blending of tea and maté
- manufacture of extracts and preparations based on tea or maté
- packing of tea including packing in tea-bags
"""

result = []
first_level = None
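for line in data.split('\n'):
    line = line.strip()
    if not line:
        continue  # skip blank and whitespace-only lines

    if line[0] == '-':
        # first-level bullet: drop the marker and any trailing colon
        first_level = line.lstrip(' -').rstrip(' :')
        result.append(first_level)
    elif line[0] == '.' and first_level:
        # second-level bullet: reuse the parent's text before " of "
        activity = first_level.split(' of ')[0]
        result.append(
            f'{activity} of {line.lstrip(" .")}'
        )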

print('\n'.join(result))

It works only if every second-level bullet point starts with a dot and every activity contains the word "of". Let me know if there are any edge cases that need to be handled.
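One way to relax the second restriction could look like the sketch below. The helper name expand_bullets is hypothetical, and the fallback is an assumption rather than part of the answer above: when a first-level bullet has no " of ", its whole text is used as the prefix, which may not always read well. Second-level bullets that appear before any first-level bullet are skipped.

```python
def expand_bullets(text):
    """Flatten a two-level bullet list into first-level items only."""
    result = []
    activity = None  # prefix taken from the most recent "-" line
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("-"):
            # drop the bullet marker and any trailing colon
            parent = line.lstrip("- ").rstrip(": ")
            # partition() returns the whole string when " of " is absent,
            # so a parent without "of" is used as the prefix in full (assumption)
            activity = parent.partition(" of ")[0]
            result.append("- " + parent)
        elif line.startswith(".") and activity is not None:
            result.append("- " + activity + " of " + line.lstrip(". "))
    return result


sample = """\
- production of coffee products:
  . ground coffee
  . soluble coffee
- blending of tea and maté
"""
print("\n".join(expand_bullets(sample)))
```

Lines that start with neither "-" nor "." (such as "This class includes:") are ignored, so the header does not need to be removed beforehand.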

CodePudding user response:

Try:

text = """\
This class includes:
- decaffeinating and roasting of coffee
- production of coffee products:
  . ground coffee
  . soluble coffee
  . extracts and concentrates of coffee
- manufacture of coffee substitutes
- blending of tea and maté
- manufacture of extracts and preparations based on tea or maté
- packing of tea including packing in tea-bags"""

out, first_level = [], ""
for line in map(str.strip, text.splitlines()):
    if line.startswith("-"):
        first_level = line.strip(" -").split()[0]
        out.append(line)
    elif line.startswith("."):
        out.append("- " + line.strip(" .") + " " + first_level)

print(*out, sep="\n")

Prints:

- decaffeinating and roasting of coffee
- production of coffee products:
- ground coffee production
- soluble coffee production
- extracts and concentrates of coffee production
- manufacture of coffee substitutes
- blending of tea and maté
- manufacture of extracts and preparations based on tea or maté
- packing of tea including packing in tea-bags

CodePudding user response:

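# Borrowing Andrej Kesely's idea of map(str.strip, ...)
import re

# `data` is the multi-line string defined in the first answer
new_text_as_list = []

for l in map(str.strip, data.splitlines()):
    if ":" in l:
        # lines with a colon (header and parent bullets) are kept as they are
        l = re.sub(r'(-.*of\b.*?):', r'\1:', l)
        new_text_as_list.append(l)
    elif l.startswith('.'):
        # turn the leading second-level marker "." into "-"
        l = re.sub(r'^\.', '-', l)
        new_text_as_list.append(l)
    else:
        new_text_as_list.append(l)

new_text = '\n'.join(new_text_as_list)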

- decaffeinating and roasting of coffee
- production of coffee products:
- ground coffee
- soluble coffee
- extracts and concentrates of coffee
- manufacture of coffee substitutes
- blending of tea and maté
- manufacture of extracts and preparations based on tea or maté
- packing of tea including packing in tea-bags
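
The result above turns the dots into dashes but does not carry the parent's activity over to the sub-items. A regex-based sketch that does both might look like this. The names PARENT, CHILD and flatten_with_regex are hypothetical, and it assumes the activity is whatever precedes the first " of " in a first-level bullet; when a parent has no " of ", its sub-items are emitted without a prefix.

```python
import re

# first-level bullet of the form "- <activity> of <something>"
PARENT = re.compile(r"^-\s*(?P<activity>.+?)\s+of\s+.*$")
# second-level bullet of the form ". <item>"
CHILD = re.compile(r"^\.\s*(?P<item>.+)$")


def flatten_with_regex(text):
    out, activity = [], None
    for line in map(str.strip, text.splitlines()):
        if line.startswith("-"):
            match = PARENT.match(line)
            # remember the activity, or reset it if the parent has no " of "
            activity = match.group("activity") if match else None
            out.append(line.rstrip(" :"))
            continue
        match = CHILD.match(line)
        if match:
            item = match.group("item")
            out.append(f"- {activity} of {item}" if activity else f"- {item}")
    return out


sample = """\
This class includes:
- production of coffee products:
  . ground coffee
  . soluble coffee
- blending of tea and maté
"""
print("\n".join(flatten_with_regex(sample)))
```

The lazy `.+?` in PARENT stops at the first " of ", so "decaffeinating and roasting of coffee" yields the activity "decaffeinating and roasting".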