Match pattern and put following lines into data structure-CodePudding

I have a data feed that I download on a regular bases into a csv. It looks like this

TABLE # 196712 / 9000_
>= 10   : 0.002
>= 5    : 0.001
>= 2    : 0.0005
>= 1    : 0.0002
>= 0.5  : 0.0001
>= 0.2  : 0.0001
>= 0.1  : 0.0001
>= 0.0001   : 0.0001
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
TABLE # 196715 / GBD
>= 25   : 0.01
>= 10   : 0.005
>= 5    : 0.0025
>= 0.1  : 0.001
>= 0.0005   : 0.005

I would like to parse the file and categorize the data into a dictionary, where the number after the hash is a unique id (the new dict key) and the following rows (starting with >=) are volumes plus associated penalty values.

s.th like this would work:

{196712: [(10,0.002),(5,0.001),(2,0.0005),(1,0.0002),(0.5,0.0001),(0.2,0.0001),(0.1,0.0001),(0.0001, 0.0001)], 
 196714: [(0.0001,5e-05)], 
 196715: [(25,0.01),(10,0.005),(5,0.0025),(0.1,0.001),(0.0005,0.005)]}

What I would do to filter it outside python would be a grep and get the following lines, however the varying number of lines between IDs makes it more complex. Any other suggested more convenient data structure could be used as well.

CodePudding user response：

Try:

s = """\
TABLE # 196712 / 9000_
>= 10   : 0.002
>= 5    : 0.001
>= 2    : 0.0005
>= 1    : 0.0002
>= 0.5  : 0.0001
>= 0.2  : 0.0001
>= 0.1  : 0.0001
>= 0.0001   : 0.0001
TABLE # 196714 / Dark
>= 0.0001   : 5e-05
TABLE # 196715 / GBD
>= 25   : 0.01
>= 10   : 0.005
>= 5    : 0.0025
>= 0.1  : 0.001
>= 0.0005   : 0.005"""

import re

out = {}
for table, data in re.findall(
    r"^TABLE # (\d ).*?\n(.*?)(?=^TABLE|\Z)", s, flags=re.M | re.S
):
    table = int(table)
    for a, b in re.findall(r"([\de. -] )\s*:\s*([\de. -] )", data):
        out.setdefault(table, []).append((float(a), float(b)))

print(out)

Prints:

{
    196712: [
        (10.0, 0.002),
        (5.0, 0.001),
        (2.0, 0.0005),
        (1.0, 0.0002),
        (0.5, 0.0001),
        (0.2, 0.0001),
        (0.1, 0.0001),
        (0.0001, 0.0001),
    ],
    196714: [(0.0001, 5e-05)],
    196715: [
        (25.0, 0.01),
        (10.0, 0.005),
        (5.0, 0.0025),
        (0.1, 0.001),
        (0.0005, 0.005),
    ],
}

CodePudding user response：

import fileinput
import sys
import re
from collections import defaultdict
from pprint import pprint

def parse_records(lines):
    for l in lines:
        if m := re.match(r'TABLE # (\d ) /.*', l):
            yield m.groups()[0]

        if m := re.match(r'>= (\S )\s : (.*)', l):
            yield m.groups()


result = defaultdict(list)
record_id = None
for l in parse_records(fileinput.input()):
    match l:
        case (volume, penality):
            result[record_id].append((float(volume), float(penality)))
        case id:
            record_id=int(id)

print("{")
for key, value in result.items():
    print(f" {key}: {value}")
print("}")

Run it:

% python3 t2.py < input.txt
{
 196712: [(10.0, 0.002), (5.0, 0.001), (2.0, 0.0005), (1.0, 0.0002), (0.5, 0.0001), (0.2, 0.0001), (0.1, 0.0001), (0.0001, 0.0001)]
 196714: [(0.0001, 5e-05)]
 196715: [(25.0, 0.01), (10.0, 0.005), (5.0, 0.0025), (0.1, 0.001), (0.0005, 0.005)]
}
%