I have a list similar to this one below, but much larger.
mylist = [' 12345678912 ST',
' Halterung für Fortlüfterhaube',
' Material/Werkstoff: Metall-Lackiert',
' **Beginn Zeichnung**',
' 98765432164 ST',
' Klappe, komplett',
' **Beginn Zeichnung**',
' 74563254671 ST',
' Sieb Außen-Dm 145 x 0,8mm',
' Versatz Dm 122 x 5mm tief',
' Material: Niro 1.4301 - Lochblech Dm1/LA1,5mm',
' 90876487921 M',
' Gista-Profil',
' mit Moosgummihohlkammer-Dichtung (EPDM)',
' Farbe: schwarz, Klemmbereich: 1-2 mm',
' Material: EPDM, 60 /- 5 Shore A,',
' 64352647971 ST',
' Winkelblech für Frost Erdungskontakt (AB 434 l)',
' für TGr. 78.2',
' Winkelblech für Frost Erdungskontakt (AB 434 l)',
' für TGr. 78.2',
' für TGr. 78.2',
' Material/Werkstoff: X5CrNi 1810']
The goal for me is to extract the Material name (if present) for each ID in the list (along with the ID itself).
I've used the following code:
import re

Materials = []
iteration_list = mylist
for item in iteration_list:
    if str(item).strip().startswith("Material"):
        material_index = iteration_list.index(item)
        ID = "".join(re.findall(r'\d+', str(iteration_list[material_index - 1])))
        if len(ID) != 11:
            ID = "".join(re.findall(r'\d+', str(iteration_list[material_index - 2])))
            if len(ID) != 11:
                ID = "".join(re.findall(r'\d+', str(iteration_list[material_index - 3])))
                if len(ID) != 11:
                    ID = "".join(re.findall(r'\d+', str(iteration_list[material_index - 4])))
                    if len(ID) != 11:
                        ID = "".join(re.findall(r'\d+', str(iteration_list[material_index - 5])))
                        if len(ID) != 11:
                            ID = "".join(re.findall(r'\d+', str(iteration_list[material_index - 6])))
        Materials.extend([ID, item])
Which produces this:
['12345678912',
' Material/Werkstoff: Metall-Lackiert',
'74563254671',
' Material: Niro 1.4301 - Lochblech Dm1/LA1,5mm',
'90876487921',
' Material: EPDM, 60 /- 5 Shore A,',
'64352647971',
' Material/Werkstoff: X5CrNi 1810']
So I first look for the Material line and then try to extract the corresponding ID. The problem is that the Material line sits a varying number of lines below its ID, so the chain of if statements needed to find the ID relative to the material's index gets complicated and ugly.
My question: is it possible to simply find the ID (an 11-digit number) above each Material line, without writing many if statements to cover every possible offset? (The ID is always 11 digits long.)
CodePudding user response:
Instead of searching backward from the Material line to the previous ID line, remember each ID in a variable as you encounter it. Then, when you reach a Material line, the mid variable holds the most recent ID before it.
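import re

materials = []
mid = None  # most recent ID seen so far (None until the first ID line)
for line in mylist:
    line = line.strip()
    m = re.match(r'\d{11}\b', line)  # an ID line starts with exactly 11 digits
    if m:
        mid = m.group()
    elif line.startswith("Material"):
        materials.append([mid, line])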
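For illustration, the same idea can be wrapped in a small function and run on a shortened copy of the list from the question. This is only a sketch: the names extract_materials, current_id and sample are made up for this example, and it assumes that a Material line appearing before any ID should simply be skipped.

```python
import re

def extract_materials(lines):
    """Pair each 'Material...' line with the most recent 11-digit ID above it."""
    pairs = []
    current_id = None  # assumption: no ID has been seen yet
    for raw in lines:
        line = raw.strip()
        m = re.match(r'\d{11}\b', line)
        if m:
            # Remember the ID; it applies to every line until the next ID.
            current_id = m.group()
        elif line.startswith("Material") and current_id is not None:
            pairs.append([current_id, line])
    return pairs

# Shortened sample taken from the question's list.
sample = [
    ' 12345678912 ST',
    ' Halterung für Fortlüfterhaube',
    ' Material/Werkstoff: Metall-Lackiert',
    ' **Beginn Zeichnung**',
    ' 98765432164 ST',
    ' Klappe, komplett',
    ' 74563254671 ST',
    ' Sieb Außen-Dm 145 x 0,8mm',
    ' Material: Niro 1.4301 - Lochblech Dm1/LA1,5mm',
]

result = extract_materials(sample)
print(result)
```

Because the loop walks the list only once and never looks backward, it does not matter how many lines separate an ID from its Material line. IDs without a Material line (98765432164 here) produce no entry.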
CodePudding user response:
Along the lines of Barmar's answer:
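import re

data = []
for line in mylist:
    if (m := re.match(r'\s*(\d{11})', line)):
        # New ID: start an entry with an empty list of materials
        data.append((m.group(1), []))
    elif data and (m := re.match(r'^\s*Material:\s*(.+)$', line)):
        # Matches only the plain "Material:" label, not "Material/Werkstoff:"
        data[-1][1].append(m.group(1))
# Keep only the IDs that have at least one material
data = [d for d in data if d[1]]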
Result:
[('74563254671', ['Niro 1.4301 - Lochblech Dm1/LA1,5mm']),
('90876487921', ['EPDM, 60 /- 5 Shore A,'])]
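The result above has only two entries because the pattern matches the "Material:" label but not "Material/Werkstoff:". If those lines should be captured as well, one possible variant makes the "/Werkstoff" part optional. This is a sketch under that assumption; ID_RE, MATERIAL_RE and materials_by_id are hypothetical names, and returning a dict keyed by ID is a choice made here, not something the question requires.

```python
import re

ID_RE = re.compile(r'\s*(\d{11})\b')
# Assumption: the label is either "Material:" or "Material/Werkstoff:".
MATERIAL_RE = re.compile(r'\s*Material(?:/Werkstoff)?:\s*(.+)')

def materials_by_id(lines):
    """Map each ID to the material names found below it (IDs without one are omitted)."""
    result = {}
    current_id = None
    for line in lines:
        if (m := ID_RE.match(line)):
            current_id = m.group(1)
        elif current_id is not None and (m := MATERIAL_RE.match(line)):
            result.setdefault(current_id, []).append(m.group(1).strip())
    return result

# Shortened sample taken from the question's list.
sample = [
    ' 12345678912 ST',
    ' Halterung für Fortlüfterhaube',
    ' Material/Werkstoff: Metall-Lackiert',
    ' 98765432164 ST',
    ' Klappe, komplett',
    ' 90876487921 M',
    ' Gista-Profil',
    ' Material: EPDM, 60 /- 5 Shore A,',
]

by_id = materials_by_id(sample)
print(by_id)
# {'12345678912': ['Metall-Lackiert'], '90876487921': ['EPDM, 60 /- 5 Shore A,']}
```

The assignment expressions (:=) need Python 3.8 or later, as in the answer above.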