I have the following string from which I want to extract data:
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
- Every variable I want to extract starts with \n
- The value I want comes after a colon ':' followed by more than one dot
- When a line doesn't contain a colon followed by dots, I don't want to extract that value.
For example, my preferred output looks like:
LOA = 189.9
LBP = 176.0
BM = 26.4
DM = 9.2
CodePudding user response:
import re
text_example = '\nExample text \nTECHNICAL PARTICULARS\nLength oa: ...............189.9m\nLength bp: ........176m\nBreadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
# capture all the characters BEFORE the ':' character
variables = re.findall(r'(.*?):', text_example)
# matches all floats and integers (does not account for minus signs)
values = re.findall(r'(\d+(?:\.\d+)?)', text_example)
# zip into a dictionary (this assumes both regex expressions return the same number of results)
result = dict(zip(variables, values))
print(result)
--> {'Length oa': '189.9', 'Breadth moulded': '26.4', 'Length bp': '176', 'Depth moulded to main deck': '9.2'}
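Note that the two findall calls above run independently, so a stray colon or an extra number anywhere in the text would shift the pairing. As a hedged alternative (a minimal sketch, assuming each spec sits on its own line and the value follows at least two dots), the label and the value can be captured by one pattern so they cannot drift apart. The names `pattern` and `paired` are only illustrative.

```python
import re

text_example = (
    '\nExample text \nTECHNICAL PARTICULARS\n'
    'Length oa: ...............189.9m\nLength bp: ........176m\n'
    'Breadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
)

# Per line: label (no colon or newline), a colon, optional whitespace,
# two or more dots, then an integer or decimal number.
pattern = re.compile(r'^([^:\n]+):\s*\.{2,}\s*(\d+(?:\.\d+)?)', re.MULTILINE)

# Each match is a (label, value) tuple, so the pairing is guaranteed.
paired = {label.strip(): float(value) for label, value in pattern.findall(text_example)}
print(paired)
```

Lines without a colon followed by dots, such as 'Example text' and 'TECHNICAL PARTICULARS', produce no match and are skipped.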
CodePudding user response:
You can build a single regex with capture groups and index into the list of matches:
re.findall(r'(\\n|\n)([A-Za-z\s]*)(?:(\:\s*\.+))(\d*\.*\d*)',text_example)[2]
('\n', 'Breadth moulded', ': .......', '26.4')
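Because `\s` also matches newlines, the label group in the pattern above can run across several lines: the first match picks up 'Example text' and 'TECHNICAL PARTICULARS' along with 'Length oa'. A possible variation, sketched here under the assumption that labels contain only letters and plain spaces, restricts the label to one line and prints the output in the form the question asks for. The `SHORT_NAMES` table is a hypothetical mapping introduced for illustration; it is not part of the original answer.

```python
import re

text_example = (
    '\nExample text \nTECHNICAL PARTICULARS\n'
    'Length oa: ...............189.9m\nLength bp: ........176m\n'
    'Breadth moulded: .......26.4m\nDepth moulded to main deck: ....9.2m\n'
)

# Hypothetical lookup from the labels in the text to the short names in the question.
SHORT_NAMES = {
    'Length oa': 'LOA',
    'Length bp': 'LBP',
    'Breadth moulded': 'BM',
    'Depth moulded to main deck': 'DM',
}

# '[A-Za-z ]+' allows letters and spaces only, so a label cannot span lines.
matches = re.findall(r'\n([A-Za-z ]+):\s*\.{2,}(\d+(?:\.\d+)?)', text_example)

# Convert each value to float and swap in the short name where one is known.
report = [f'{SHORT_NAMES.get(label, label)} = {float(value)}' for label, value in matches]
for line in report:
    print(line)
```

Labels missing from the lookup table fall back to the full label text, so an unexpected entry is still printed instead of being silently dropped.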