I have a large text with words and numbers. The text contains multiple lines like this:
Linear regression is done. value: 123.235
Of course, the number changes throughout the document. The problem is that I really need those numbers, but it would take ages to go through 100,000 lines and collect them all by hand. I tried regex, but I am not good at it. Can anyone help?
import re
file = open('filename.txt', 'r')
x = re.findall("value", file)
print(value)
Would be nice if you could help me get all numbers after value.
CodePudding user response:
We can use re.findall as follows:
import re
with open('filename.txt', 'r') as file:
    data = file.read()
# Capture the integer or decimal number that follows "value:"
nums = re.findall(r'\bvalue:\s*(\d+(?:\.\d+)?)', data)
print(nums)
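To check a pattern like this without touching the real file, it can be run against a small in-memory string first. The sketch below assumes the numbers are unsigned and that floats are wanted rather than strings; the sample text and the names VALUE_RE and extract_values are made up for illustration.

```python
import re

# Hypothetical names; the pattern matches "value:" followed by an
# unsigned integer or decimal number.
VALUE_RE = re.compile(r'\bvalue:\s*(\d+(?:\.\d+)?)')

def extract_values(text):
    """Return every number that follows 'value:' in text, as floats."""
    return [float(m) for m in VALUE_RE.findall(text)]

# Made-up sample text standing in for the real file contents.
sample = (
    "Linear regression is done. value: 123.235\n"
    "some unrelated line\n"
    "Linear regression is done. value: 42\n"
)
print(extract_values(sample))  # [123.235, 42.0]
```

Because the pattern has exactly one capturing group, re.findall returns only the captured number and not the whole match.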
CodePudding user response:
Given the following sample.txt file, which contains Linear regression is done. value: <value> 6 times:
sample.txt:
Linear regression is done. value: 000.00 ssdfsdfsdfhklshdfkhskldhflsdf
Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium
Linear regression is done. value: 123.12 doloremque, Linear regression is done. value: 0.0123 eaque
dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem
ipsum quia Linear regression is done. value: 234.23 dolor sit amet, consectetur, adipisci velit, sed
quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad
minima veniam, quis nostrum exercitationem ullam corporis suscipit Linear regression is done. value: 345.34 laboriosam,
nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam
nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?
lskdfhlshdfl Linear regression is done. value: 456.45
This is one way to do it:
import re

# An optional sign, then either digits with an optional fraction or a bare fraction
REGEX = r'Linear regression is done. value: [+-]?([0-9]+\.?[0-9]*|\.[0-9]+)'

if __name__ == '__main__':
    numbers_in_text = []
    with open('sample.txt', 'r') as file:
        for line in file:
            # A line may contain several matches; findall returns them all
            numbers_in_line = re.findall(REGEX, line)
            numbers_in_text.extend(numbers_in_line)
    print(numbers_in_text)
    assert 6 == len(numbers_in_text), 'It is not reading all the numbers'
prints:
['000.00', '123.12', '0.0123', '234.23', '345.34', '456.45']
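One thing to watch for: when an optional sign sits outside the capturing group, re.findall matches it but returns the number without it. If the values can be negative, a variant along these lines might help. It is only a sketch, assuming the sign should be kept and the results converted to floats; SIGNED_RE and iter_values are hypothetical names and the input lines are made up.

```python
import re

# Hypothetical pattern: the optional sign is inside the capturing group,
# so it is part of what gets returned.
SIGNED_RE = re.compile(r'value:\s*([+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+))')

def iter_values(lines):
    """Yield each number after 'value:' as a float, one line at a time."""
    for line in lines:
        for match in SIGNED_RE.finditer(line):
            yield float(match.group(1))

# Made-up input; an open file object could be passed instead of a list,
# since iterating over a file also yields lines.
lines = [
    "Linear regression is done. value: -12.5 foo, Linear regression is done. value: .75",
    "no numbers here",
    "Linear regression is done. value: +3",
]
print(list(iter_values(lines)))  # [-12.5, 0.75, 3.0]
```

Reading line by line through a generator keeps memory use flat, which may matter for a file with 100,000 lines or more.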