Extracting numbers from a line in python

Time:04-12

My input file looks something like the sample below, and I want to extract the numbers 43.038, 10149.2 and 1521.42 (they always have spaces around them and can appear in different columns). Other integers might exist before or after these numbers, for example in X2 and ABC[12]. The output needs to be a float.

check 43.038 -hi X -max [hello {ABC[12]}]
check -hi X  -max 232.00 EFG
check -hi X2 -max -add 10149.2 [hello {XYZ}]
check -hi Y 1521.42 [hello {PQR[3]}]

I tried the following:

def updates (self, fileHandler):
   for line in fileHandler:
      line_new = line.strip('\n')
      ll = line_new.split()
      l = len(ll)

      try: 
          number = float(ll[1])
      except:  
          try: 
            for j in range (l-1):
              if (ll[j] == "-max"):
                number = float(ll[j + 1])
          except:
            if (ll[l-2] == "[hello"):
              number = float(ll[l-3])       
            else:
              number = float(ll[l-2])

The first line should be handled by ll[1], second line by the "-max" condition, and so on. The last two conditions don't seem to be working & number gets the default value of 0. Is there a better way to find these numbers in these lines using regex?

CodePudding user response:

Following @Barmar's comment, you can do the following to extract the numbers. In my code, it is assumed your text is in the file "blankpaper.txt".

import re

some_list = []
with open("blankpaper.txt") as f:
    for line in f:
        some_list.append(float(re.search(r'\b\d+\.\d+\b', line)[0]))

print(some_list)

Output

[43.038, 232.0, 10149.2, 1521.42]

In the regular expression pattern r'\b\d+\.\d+\b', \b matches the empty string at the beginning or end of a "word", \d+ matches one or more decimal digits, and \. matches a literal decimal point.
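One detail worth checking is whether the dot is escaped. The small sketch below compares the two spellings on a made-up line (the line itself is hypothetical, not from the question's file) that contains a three-digit index but no decimal number:

```python
import re

# Hypothetical sample line: a three-digit index in brackets, no decimal number.
line = "check -hi X ABC[123]"

unescaped = re.search(r'\b\d+.\d+\b', line)   # "." matches any character
escaped = re.search(r'\b\d+\.\d+\b', line)    # "\." matches only a literal "."

print(unescaped[0] if unescaped else None)
print(escaped[0] if escaped else None)
```

With the unescaped dot, the middle digit of 123 is consumed by "." and the integer is reported as a match. With the escaped dot there is no match and search returns None.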

search looks for the first location where the pattern matches. If there is a match, a match object is returned, which can be used to get the string that it matched. If there is no match, it returns None. You may wish to check the return value before processing it. For example,

m = re.search(r'\b\d+\.\d+\b', line)
if m:
    some_list.append(float(m[0]))
# optional else

If you do not want to be limited to the first match on each line, you can use findall instead with the same pattern to extract every floating-point number on a line. findall returns a list of matched strings rather than a match object, so you drop the [0] subscript and use extend rather than append:

m = re.findall(r'\b\d+\.\d+\b', line)
if m:
    some_list.extend(float(i) for i in m)
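Since the question says the numbers always have spaces around them, a stricter variant is possible. The sketch below is only one way to do it: it swaps the \b word boundaries for whitespace lookarounds, so a decimal inside brackets or braces is ignored. The function name extract_spaced_floats and the sample line are made up for illustration.

```python
import re

# (?<!\S) : not preceded by a non-space character (start of line or whitespace)
# (?!\S)  : not followed by a non-space character (end of line or whitespace)
SPACED_FLOAT = re.compile(r'(?<!\S)\d+\.\d+(?!\S)')


def extract_spaced_floats(lines):
    """Collect every whitespace-delimited decimal number found in the lines."""
    numbers = []
    for line in lines:
        numbers.extend(float(s) for s in SPACED_FLOAT.findall(line))
    return numbers


# Hypothetical line: 232.00 stands alone, 1.5 is buried inside brackets.
sample = ["check -hi X2 -max 232.00 {ABC[1.5]}"]

print(extract_spaced_floats(sample))
print(re.findall(r'\b\d+\.\d+\b', sample[0]))
```

On this line the lookaround version keeps only 232.00, while the \b version also picks up the 1.5 inside the brackets. Which behaviour is wanted depends on the real file.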

CodePudding user response:

If there are always spaces around the numbers, you can try to convert each whitespace-separated word in the file to a float. If the conversion succeeds, assign the result to the number variable. If it fails, do nothing.

with open('in.txt') as in_file:
    for line in in_file:
        for token in line.strip().split():
            try:
                number = float(token)
                print(number)
            except ValueError:
                continue

With the given input file, this outputs:

43.038
232.0
10149.2
1521.42
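One caveat with this approach: float() also accepts bare integers and strings such as "nan" or "inf", so a standalone token like 7 would be picked up too. If that matters for your file, a possible refinement (a sketch only, with a made-up function name floats_in_line) is to require a decimal point in the token before converting:

```python
def floats_in_line(line):
    """Return the whitespace-separated tokens that contain a '.' and parse as floats."""
    numbers = []
    for token in line.split():
        if "." not in token:
            continue  # skips bare integers, "nan", "inf", and plain words
        try:
            numbers.append(float(token))
        except ValueError:
            pass  # token contains a "." but is not a number
    return numbers


# Hypothetical line with a standalone integer and a "nan" token.
print(floats_in_line("check -hi 7 -max 232.00 nan EFG"))
print(floats_in_line("check -hi Y 1521.42 [hello {PQR[3]}]"))
```

Note that this still accepts signed values such as -3.5, which may or may not be what you want.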