Regular Expressions: Identify physical quantities, i.e. numbers followed by their SI/Imperial units

Time:11-11

I am trying to write a regex pattern to identify physical quantities in a string, i.e. numbers followed by their SI/Imperial units.

The pattern in which they occur in a string could vary a lot.

Some examples:

'510 PS (375 kW / 503 bhp) at 8500 rpm' , 
'100 Litre (26.4 Gallon US / 22 Gallon Imperial)', 
'(bore) 82mm x 62mm (stroke)-> ~3929cm³',
'115 bhp / 86 kW\n\t\t\t @ 6,000 rpm','442lb ft @ 3000-7500rpm',
'356 Nm (36.3 mkg / 263 lb-ft) at 5000 rpm (~ 2.3)' 

et cetera; you can imagine other such odd patterns.

The main aim is to always capture a number followed by its unit, for example: '510 PS', '3000-7500rpm', '26.4 Gallon US', '86 kW', '263 lb-ft', etc.

So far I have tried a few patterns:

re.findall(r"\d+\.*?\d+?\s?\W?\s?[a-zA-Z\s+]+",string),
re.findall(r"\d+\.+\d+?\s*\W*\s*[a-zA-Z\s+]*",string),
re.findall(r'\d+\.?\d*?\s?\w+\W?\w+',string),

Most probably a single regex cannot cover all possible edge cases at once, so one option is to write several patterns, run them all on a string, and combine the matches into one list for further processing.
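The "several patterns, merged results" idea could be sketched roughly as follows. This is only an illustration: the two patterns, the PATTERNS list and the find_quantities helper are hypothetical names made up here, not something from the question, and they cover just ranges such as '3000-7500rpm' and single numbers followed by a one-word unit.

```python
import re

# Hypothetical patterns, ordered from most to least specific.
PATTERNS = [
    # a numeric range followed by a unit, e.g. "3000-7500rpm"
    re.compile(r"\d+(?:[.,]\d+)?\s?-\s?\d+(?:[.,]\d+)?\s?[A-Za-z]+"),
    # a single number followed by a unit, e.g. "86 kW" or "64.5inches"
    re.compile(r"\d+(?:[.,]\d+)?\s?[A-Za-z]+"),
]


def find_quantities(text):
    """Run every pattern and merge the matches, skipping overlapping spans."""
    taken = []  # (start, end, matched text) of the matches accepted so far
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Accept a match only if it does not overlap an earlier one,
            # so the more specific patterns listed first take priority.
            if all(m.end() <= s or m.start() >= e for s, e, _ in taken):
                taken.append((m.start(), m.end(), m.group()))
    # Return the matched texts in their order of appearance.
    return [matched for _, _, matched in sorted(taken)]


print(find_quantities('442lb ft @ 3000-7500rpm'))
```

Because earlier patterns win on overlap, '7500rpm' is not reported a second time once the range pattern has claimed '3000-7500rpm'. Multi-word units such as 'lb ft' or 'Gallon US' would need further patterns added to the list.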

Any help is much appreciated. Point me in the right direction. Thanks a ton!

EDIT:

tests = ['209.5 in | 5320 mm.','4,868 mm (191.65 in)','64.5inches','front 15.6x1.4 in/rear 13.7x1.3 in',
 '27.7 US gal (23.1 UK/gal)','90 Litre (23.8 Gallon US / 19.8 Gallon Imperial)','n.b.'," ",
 '103.1 inches - Track : Front 65.7 inches / Rear 63.8 inches','95 mm x 76.4 mm (3.74 x 3.01 in)',
 '510 PS (375 kW / 503 bhp) at 8500 rpm' , '100 Litre (26.4 Gallon US / 22 Gallon Imperial)', 
'(bore) 82mm x 62mm (stroke)-> ~3929cm³','115 bhp / 86 kW\n\t\t\t @ 6,000 rpm','442lb ft @ 3000-7500rpm',
'356 Nm (36.3 mkg / 263 lb-ft) at 5000 rpm (~ 2.3 inch)','65.0 mm (2.6 in) / 58.8 mm (2.3 in)',
 '339 BHP (249.504 KW)  @ 7000 RPM','690 Nm (70.4 mkg / 509 lb-ft) at 5500 rpm (~ 540 PS)','500Nm','8,500 rpm',
 '11.5 : 1','16,2 :1','0-100 km/h (0-62 mph)','~331.45 kph / 206 mph','317 km/h | 197.016 mph (limited)',
 '668PS-per-tonne (659bhp)','2 kg/PS = 4.6 lbs/hp = 360 kW/T','0.21 bhp / kg','6.1 lb/CV',
 '3.91:1 2.29:1 1.58:1 1.18:1 0.94:1 0.79:1 0.67:1','245/45r17','15.6 litres/100kms (18.1 mpg)']

RESULTS:

209.5 in | 5320 mm. -> ['209.5 in ', '5320 mm']
4,868 mm (191.65 in) -> ['4,868 mm ', '191.65 in']
64.5inches -> ['64.5inches']
front 15.6x1.4 in/rear 13.7x1.3 in -> ['15.6x1', '4 in', '13.7x1', '3 in']
27.7 US gal (23.1 UK/gal) -> ['27.7 US gal ', '23.1 UK']
90 Litre (23.8 Gallon US / 19.8 Gallon Imperial) -> ['90 Litre ', '23.8 Gallon US ', '19.8 Gallon Imperial']
n.b. -> []
  -> []
103.1 inches - Track : Front 65.7 inches / Rear 63.8 inches -> ['103.1 inches ', '65.7 inches ', '63.8 inches']
95 mm x 76.4 mm (3.74 x 3.01 in) -> ['95 mm x 76', '4 mm ', '3.74 x 3', '01 in']
510 PS (375 kW / 503 bhp) at 8500 rpm -> ['510 PS ', '375 kW ', '503 bhp', '8500 rpm']
100 Litre (26.4 Gallon US / 22 Gallon Imperial) -> ['100 Litre ', '26.4 Gallon US ', '22 Gallon Imperial']
(bore) 82mm x 62mm (stroke)-> ~3929cm³ -> ['82mm x 62mm ', '3929cm³']
115 bhp / 86 kW
             @ 6,000 rpm -> ['115 bhp ', '86 kW\n\t\t\t ', '6,000 rpm']
442lb ft @ 3000-7500rpm -> ['442lb ft ', '3000', '7500rpm']
356 Nm (36.3 mkg / 263 lb-ft) at 5000 rpm (~ 2.3 inch) -> ['356 Nm ', '36.3 mkg ', '263 lb', '5000 rpm ', '2.3 inch']
65.0 mm (2.6 in) / 58.8 mm (2.3 in) -> ['65.0 mm ', '2.6 in', '58.8 mm ', '2.3 in']
339 BHP (249.504 KW)  @ 7000 RPM -> ['339 BHP ', '249.504 KW', '7000 RPM']
690 Nm (70.4 mkg / 509 lb-ft) at 5500 rpm (~ 540 PS) -> ['690 Nm ', '70.4 mkg ', '509 lb', '5500 rpm ', '540 PS']
500Nm -> ['500Nm']
8,500 rpm -> ['8,500 rpm']
11.5 : 1 -> ['11.5 ', '1']
16,2 :1 -> ['16,2 ', '1']
0-100 km/h (0-62 mph) -> ['0', '100 km', '0', '62 mph']
~331.45 kph / 206 mph -> ['331.45 kph ', '206 mph']
317 km/h | 197.016 mph (limited) -> ['317 km', '197.016 mph ']
668PS-per-tonne (659bhp) -> ['668PS', '659bhp']
2 kg/PS = 4.6 lbs/hp = 360 kW/T -> ['2 kg', '4.6 lbs', '360 kW']
0.21 bhp / kg -> ['0.21 bhp ']
6.1 lb/CV -> ['6.1 lb']
3.91:1 2.29:1 1.58:1 1.18:1 0.94:1 0.79:1 0.67:1 -> ['3.91', '1 2', '29', '1 1', '58', '1 1', '18', '1 0', '94', '1 0', '79', '1 0', '67', '1']
245/45r17 -> ['245', '45r17']
15.6 litres/100kms (18.1 mpg) -> ['15.6 litres', '100kms ', '18.1 mpg']

These are sample tests that the regex pattern should hopefully be able to handle. I tried @Daweo's approach and it fails some of them. Can you please suggest any improvements so that it passes those tests too? Thanks a ton!

CodePudding user response:

I would do it the following way:

import string
import re
# escape all ASCII punctuation so it can be placed inside a character class
punct = re.escape(string.punctuation)
tests = ['510 PS (375 kW / 503 bhp) at 8500 rpm' , 
'100 Litre (26.4 Gallon US / 22 Gallon Imperial)', 
'(bore) 82mm x 62mm (stroke)-> ~3929cm³',
'115 bhp / 86 kW\n\t\t\t @ 6,000 rpm','442lb ft @ 3000-7500rpm',
'356 Nm (36.3 mkg / 263 lb-ft) at 5000 rpm (~ 2.3)']
for t in tests:
    # a number, then everything up to the next punctuation character
    print(re.findall(f'\\d+[.,]?\\d*[^{punct}]*',t))

output

['510 PS ', '375 kW ', '503 bhp', '8500 rpm']
['100 Litre ', '26.4 Gallon US ', '22 Gallon Imperial']
['82mm x 62mm ', '3929cm']
['115 bhp ', '86 kW\n\t\t\t ', '6,000 rpm']
['442lb ft ', '3000', '7500rpm']
['356 Nm ', '36.3 mkg ', '263 lb', '5000 rpm ', '2.3']

Explanation: I used the pattern \d+[.,]?\d* to describe a number, i.e. one or more digits, optionally . or , and then zero or more digits; feel free to change it to a pattern that matches your own definition of a number. After the number, grab everything up to the first punctuation character, where a punctuation character is one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ (the contents of string.punctuation).
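As a rough illustration of swapping in a different definition of a number, the sketch below requires digits after the separator and also stops the trailing part at the next digit, so that a value like 76.4 is no longer split in two. The names NUMBER, TAIL and find_with_number are made up for this example, and the stricter tail is an assumption about what is wanted, not part of the approach above.

```python
import re
import string

punct = re.escape(string.punctuation)

# Assumed definition of a number: digits, optionally one . or , followed by more digits.
NUMBER = r"\d+(?:[.,]\d+)?"
# After the number, take everything up to the next punctuation character OR digit,
# so the following number starts its own match instead of being swallowed.
TAIL = rf"[^{punct}\d]*"
PATTERN = re.compile(NUMBER + TAIL)


def find_with_number(text):
    """Return number-plus-unit candidates with surrounding whitespace removed."""
    return [m.strip() for m in PATTERN.findall(text)]


print(find_with_number('95 mm x 76.4 mm (3.74 x 3.01 in)'))
# → ['95 mm x', '76.4 mm', '3.74 x', '3.01 in']
```

The stray 'x' in '95 mm x' and '3.74 x' shows that dimension separators still need separate handling; this only addresses numbers being cut at the decimal point.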

Note that there is trailing whitespace in some cases, but it can easily be removed, for example with the .rstrip() method:

found = ['115 bhp ', '86 kW\n\t\t\t ', '6,000 rpm']
cleaned = [i.rstrip() for i in found]
print(cleaned)

output

['115 bhp', '86 kW', '6,000 rpm']

Note that there are some false positives (for example, the last element of the last row), but they are easy to detect after cleaning, as it should suffice to check whether the last character is a digit. I am not sure why the superscript 3 is missing, as it is not one of string.punctuation.
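The last-character check could look something like the sketch below. The helper name drop_bare_numbers is invented for this example. One assumption worth flagging: it compares against ASCII digits only, because str.isdigit() also returns True for superscripts such as ³ and would wrongly discard '3929cm³'.

```python
import string


def drop_bare_numbers(matches):
    """Strip trailing whitespace and drop matches that end in an ASCII digit."""
    cleaned = [m.rstrip() for m in matches]
    # string.digits is '0123456789'; a match ending in one of these has no unit.
    return [m for m in cleaned if m and m[-1] not in string.digits]


found = ['356 Nm ', '36.3 mkg ', '263 lb', '5000 rpm ', '2.3']
print(drop_bare_numbers(found))
# → ['356 Nm', '36.3 mkg', '263 lb', '5000 rpm']
```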
