How to extract measurements and/or units from texts?

Time:09-21

Situation

Given are some titles containing measurements and units in various combinations. I want to extract only the measurements whose unit is m, together with that unit.

1 Kabel0,3m
2 Kabel,0,3 m m 
3 Kabelx,0.3 m
4 Kabel 1m,
5 Kabel 1 m/
6 Kabel 1 HW-Y 2.0 m LAN/LAN RJ45/2xRJ45 Homeway f.2 unabh.Datennetz-Anwend.blau
7 Rundleitung 0,24 mm 2/ 250 m,8p   
8 Televes TV/RF-Empfängeranschlußkabel 10, 0 m weiss

Best try

I am still struggling with line 7, where the mm should be excluded.

(?P<match>(?P<value>\d+(?:\.|,|)\s*\d*)\s*(?<unit>m))

https://regex101.com/r/5yH4GN/1

Expected result

I hope somebody can give me a hint to get closer to a solution.

match      value    unit
0,3m       0,3      m
0,3 m      0,3      m
0.3 m      0.3      m
1m         1        m
1 m        1        m
2.0 m      2.0      m
250 m      250      m
10, 0 m    10, 0    m

CodePudding user response:

For the unit m, match the decimal part optionally with \d+(?:[.,]\s*\d+)? so that the digits after the dot or comma are required whenever a separator is present.

You can put the dot and comma in a character class [.,] and add a word boundary \b after the m, so that for example mm is not matched.

(?P<match>(?P<value>\d+(?:[.,]\s*\d+)?)\s*(?<unit>m\b))
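To try the pattern outside a regex tester, here is a minimal Python sketch run against the sample titles from the question. It assumes Python's built-in re module, which only accepts the (?P<name>...) spelling for named groups, so the unit group is written as (?P<unit>...) here. The names extract_metres and to_float are hypothetical helpers, not part of the original answer, and the conversion to float is an optional extra.

```python
import re

# Same pattern as above; Python's re module needs (?P<unit>...) for the named group.
PATTERN = re.compile(r"(?P<match>(?P<value>\d+(?:[.,]\s*\d+)?)\s*(?P<unit>m\b))")

TITLES = [
    "Kabel0,3m",
    "Kabel,0,3 m m",
    "Kabelx,0.3 m",
    "Kabel 1m,",
    "Kabel 1 m/",
    "Kabel 1 HW-Y 2.0 m LAN/LAN RJ45/2xRJ45 Homeway f.2 unabh.Datennetz-Anwend.blau",
    "Rundleitung 0,24 mm 2/ 250 m,8p",
    "Televes TV/RF-Empfängeranschlußkabel 10, 0 m weiss",
]


def extract_metres(text):
    """Return (match, value, unit) tuples for every measurement in metres."""
    return [
        (m.group("match"), m.group("value"), m.group("unit"))
        for m in PATTERN.finditer(text)
    ]


def to_float(value):
    """Hypothetical helper: turn '10, 0' or '0,3' into a float."""
    return float(re.sub(r"\s+", "", value).replace(",", "."))


for title in TITLES:
    for match, value, unit in extract_metres(title):
        print(f"{match!r:12} {value!r:10} {unit!r:5} {to_float(value)}")
```

For line 7, the word boundary makes the attempt at "0,24 mm" fail, because the first m is directly followed by another word character, so only "250 m" is reported.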

