How to find a specific piece of text in a string with regex and python

Time:11-12

Every cell of a column contains a text with detailed information about a car, and I need to extract part of it. In my case that is the fuel consumption and the CO2 emissions.

The strings I get look like this:

cell 1 = 17.160 km, 80 kW (109 PS)Limousine, Autogas (LPG), Automatik, HU Neu, 2/3 Türenca. 5,0 l/100km (komb.), ca. 116 g CO₂/km (komb.)

cell 2 = EZ 10/2018, 12.900 km, 80 kW (109 PS)Limousine, Unfallfrei, Hybrid (Benzin/Elektro), Halbautomatik, HU Neu, ca. 5,9 l/100km (komb.), ca. 134 g CO₂/km (komb.) ... and so on

So from cell 1 I need: 5,0 l/100km and 116 g CO2/km

and from cell 2: 5,9 l/100km and 134 g CO2/km

I tried the following code examples, but nothing worked:

    pattern_z = re.compile("[a-z]+.?\s?[0-9]+\s?[a-z]?\s[A-Z]+")
    pattern_z = re.compile("^[ac]+\s?[CO]$")
    pattern_z = re.compile(r'[0-9]+.[g]?')
    

After each "pattern_z" assignment I tried

    co = pattern_z.search(i)
    cox = co.group()

but none of them worked.

I would appreciate any help.

CodePudding user response:

Use

(\d+(?:,\d+)?\s*l/\d+km).*?(\d+\s?g\s*CO[₂2]/km)


EXPLANATION

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      ,                        ','
--------------------------------------------------------------------------------
      \d+                      digits (0-9) (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    l/                       'l/'
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    km                       'km'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    \s?                      whitespace (\n, \r, \t, \f, and " ")
                             (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    g                        'g'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    CO                       'CO'
--------------------------------------------------------------------------------
    [₂2]                     any character of: '₂', '2'
--------------------------------------------------------------------------------
    /km                      '/km'
--------------------------------------------------------------------------------
  )                        end of \2
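
The same breakdown can also be kept next to the pattern itself. As a minimal sketch (the name FUEL_CO2 is made up for this illustration), Python's re.VERBOSE flag ignores unescaped whitespace in the pattern and treats '#' as the start of a comment, so each part can carry its own note:

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# FUEL_CO2 is a hypothetical name chosen for this sketch.
FUEL_CO2 = re.compile(
    r"""
    (                 # group 1: fuel consumption
        \d+           #   one or more digits
        (?:,\d+)?     #   optional decimal part after a comma
        \s*           #   optional whitespace
        l/\d+km       #   'l/', digits, 'km'
    )
    .*?               # anything in between, as little as possible
    (                 # group 2: CO2 emissions
        \d+           #   one or more digits
        \s?           #   optional single whitespace
        g\s*CO        #   'g', optional whitespace, 'CO'
        [₂2]          #   subscript two or plain 2
        /km           #   '/km'
    )
    """,
    re.VERBOSE,
)

m = FUEL_CO2.search("ca. 5,0 l/100km (komb.), ca. 116 g CO₂/km (komb.)")
print(m.group(1), "|", m.group(2))
```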

Python code:

import re

regex = r"(\d+(?:,\d+)?\s*l/\d+km).*?(\d+\s?g\s*CO[₂2]/km)"

test_str = "17.160 km, 80 kW (109 PS)Limousine, Autogas (LPG), Automatik, HU Neu, 2/3 Türenca. 5,0 l/100km (komb.), ca. 116 g CO₂/km (komb.)\n\nEZ 10/2018, 12.900 km, 80 kW (109 PS)Limousine, Unfallfrei, Hybrid (Benzin/Elektro), Halbautomatik, HU Neu, ca. 5,9 l/100km (komb.), ca. 134 g CO₂/km (komb.) ... and so on"

print(re.findall(regex, test_str))

Results: [('5,0\u2009l/100km', '116\u2009g CO₂/km'), ('5,9\u2009l/100km', '134\u2009g CO₂/km')]
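
For the one-string-per-cell case from the question, a per-cell helper may be more convenient than a single findall over joined text. The sketch below is only one possible way to do it: the names NUMBERS and extract_fuel_and_co2 are hypothetical, the pattern is a variant that captures just the numbers (an assumption about what is needed downstream), and the second sample cell is shortened. It also guards against search() returning None, which is what makes co.group() in the question raise AttributeError when a pattern does not match.

```python
import re
from typing import Optional, Tuple

# Variant of the pattern above that captures only the numbers
# (an assumption for this sketch, not part of the original answer).
NUMBERS = re.compile(r"(\d+(?:,\d+)?)\s*l/\d+\s*km.*?(\d+)\s*g\s*CO[₂2]/km")


def extract_fuel_and_co2(cell: str) -> Optional[Tuple[float, int]]:
    """Return (litres per 100 km, grams CO2 per km), or None if nothing matches."""
    match = NUMBERS.search(cell)
    if match is None:
        # search() returns None on failure; calling .group() on it would raise
        return None
    litres = float(match.group(1).replace(",", "."))  # decimal comma -> dot
    grams = int(match.group(2))
    return litres, grams


cells = [
    "17.160 km, 80 kW (109 PS)Limousine, Autogas (LPG), Automatik, HU Neu, 2/3 Türenca. 5,0 l/100km (komb.), ca. 116 g CO₂/km (komb.)",
    "EZ 10/2018, 12.900 km, Hybrid (Benzin/Elektro), ca. 5,9 l/100km (komb.), ca. 134 g CO₂/km (komb.)",
    "no consumption data here",
]
for cell in cells:
    print(extract_fuel_and_co2(cell))
# (5.0, 116)
# (5.9, 134)
# None
```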

CodePudding user response:

You might use

\b\d+(?:,\d+)?(?:\s*l/\d+|\s*g\s+CO₂/)km\b
  • \b A word boundary
  • \d+(?:,\d+)? Match 1+ digits and an optional decimal part
  • (?: Non-capturing group
    • \s*l/\d+ Match optional whitespace chars, l/ and 1+ digits
    • | Or
    • \s*g\s+CO₂/ Match optional whitespace chars, g, 1+ whitespace chars and CO₂/
  • ) Close the non-capturing group
  • km\b Match km and a word boundary to prevent a partial match


import re

strings = [
    '17.160 km, 80 kW (109 PS)Limousine, Autogas (LPG), Automatik, HU Neu, 2/3 Türenca. 5,0 l/100km (komb.), ca. 116 g CO₂/km (komb.)',
    'EZ 10/2018, 12.900 km, 80 kW (109 PS)Limousine, Unfallfrei, Hybrid (Benzin/Elektro), Halbautomatik, HU Neu, ca. 5,9 l/100km (komb.), ca. 134 g CO₂/km (komb.)'
    ]
pattern = r"\b\d+(?:,\d+)?(?:\s*l/\d+|\s*g\s+CO₂/)km\b"
for s in strings:
    print(re.findall(pattern, s))

Output

['5,0 l/100km', '116 g CO₂/km']
['5,9 l/100km', '134 g CO₂/km']
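
Because findall returns both kinds of match in one flat list, it can help to label them. The following is a rough sketch of one way to do that, not part of the answer itself: LABELLED and labelled_values are hypothetical names, the named groups are an addition, and CO[₂2] is swapped in so that a plain "CO2" is accepted as well.

```python
import re

# Hypothetical variant of the alternation pattern: named groups tell the two
# kinds of match apart, and CO[₂2] also accepts a plain "2".
LABELLED = re.compile(
    r"\b(?P<value>\d+(?:,\d+)?)\s*(?:(?P<fuel>l/\d+)|(?P<co2>g\s+CO[₂2]/))km\b"
)


def labelled_values(text):
    """Map 'fuel' / 'co2' to the number strings found in text."""
    found = {}
    for m in LABELLED.finditer(text):
        # Exactly one of the two named alternatives takes part in each match
        key = "fuel" if m.group("fuel") else "co2"
        found[key] = m.group("value")
    return found


print(labelled_values("ca. 5,9 l/100km (komb.), ca. 134 g CO2/km (komb.)"))
# {'fuel': '5,9', 'co2': '134'}
```

A cell that lacks one of the two values simply yields a dictionary without that key, so the caller can check for it explicitly.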