Every cell of a column contains text with detailed information about a car, and I need to extract part of it. In my case that is the fuel consumption and the CO2 emissions.
The strings I get look like this:
cell 1 = 17.160 km, 80 kW (109 PS)Limousine, Autogas (LPG), Automatik, HU Neu, 2/3 Türenca. 5,0 l/100km (komb.), ca. 116 g CO₂/km (komb.)
cell 2 = EZ 10/2018, 12.900 km, 80 kW (109 PS)Limousine, Unfallfrei, Hybrid (Benzin/Elektro), Halbautomatik, HU Neu, ca. 5,9 l/100km (komb.), ca. 134 g CO₂/km (komb.) ... and so on
So from cell 1 I need: 5,0 l/100km and 116 g CO₂/km
and from cell 2: 5,9 l/100km and 134 g CO₂/km
I tried the following patterns, but none of them worked:
pattern_z = re.compile("[a-z]+.?\s?[0-9]+\s?[a-z]?\s[A-Z]+")
pattern_z = re.compile("^[ac]+\s?[CO]$")
pattern_z = re.compile(r'[0-9]+.[g]?')
After each "pattern_z" assignment I tried:
co = pattern_z.search(i)
cox = co.group()
Any help would be appreciated.
CodePudding user response:
Use
(\d+(?:,\d+)?\s*l/\d+km).*?(\d+\s?g\s*CO[₂2]/km)
See regex proof.
EXPLANATION
(            group and capture to \1:
  \d+        digits (0-9) (1 or more times, matching the most amount possible)
  (?:        group, but do not capture (optional, matching the most amount possible):
    ,        ','
    \d+      digits (0-9) (1 or more times, matching the most amount possible)
  )?         end of grouping
  \s*        whitespace (\n, \r, \t, \f, and " ") (0 or more times, matching the most amount possible)
  l/         'l/'
  \d+        digits (0-9) (1 or more times, matching the most amount possible)
  km         'km'
)            end of \1
.*?          any character except \n (0 or more times, matching the least amount possible)
(            group and capture to \2:
  \d+        digits (0-9) (1 or more times, matching the most amount possible)
  \s?        whitespace (\n, \r, \t, \f, and " ") (optional, matching the most amount possible)
  g          'g'
  \s*        whitespace (\n, \r, \t, \f, and " ") (0 or more times, matching the most amount possible)
  CO         'CO'
  [₂2]       any character of: '₂', '2'
  /km        '/km'
)            end of \2
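The breakdown above can also be kept next to the pattern itself. The following is a minimal sketch, assuming Python's `re.VERBOSE` flag is acceptable for your code base; the name `FUEL_CO2` and the shortened sample cell are made up for illustration.

```python
import re

# Same pattern as in the explanation, written in verbose mode so that
# every part carries its own comment. FUEL_CO2 is a hypothetical name.
FUEL_CO2 = re.compile(
    r"""
    (                 # group 1: fuel consumption, e.g. 5,0 l/100km
        \d+           #   integer part
        (?:,\d+)?     #   optional decimal part after a comma
        \s*l/         #   optional whitespace, then l/
        \d+km         #   distance, e.g. 100km
    )
    .*?               # anything in between, as little as possible
    (                 # group 2: CO2 emissions, e.g. 116 g CO₂/km
        \d+           #   the number
        \s?g\s*CO     #   optional whitespace, g, optional whitespace, CO
        [₂2]          #   subscript two or a plain 2
        /km
    )
    """,
    re.VERBOSE,
)

# Shortened sample cell (an assumption, not one of the original strings).
cell = "HU Neu, ca. 5,9 l/100km (komb.), ca. 134 g CO₂/km (komb.)"
match = FUEL_CO2.search(cell)
print(match.groups())
```

In verbose mode, whitespace inside the pattern is ignored, which is safe here because every space in the data is matched through `\s`.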
import re
regex = r"(\d+(?:,\d+)?\s*l/\d+km).*?(\d+\s?g\s*CO[₂2]/km)"
test_str = "17.160 km, 80 kW (109 PS)Limousine, Autogas (LPG), Automatik, HU Neu, 2/3 Türenca. 5,0 l/100km (komb.), ca. 116 g CO₂/km (komb.)\n\nEZ 10/2018, 12.900 km, 80 kW (109 PS)Limousine, Unfallfrei, Hybrid (Benzin/Elektro), Halbautomatik, HU Neu, ca. 5,9 l/100km (komb.), ca. 134 g CO₂/km (komb.) ... and so on"
print(re.findall(regex, test_str))
Results: [('5,0\u2009l/100km', '116\u2009g CO₂/km'), ('5,9\u2009l/100km', '134\u2009g CO₂/km')]
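Since the question calls `co.group()` directly on the result of `search`, note that `search` returns `None` when a cell does not match, and `None.group()` raises an `AttributeError`. A minimal sketch of a per-cell helper that guards against this; the function name `extract_fuel_and_co2` and the second sample cell are hypothetical.

```python
import re

PATTERN = re.compile(r"(\d+(?:,\d+)?\s*l/\d+km).*?(\d+\s?g\s*CO[₂2]/km)")

def extract_fuel_and_co2(cell):
    """Return (fuel, co2) for one cell, or (None, None) if nothing matches."""
    match = PATTERN.search(cell)
    if match is None:
        # search() found nothing, so there are no groups to read
        return None, None
    return match.group(1), match.group(2)

cells = [
    "Automatik, HU Neu, ca. 5,0 l/100km (komb.), ca. 116 g CO₂/km (komb.)",
    "Elektro, Automatik, HU Neu",  # made-up cell without consumption data
]
for cell in cells:
    print(extract_fuel_and_co2(cell))
```

Returning a pair of `None` values keeps the result shape the same for every cell, which makes it easier to write the values back into two new columns.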
CodePudding user response:
You might use
\b\d+(?:,\d+)?(?:\s*l/\d+|\s*g\s+CO₂/)km\b
\b               A word boundary
\d+(?:,\d+)?     Match 1+ digits and an optional decimal part
(?:              Non-capture group
  \s*l/\d+       Match l/ and 1+ digits
  |              Or
  \s*g\s+CO₂/    Match g, 1+ whitespace chars and CO₂/
)                Close the non-capture group
km\b             Match km and a word boundary to prevent a partial match
import re
strings = [
'17.160 km, 80 kW (109 PS)Limousine, Autogas (LPG), Automatik, HU Neu, 2/3 Türenca. 5,0 l/100km (komb.), ca. 116 g CO₂/km (komb.)',
'EZ 10/2018, 12.900 km, 80 kW (109 PS)Limousine, Unfallfrei, Hybrid (Benzin/Elektro), Halbautomatik, HU Neu, ca. 5,9 l/100km (komb.), ca. 134 g CO₂/km (komb.)'
]
pattern = r"\b\d+(?:,\d+)?(?:\s*l/\d+|\s*g\s+CO₂/)km\b"
for s in strings:
    # findall returns every full match, because the pattern has no capture groups
    print(re.findall(pattern, s))
Output
['5,0 l/100km', '116 g CO₂/km']
['5,9 l/100km', '134 g CO₂/km']
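If the values are needed as numbers rather than text, the same idea can be extended with two capture groups. This is only a sketch under assumptions: the name `parse_values`, the dictionary keys, and the shortened sample cell are invented here, and the German decimal comma is converted to a dot before calling `float`.

```python
import re

# Variant of the pattern above: group 1 is the number, group 2 the unit prefix.
NUMBER_UNIT = re.compile(r"\b(\d+(?:,\d+)?)\s*(l/\d+|g\s+CO₂/)km\b")

def parse_values(cell):
    """Map each value found in a cell to a float (hypothetical key names)."""
    values = {}
    for number, unit in NUMBER_UNIT.findall(cell):
        key = "fuel_l_per_100km" if unit.startswith("l/") else "co2_g_per_km"
        # "5,9" uses a decimal comma; float() expects a dot
        values[key] = float(number.replace(",", "."))
    return values

# Shortened sample cell (an assumption, not one of the original strings).
cell = "HU Neu, ca. 5,9 l/100km (komb.), ca. 134 g CO₂/km (komb.)"
print(parse_values(cell))
```

With capture groups present, `findall` returns tuples of the groups instead of the full match, which is why the loop unpacks `number` and `unit`.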