How to extract the first uppercase words and numbers from a vehicle model and derivative string?

I currently get a list of vehicles from a 3rd party, but it is all a single long string. I used RegEx to split based on the engine size (e.g. 1.2, 120kW, etc), but this only works for certain car brands. Others will use a different format for their vehicle naming.

I have since noticed that the Model of the vehicle is always in UPPERCASE at the beginning of the string. The remainder is the derivative of the vehicle (e.g. the engine, spec, doors etc).

Here are some examples of strings. I have added a line '|' where it needs splitting (the line isn't there in the actual string):

XC40 DIESEL ESTATE | 2.0 D4 [190] Inscription Pro 5dr AWD Gea

RAPID DIESEL HATCHBACK | 1.6 TDI CR Elegance 5dr

ID.4 ESTATE | 109kW City Pure 52kWh 5dr Auto

V40 DIESEL HATCHBACK D2 [120] Momentum 5dr Geartronic

HILUX SPECIAL EDITIONS | Invincible X Ltd Ed D/Cab P/Up 2.4 D

A CLASS DIESEL HATCHBACK | A180d Sport Executive 5dr Auto

X3 ESTATE xDrive M40i 5dr Auto

The following Stack Overflow question answers 90% of my question, but it only handles fully uppercase words and ignores numbers and decimals, which causes issues with e.g. the ID.4 and XC40. (Extract uppercase words till the first lowercase letter)

The RegEx which the above post mentions is:

\b[A-Z]([A-Z ]*[A-Z])?\b

It works great, apart from the numbers being excluded.

Here is a link to the active regexr.com I am working on. It has a few test strings to test against: https://regexr.com/6tur8

The goal is to convert, e.g., this:

XC40 DIESEL ESTATE 2.0 D4 [190] Inscription Pro 5dr AWD Gea

To This:

Model: XC40 DIESEL ESTATE

Derivative: 2.0 D4 [190] Inscription Pro 5dr AWD Gea

Currently it is not identifying the XC40 part.

As you can see in the regexr link, vehicles come in all variations, with one thing in common: the Model is always in UPPERCASE at the beginning of the string.

I am working in PHP to split the string using preg_split, and I presume that RegEx is the best solution for this?

CodePudding user response：

As a result of the discussion in the comments (getting a list of models would be the best option, but is impossible) my best try is this:

^((\d|([A-Z0-9\.]*[A-Z][A-Z0-9\.]*)) )*([A-Z]{3,})

The match has to end in an uppercase-only word of at least 3 letters ([A-Z]{3,}). Before that, it matches any number of tokens that are each followed by a space, where a token is either a single digit (\d) or a word that contains at least one uppercase letter ([A-Z]) with any number of uppercase letters, digits or dots ([A-Z0-9\.]) before or after it.
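A minimal PHP sketch of how this pattern could be applied with preg_match, assuming the goal is a model/derivative pair. The function name splitVehicle and the array keys are made up for illustration and are not part of the original post:

```php
<?php
// Hypothetical helper: the name and return shape are assumptions for illustration.
function splitVehicle(string $vehicle): ?array
{
    // Pattern from the answer: space-terminated uppercase/digit/dot words,
    // ending in a word of 3 or more uppercase letters.
    $pattern = '/^((\d|([A-Z0-9\.]*[A-Z][A-Z0-9\.]*)) )*([A-Z]{3,})/';

    if (preg_match($pattern, $vehicle, $m) !== 1) {
        return null; // no uppercase model prefix found
    }

    $model = $m[0]; // the whole match is the model
    return [
        'model'      => $model,
        // Everything after the model, without the separating whitespace.
        'derivative' => trim(substr($vehicle, strlen($model))),
    ];
}

print_r(splitVehicle('XC40 DIESEL ESTATE 2.0 D4 [190] Inscription Pro 5dr AWD Gea'));
```

Returning null for strings without a match keeps the unmatched cases easy to spot, so they can be reviewed separately.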

https://regex101.com/r/35fILx/1

It will probably need a lot of updating for edge-cases in the other models! In the worst case you can use this regex for most cases, and then hand-code the bad edge cases.
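One way the "regex first, hand-coded exceptions for the rest" idea could look in PHP is sketched below. The OVERRIDES table, its single entry and the function name splitWithFallback are all invented for illustration; the entry is not taken from the real data, it just shows a string (with a hyphen) that the regex cannot match:

```php
<?php
// Hypothetical hand-coded exceptions, keyed by the raw string.
// The entry is an invented example, not real data from the 3rd party.
const OVERRIDES = [
    'E-CLASS SALOON E220d Sport 4dr' => ['E-CLASS SALOON', 'E220d Sport 4dr'],
];

// Returns [model, derivative]; model is null when neither approach applies.
function splitWithFallback(string $vehicle): array
{
    // 1. Hand-coded edge cases win over the regex.
    if (array_key_exists($vehicle, OVERRIDES)) {
        return OVERRIDES[$vehicle];
    }

    // 2. Otherwise use the regex from the answer.
    $pattern = '/^((\d|([A-Z0-9\.]*[A-Z][A-Z0-9\.]*)) )*([A-Z]{3,})/';
    if (preg_match($pattern, $vehicle, $m) === 1) {
        return [$m[0], trim(substr($vehicle, strlen($m[0])))];
    }

    // 3. Nothing matched: keep the string whole for manual review.
    return [null, $vehicle];
}

print_r(splitWithFallback('RAPID DIESEL HATCHBACK 1.6 TDI CR Elegance 5dr'));
```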

CodePudding user response：

If the match should not cross a digit-dot-digit sequence (such as the engine size 2.0 or 1.6) and should end on 3 or more uppercase chars:

^[A-Z\d](?:(?!\d+\.\d)[A-Z\d.\h])*[A-Z]{3,}

Explanation

^ Start of string
[A-Z\d] Match a single char A-Z or a digit
(?: Non capture group to repeat as a whole part
- (?!\d+\.\d) Negative lookahead, assert that what is directly to the right is not 1+ digits, a dot and a digit
- [A-Z\d.\h] Match 1 char of the listed, where \h matches a horizontal whitespace char
)* Close the non capture group and optionally repeat it
[A-Z]{3,} Match 3 or more uppercase chars

See a regex demo.
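As a rough check of the lookahead variant in PHP, the sketch below runs it over a few of the sample strings from the question. The lookahead is written as \d+\.\d (1+ digits, a dot, a digit); the variable names are only illustrative:

```php
<?php
// Lookahead variant: the repeated part may not step onto a position where
// an engine size such as "1.6" or "2.0" starts.
$pattern = '/^[A-Z\d](?:(?!\d+\.\d)[A-Z\d.\h])*[A-Z]{3,}/';

// Sample strings from the question, without the '|' marker.
$samples = [
    'RAPID DIESEL HATCHBACK 1.6 TDI CR Elegance 5dr',
    'ID.4 ESTATE 109kW City Pure 52kWh 5dr Auto',
    'X3 ESTATE xDrive M40i 5dr Auto',
];

$models = [];
foreach ($samples as $sample) {
    // $m[0] is the model; null marks a string the pattern does not match.
    $models[] = preg_match($pattern, $sample, $m) === 1 ? $m[0] : null;
}

print_r($models);
```

The trailing [A-Z]{3,} is what trims the match back to the last fully uppercase word, so digits that the repeated part consumed (like the 109 in 109kW) are given back by backtracking.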

Or written a bit more efficiently without the negative lookahead:

^(?:\d\h+|[\d.]*[A-Z][A-Z\d.]*\h+)*[A-Z]{3,}

Explanation

^ Start of string
(?: Non capture group
- \d\h+ Match a single digit and 1+ horizontal whitespace chars
- | Or
- [\d.]*[A-Z][A-Z\d.]*\h+ Match optional digits and dots, then an uppercase char A-Z, then optional uppercase chars, digits and dots, followed by 1+ horizontal whitespace chars
)* Close the non capture group and optionally repeat it
[A-Z]{3,} Match 3 or more uppercase chars

See another regex demo.
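Since the question mentions preg_split, here is a hedged sketch of how the alternation pattern might be used with it. It appends \K\h+ to the pattern, which is an addition of this sketch and not part of the answer: \K discards the model from the reported match, so only the whitespace after the model is treated as the delimiter. The function name splitModelDerivative is hypothetical:

```php
<?php
// Hypothetical helper: returns [model, derivative], or null without a model prefix.
function splitModelDerivative(string $vehicle): ?array
{
    // Alternation pattern from the answer, plus \K\h+ (an addition for this sketch):
    // \K resets the match start, so only the whitespace after the model is split on.
    $pattern = '/^(?:\d\h+|[\d.]*[A-Z][A-Z\d.]*\h+)*[A-Z]{3,}\K\h+/';

    // Limit 2: at most one split, giving model and derivative.
    $parts = preg_split($pattern, $vehicle, 2);

    return is_array($parts) && count($parts) === 2 ? $parts : null;
}

[$model, $derivative] = splitModelDerivative('A CLASS DIESEL HATCHBACK A180d Sport Executive 5dr Auto');
echo "Model: $model\nDerivative: $derivative\n";
```

Because the pattern is anchored with ^, it can only match once per string, so the result has either two parts or the untouched input as a single part.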