Extracting Model and VIN from URL (second time pattern appears, but with date in addition to make)

Time:11-10

Tried:

# model_pattern = r'\d{4}\-([^/]+)\-'
model_pattern = r'[-]([^/]+)\-'

I want the model (including the year): 2021-Mercedes-Benz-Sprinter 2500
and the VIN: 286f67180a0e09a8729929613aac3877
from: /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter 2500-286f67180a0e09a8729929613aac3877.htm

Another example, this one without a space in the model name: /used/Audi/2015-Audi-SQ5-286f67180a0e09a8729929613aac3877.htm

I use

Clean_Make["Model"] = Clean_Make["Page"].str.extract(model_pattern)
Clean_Make

This is the resulting table:

      Page                                               City           Pageviews  Unique Pageviews  Avg. Time on Page  Entrances  Bounce Rate  % Exit  Make1          Make2          Make           Model
71    /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...  San Jose       310        149               00:00:27           149        2.00%        47.74%  Mercedes-Benz  Mercedes-Benz  Mercedes-Benz  Mercedes-Benz-Sprinter 2500
103   /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992...  Menlo Park     250        87                00:02:36           82         0.00%        32.40%  Audi           Audi           Audi           Audi-SQ5
158   /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...  San Francisco  202        98                00:00:18           98         2.04%        48.02%  Mercedes-Benz  Mercedes-Benz  Mercedes-Benz  Mercedes-Benz-Sprinter 2500
165   /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf...  San Francisco  194        93                00:00:42           44         2.22%        29.38%  Audi           Audi           Audi           Audi-S8
168   /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...  (not set)      192        91                00:00:11           91         2.20%        47.40%  Mercedes-Benz  Mercedes-Benz  Mercedes-Benz  Mercedes-Benz-Sprinter 2500
...   ...                                                ...            ...        ...               ...                ...        ...          ...     ...            ...            ...            ...
4995  /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0...  Union City     10         3                 00:02:02           0          0.00%        30.00%  Subaru         Subaru         Subaru         Subaru-Crosstrek
4996  /used/Tesla/2017-Tesla-Model S-15605a190a0e087...  San Jose       10         5                 00:01:29           5          0.00%        50.00%  Tesla          Tesla          Tesla          Tesla-Model S

CodePudding user response:

You can use

/([^/]+)-([a-f0-9]{32})\.htm

See the regex demo.

Details:

  • / - a / char
  • ([^/]+) - Group 1 (model): one or more chars other than /
  • - - a hyphen
  • ([a-f0-9]{32}) - Group 2 (VIN): 32 hex chars
  • \.htm - a .htm string.
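
As a quick check outside Pandas, the pattern can be tried on the two URLs from the question with the standard re module. This is only a minimal sketch: the helper name split_model_vin is made up for illustration, and it assumes the ID is always 32 lowercase hex characters followed by .htm.

```python
import re

# Same pattern as above: group 1 is the model (with year), group 2 the 32-char hex ID.
PATTERN = re.compile(r'/([^/]+)-([a-f0-9]{32})\.htm')

def split_model_vin(url):
    """Hypothetical helper: return (model, vin), or None if the URL does not match."""
    match = PATTERN.search(url)
    return match.groups() if match else None

urls = [
    "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter 2500-286f67180a0e09a8729929613aac3877.htm",
    "/used/Audi/2015-Audi-SQ5-286f67180a0e09a8729929613aac3877.htm",
]
for url in urls:
    print(split_model_vin(url))
```

Because [^/]+ cannot cross a slash, the match can only start at the last / of the path, so the make folder (/used/Mercedes-Benz/) is skipped and the year stays attached to the model.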

In Pandas, you can use

Clean_Make[["Model", "VIN"]] = Clean_Make["Page"].str.extract(r'/([^/]+)-([a-f0-9]{32})\.htm', expand=False)
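
A variant worth considering, sketched here on a small stand-in frame rather than the real Clean_Make data: with named groups, str.extract returns a DataFrame whose columns already carry the group names, so they can be joined back without listing the target columns. The three-row frame and the non-matching "/used/" row are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical stand-in for Clean_Make; the last row deliberately does not match.
clean_make = pd.DataFrame({"Page": [
    "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter 2500-286f67180a0e09a8729929613aac3877.htm",
    "/used/Audi/2015-Audi-SQ5-286f67180a0e09a8729929613aac3877.htm",
    "/used/",
]})

# Named groups become the column names of the frame returned by str.extract.
pattern = r'/(?P<Model>[^/]+)-(?P<VIN>[a-f0-9]{32})\.htm'
extracted = clean_make["Page"].str.extract(pattern)

# Join on the shared index; rows that do not match get NaN in both columns.
clean_make = clean_make.join(extracted)
print(clean_make[["Model", "VIN"]])
```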