Home > Back-end >  How to regex extract CAR MAKE from URL in pandas df column
How to regex extract CAR MAKE from URL in pandas df column

Time:11-08

I am trying to extract from URL str "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte..." the entire Make name, i.e. "Mercedes-Benz" BUT my pattern only returns the first letter, i.e. "M"

Please help me come up with the correct pattern to use on pandas df.

Thank you

CODE: URLS_by_City['Make'] = URLS_by_City['Page'].str.extract('. ([A-Z])\w (?=[\/]) ', expand=True) Clean_Make = URLS_by_City.dropna(subset=["Make"]) Clean_Make # WENT FROM 5K rows --> to 2688 rows

    Page    City    Pageviews   Unique Pageviews    Avg. Time on Page   Entrances   Bounce Rate % Exit  **Make**
71  /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...   San Jose    310 149 00:00:27    149 2.00%   47.74%  **B**
103 /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992...   Menlo Park  250 87  00:02:36    82  0.00%   32.40%  **A**
158 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...   San Francisco   202 98  00:00:18    98  2.04%   48.02%  **B**
165 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf...   San Francisco   194 93  00:00:42    44  2.22%   29.38%  **A**
168 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...   (not set)   192 91  00:00:11    91  2.20%   47.40%  **B**
... ... ... ... ... ... ... ... ... ...
4995    /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0...   Union City  10  3   00:02:02    0   0.00%   30.00%  **S**
4996    /used/Tesla/2017-Tesla-Model S-15605a190a0e087...   San Jose    10  5   00:01:29    5   0.00%   50.00%  **T**
4997    /used/Tesla/2018-Tesla-Model 3-0f3ea14d0a0e09a...   Las Vegas   10  4   00:00:09    2   0.00%   40.00%  **T**
4998    /used/Tesla/2018-Tesla-Model 3-0f3ea14d0a0e09a...   Austin  10  4   00:03:29    2   0.00%   40.00%  **T**
4999    /used/Tesla/2018-Tesla-Model 3-5f29cdc70a0e09a...   Orinda  10  4   00:04:00    1   0.00%   0.00%   **T**

TRIED: `example_url = "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter 2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1" pattern = ". ([a-zA-Z0-9()])\w (?=[/]) "

wanted_make = URLS_by_City['Page'].str.extract(pattern)
wanted_make

` 0 0 r 1 r 2 NaN 3 NaN 4 r ... ... 4995 r 4996 l 4997 l 4998 l 4999 l

It worked in regex online tool.

but unfortunately not in my jupyter notebook

EXAMPLE PATTERNS - I bolded what should match: /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter 2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter 2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1 /used/Audi/2021-Audi-RS 5-b92922bd0a0e09a91b4e6e9a29f63e8f.htm /used/LEXUS/2018-LEXUS-GS 350-dffb145e0a0e09716bd5de4955662450.htm /used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm /used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm /used/Honda/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm /new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric /used-inventory/index.htm /new-inventory/index.htm /new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric /

CodePudding user response:

I would use:

URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'([^/] )/\d{4}\b')

This targets the URL path segment immediately before the portion with the year. You could also try this version:

URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'/[^/] /([^/] )')

CodePudding user response:

I have tried completing your requirement in Jupyter Notebook.

PFB the code and screenshots:

  1. I have created a dummy pandas dataframe(data_df), below is a screenshot of the same

enter image description here

  1. I have created a pattern based on the pattern of the string to be extracted

    pattern = "^/used/(.*)/(?=[20][0-9{2}])"

  2. Used the patten to extract required data from the URLs and saved it in another column in the same dataframe

    data_df['Car Maker'] = data_df['urls'].str.extract(pattern)

  3. Below is a screenshot of the output

enter image description here

I hope this is helpful..

  • Related