Methods to manipulate entries in a list

Time:04-12

I've hit my next roadblock. I've retrieved the URLs for the images I'd like to download. The problem is that they carry parameters that shrink the images to thumbnail size:

['https://www.lego.com/cdn/cs/set/assets/blt92a894b291b4c966/21054.jpg?fit=bounds&format=jpg&quality=80&width=65&height=45&dpr=1', 
...
'https://www.lego.com/cdn/cs/set/assets/bltea2ebe53c7c18194/21054_alt14.jpg?fit=bounds&format=jpg&quality=80&width=65&height=45&dpr=1']

I'd like to strip the "[" at the beginning, the "]" at the end, and everything after the "?" in each link.

I tried to use strip, but that didn't work because it's a list.

I then read somewhere to use pandas, and that's making my head spin. Specifically, how is the value of each row in the column passed to a variable? Also, any pointers on how to strip the aforementioned characters would be great. I'm still tinkering with it.

Complete Code for reference:

import io
from os import link
from re import search
from typing import Counter
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import time
import bs4
import os
import wget
import requests
from PIL import Image
import pandas as pd
set_number = "21054"

#specify the path to chromedriver.exe (download and save on your computer)
driver = webdriver.Chrome('/Users/ibrahiemk/Downloads/chromedriver')

#open the webpage
driver.get("http://shop.lego.com")

#alert 1
button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div[5]/div/div/div[1]/div[1]/div/button'))).click()
#Button 
button2 = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[normalize-space()='Just Necessary']"))).click()
#target Search
search = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='root']/div[2]/header/div[2]/div[2]/div/div[5]/div/button"))).click()
searchbox = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='desktop-search-search-input']")))

searchbox.send_keys(set_number)

#Click the resulting set
searchbox = WebDriverWait(driver,10).until(EC.element_to_be_clickable((By.XPATH, "//*[@id='desktop-search-search-suggestions']/li/a/div"))).click()

anchors = driver.find_elements(By.XPATH, '//*[@id="main-content"]/div/div[1]/div/div[1]/div[1]/div/div/div/div[2]/div/div/div/ol/li/button/img')
links = [a.get_attribute('src') for a in anchors]
df = pd.DataFrame(links, columns=['links'])
df ['links'] = df['links'].str.rstrip('?.*$')
links = links.values[0]
print(df)

I know the df stuff is broken; I'm still tinkering with it. TIA!

CodePudding user response:

You can do something like this:

urls = ["https://example.com/bar1.jpg?query-string",
        "https://example.com/bar2.jpg?query-string", 
        "https://example.com/bar3.jpg?query-string"]

stripped_urls = []
for url in urls:
    # Keep only the part of the URL before the first "?"
    stripped_url = url.split("?")[0]
    stripped_urls.append(stripped_url)

Or maybe the following one-liner if you prefer:

stripped_urls = [url.split("?")[0] for url in urls]
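
If you would rather stay with pandas, as in the question, here is a minimal sketch of the same idea. The sample URLs are placeholders standing in for the scraped list, and clean_links is just an illustrative name. Note that str.rstrip('?.*$') treats its argument as a set of characters to trim from the right-hand end, not as a regular expression, which is why it leaves the query string in place. The "[" and "]" are only how Python prints a list; they are not part of the strings, so there is nothing to strip there.

```python
import pandas as pd

# Placeholder data standing in for the `links` list scraped in the question.
links = [
    "https://example.com/bar1.jpg?fit=bounds&width=65",
    "https://example.com/bar2.jpg?fit=bounds&width=65",
]

df = pd.DataFrame(links, columns=["links"])

# .str.split("?") splits every row's string; .str[0] keeps the part before "?".
df["links"] = df["links"].str.split("?").str[0]

# Turn the cleaned column back into a plain Python list.
clean_links = df["links"].tolist()

# Iterating over the column hands each row's value to a variable in turn.
for link in df["links"]:
    print(link)
```

For a single column of strings, the plain list comprehension is probably simpler; pandas mostly pays off if you need the DataFrame for something else.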

For other ways to strip a query string, see "How do I remove a query string from URL using Python".
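
One such alternative, sketched here with the standard library's urllib.parse, parses the URL instead of splitting on a character. The helper name strip_query is made up for this example, and it assumes you also want any "#fragment" dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    # Rebuild the URL from scheme, host and path, leaving query and fragment empty.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


urls = [
    "https://example.com/bar1.jpg?query-string",
    "https://example.com/bar2.jpg?query-string",
]

stripped = [strip_query(url) for url in urls]
print(stripped)
```

This is a little more code than split("?"), but it does not rely on where the "?" happens to sit in the string.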
