I got this code to almost work, despite much ignorance. Please help on the home run!
- Problem 1: INPUT:
I have a long list of URLs (1000) to read from, and they are in a single column in a .csv file. I would prefer to read them from that file rather than paste them into the code, as below.
- Problem 2: OUTPUT:
The source files actually have 3 drivers and 3 challenges each. In a separate python file, the below code finds, prints and saves all 3, but not when I'm using this dataframe below (see below - it only saves 2).
- Problem 3: OUTPUT:
I want the output (both files) to have URLs in column 0, and then drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') makes them not only drop one row but also move across 2 columns.
At the end I'm showing both the inputs and the current & desired output. Sorry for the long question. I'll be very grateful for any help!
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()

    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()
The inputs look like this in each URL. They are just lists:
Market drivers
- Growing investment in fabs
- Miniaturization of electronic products
- Increasing demand for IoT devices
Market challenges
- Rapid technological changes in semiconductor industry
- Volatility in semiconductor industry
- Impact of technology chasm Table Impact of drivers and challenges
My desired output for drivers is:
0 | 1 | 2 | 3 |
---|---|---|---|
http/.../Global-Induction-Hobs-30196623/ | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances |
http/.../Global-Human-Capital-Management-30196628/ | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity |
http/.../Global-Probe-Card-30196643/ | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices |
But instead I get:
0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
http/.../Global-Induction-Hobs-30196623/ | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances | ||||
http/.../Global-Human-Capital-Management-30196628/ | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity | ||||
http/.../Global-Probe-Card-30196643/ | Miniaturization of electronic products | Increasing demand for IoT devices |
Answer:
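Before restructuring, it may help to see why the question's code loses one driver and staggers the columns. The following is a minimal offline sketch, not the original script: the names `url_a`, `url_b` and the strings `a1` ... `b3` are made-up placeholders for the scraped data, and it assumes the selector returns exactly the three drivers per page.

```python
import pandas as pd

# Placeholder data: two hypothetical URLs with three drivers each.
scraped = {
    "url_a": ["a1", "a2", "a3"],
    "url_b": ["b1", "b2", "b3"],
}

frames = []
for url, drivers in scraped.items():
    df = pd.DataFrame(drivers, columns=[url])
    # drop(0, axis=0) removes the row labelled 0, i.e. the first driver.
    frames.append(df.drop(0, axis=0))

# Every frame has a different column name, so concat aligns on column
# names and fills the cells that do not match with NaN.
wide = pd.concat(frames).T

print(wide)
```

Under these assumptions each URL's remaining two values end up in their own pair of columns, which would match the "moves across 2 columns" symptom, and the first driver is gone because of the `drop`.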
Store your data in a list of dicts and create a data frame from it. Then split the list of drivers / challenges into single columns and concat them to the final data frame.
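To look at the split step in isolation, here is a small sketch that skips the scraping entirely. The records are made-up placeholders (`url_a`, `a1`, `c1`, ...) in the same shape as the dicts described above; only the pandas calls matter.

```python
import pandas as pd

# Made-up records in the same shape the scraping loop would build.
records = [
    {"url": "url_a", "type": "driver", "list": ["a1", "a2", "a3"]},
    {"url": "url_a", "type": "challenges", "list": ["c1", "c2", "c3"]},
]

base = pd.DataFrame(records)

# Turn each list into one row; its items become columns 0, 1, 2, ...
items = pd.DataFrame(base["list"].tolist())

# Side-by-side concat keeps url/type first, then one column per item.
result = pd.concat([base[["url", "type"]], items], axis=1)

print(result)
```

If the lists have different lengths, the shorter rows are padded with missing values, so pages with fewer entries do not break the layout.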
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()

    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.text.replace('Table Impact of drivers and challenges', '') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })

    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())], axis=1)  #.to_csv(sep='|')
Output
url | type | 0 | 1 | 2 |
---|---|---|---|---|
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | driver | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | challenges | High cost limiting the adoption in the mass segment | Health hazards related to induction hobs | Limitation of using only flat - surface utensils and induction-specific cookware |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | driver | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | challenges | Threat from open-source software | High implementation and maintenance cost | Threat to data security |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | driver | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices |
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | challenges | Rapid technological changes in semiconductor industry | Volatility in semiconductor industry | Impact of technology chasm |
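The example above does not cover Problem 1 (reading the URLs from a file) or writing drivers and challenges to separate files. A rough sketch of both follows. The helper names `read_urls` and `write_by_type` and the file name `urls.csv` are hypothetical, and it assumes the CSV has a single column with no header row.

```python
import pandas as pd


def read_urls(path):
    """Return the first column of a header-less CSV as a list of strings."""
    # header=None is an assumption: the file has no header row.
    column = pd.read_csv(path, header=None)[0]
    return column.dropna().astype(str).str.strip().tolist()


def write_by_type(df, paths):
    """Write one CSV per 'type' value, with the URL in the first column."""
    for kind, path in paths.items():
        # Keep only the rows of this type; the 'type' column is then redundant.
        part = df[df["type"] == kind].drop(columns="type")
        part.to_csv(path, index=False)


# Possible usage (not run here); `result` would be the combined data frame
# from the pd.concat(...) line, assigned to a variable:
# urls = read_urls("urls.csv")
# write_by_type(result, {"driver": "detail-dr.csv", "challenges": "detail-ch.csv"})
```

Because `url` stays the first column of the combined frame, each output file has the URL in column 0 followed by one column per driver or challenge.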