I have a page URL that I want to pull data from using Python.
I basically want to return the paragraph data found 'under' an h2 element. The issue is that the content is not nested and there are no real classes/IDs on any of the content.
Structure of the content I want to pull:
<h2>Heading text</h2>
<p>Text I want to get</p>
<p>Text I want to get</p>
<p>Text I want to get</p>
<h2>Heading text 2</h2>
<p>Text 2 I want to get</p>
<p>Text 2 I want to get</p>
<p>Text 2 I want to get</p>
Output I want is an array object of h2 / paragraphs.
Expected Output for the first h2
<p>Text I want to get</p>
<p>Text I want to get</p>
<p>Text I want to get</p>
Then cycle to the second h2 and return
<p>Text 2 I want to get</p>
<p>Text 2 I want to get</p>
<p>Text 2 I want to get</p>
At the moment I can get all the h2 and paragraphs separately but can't figure out how to only return paragraphs for the first h2 then cycle to the second h2 and do the same.
Current code snippets I have tried (both of which return all the paragraphs):
import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get("https://www.obd-codes.com/p0100")
soup = BeautifulSoup(page.content, 'html.parser')

# Collect the text of every h1 tag on the page
all_h1_tags = []
for element in soup.select('h1'):
    all_h1_tags.append(element.text)

# Collect the text of every h2 tag on the page
all_h2_tags = []
for element in soup.select('h2'):
    all_h2_tags.append(element.text)

# Collect the text of every p tag on the page (not grouped by heading)
all_p_tags = []
for element in soup.select('p'):
    all_p_tags.append(element.text)

print(all_h1_tags, all_h2_tags, all_p_tags)
And this one
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

# Make a request
page = requests.get("https://www.obd-codes.com/p0100").text
soup = BeautifulSoup(page, 'html.parser')

for header in soup.find_all('h2'):
    nextNode = header
    while True:
        nextNode = nextNode.nextSibling
        if nextNode is None:
            break
        if isinstance(nextNode, NavigableString):
            print(nextNode.strip())
        if isinstance(nextNode, Tag):
            if nextNode.name == "h2":
                break
            print(nextNode.get_text(strip=True).strip())
CodePudding user response:
You can use tag.find_previous to get the previous <h2> element. For example:
import requests
from bs4 import BeautifulSoup

url = "https://www.obd-codes.com/p0100"
soup = BeautifulSoup(requests.get(url).content, "lxml")

out = {}
for tag in soup.select(".main > *:not(h1, h2, #ads)"):
    prev_h2 = tag.find_previous("h2")
    text = tag.get_text(strip=True, separator="\n")
    if text not in ("", "Share"):  # do some basic filtering
        out.setdefault(prev_h2.text, []).append(text)

print(out)
Prints:
{
"Technical Description": [
"Mass or Volume Air Flow (MAF) Circuit Malfunction"
],
"What does that mean?": [
"This diagnostic trouble code (DTC) is a generic powertrain code, which means that it applies to OBD-II equipped vehicles that have a mass airflow sensor. Brands include but are not limited to Toyota, Nissan, Vauxhall, Mercedes Benz, Mitsubishi, VW, Saturn, Ford, Jeep, Jaguar, Chevy, Infiniti, etc. Although generic, the specific repair steps may vary depending on make/model.",
"The MAF (mass air flow) sensor is a sensor mounted in a vehicle's engine air intake tract downstream from the air filter, and is used to measure the volume and density of air being drawn into the engine. The MAF sensor itself only measures a portion of the air entering and that value is used to calculate the total volume and density of air being ingested.",
"The powertrain control module (PCM) uses that reading along with other sensor parameters to ensure proper fuel delivery at any given time for optimum power and fuel efficiency.",
"This P0100 diagnostic trouble code (DTC) means that there is a detected problem with the Mass Air Flow (MAF)\nsensor or circuit. The PCM detects that the actual MAF sensor frequency signal\nis not performing within the normal expected range of the calculated MAF value.",
"Note: Some MAF sensors also incorporate an air temperature sensor, which is another value used by the PCM for optimal engine operation.",
"Closely related MAF circuit trouble codes include:",
'P0101\nMass or Volume Air Flow "A" Circuit Range/Performance\nP0102\nMass\nor Volume Air Flow "A" Circuit Low Input\nP0103\nMass\nor Volume Air Flow "A" Circuit High Input\nP0104\nMass or Volume Air Flow "A" Circuit Intermittent',
"Photo of a MAF sensor:",
],
"What are some possible symptoms?": [
"Symptoms of a P0100 code may include:",
"Malfunction indicator lamp (MIL) illumination (a.k.a. check engine light)\nRough running engine\nBlack smoke from tail pipe\nStalling\nEngine hard start or stalling after it starts\nPossible other driveability symptoms or even no symptoms",
],
"What are some potential causes?": [
"Potential causes for this trouble code may include:",
"Dirty or contaminated mass air flow sensor\nFailed MAF sensor\nIntake air leaks\nMAF sensor electrical harness or wiring problem (open, shorted, frayed, poor connection, etc.)",
'Note that other codes may be present if you have a P0101. You may have misfire codes or O2 sensor codes, so it\'s important to take a "big picture" look at how the systems work together and effect each other when doing a diagnosis.',
],
"What can I do to diagnose and repair a P0100 engine code?": [
"Visually inspect all MAF sensor wiring and connectors to make sure they are intact, not frayed, broken, routed too close to ignition wires/coils, relays, motors, etc.\nVisually inspect for any obvious air leaks in the air intake system\nVisually *closely* inspect the MAF sensor wires or film to see if you can see contamination such as dirt, dust, oil, etc.\nIf the air filter is dirty, replace it with a new original equipment filter from the dealer\nCarefully clean the MAF using\nMAF cleaner spray\nis generally a good DIY friendly diagnostic/repair step\nIf the air intake system has a mesh in it, make sure that is also clean (VWs mainly)\nLoss of vacuum to the MAP sensor can trigger this DTC\nA low minimum air rate through the sensor bore may cause this DTC to set\nat idle or during deceleration. Inspect for any vacuum leaks downstream\nof the MAF sensor.\nUse a scan tool to monitor real-time sensor values from the MAF sensor, O2 sensors, etc.\nCheck for Technical Service Bulletins (TSBs) for your particular make/model in case of known issues on your vehicle\nThe barometric pressure (BARO) that is used in order to calculate the predicted\nMAF value is initially based on the\nMAP sensor\nat key ON.\nA high resistance on the ground circuit of the\nMAP\nsensor\ncan cause this DTC to set",
"If you do need to replace the MAF sensor, we recommend using an original equipment OEM one from the manufacturer rather than buying an aftermarket part.",
"Note: The use of a reusable oiled air filter could be a cause of this code, if it is over-oiled. Oil can transfer to the fine wire or film inside the MAF sensor and contaminate it. Use something such as\nMAF cleaner spray\nto clean the MAF in such situations. We do not recommend the use of oiled air filters.",
],
"Related DTC Discussions": [
"Register now to ask a question (free)",
"2003 Dakota 3.9l 4x4 P0100\nI am getting this code but it is not valid for my truck.. This is for Mass air flow sensor. I have a MAP sensor, no MAF. I replaced the MAP just in case. Dodge actually told me to do this. This can be cleared a few times using the reader, then it stays and won't won't clear. The truck will then not ...\nP0100,102,103,104,105 &106 S-10 99' 4.3\nBoth MAF & MAP sensor codes. I did a volt test on both. It would seem that I have 0v on my ground going into the MAP. PCM is supplying 4.7, but 3rd wire is only getting 3.5 NOT 5v. Getting only 3v on my MAF also, and I think 0v on my ground there as well. Recently replaced fuel pump and grounded...\nP0100, P0325 & P304 on 97' Nissan Altima\nI have a 97' Nissan Altima with nearly 200K miles. The car has been reliable but it sputters, hesitates and sometimes dies at low RPM. The problem codes are P0100, P0325 & P0304. Is the engine misfire causing the faulty Knock sensor? or could I get away with just changing the Knock sensor to...\n2004 SSR codes p0100 p0171 p0172 p0174 p0175 p0300 p0420 u1041\nTo Start I have a 2004 Chevrolet SSR with 129,000 miles. I have been having trouble since Winter 2014. It started when I went out to start my SSR after sitting in Garage in sub -45 temps. Started ok. But just as heat started to come out of Heater. P-0420. I shut down & waited till spring to star...\n2008 Mazda 3 multi codes (P0167, P0033, P0100, P0169)\nHello,\nI have a mazda 3 2008 110ch 1.6di turbo.\nMy car was ok but problem for 2 weeks.\nMy car start normally, the engine idle power holds well but if i accelerate, the engine stalls and cuts itself off. Before I can restart engine without problems but same problem if i accelerate. 
The diagnostic ca...\nenhanced p0100 dilema 2002 Dakota\nI have a 02 Dakota an had to check codes for evap an fixed it .there is no check engine light now...But a odd hidden code I pulled up in the \"dodge enhanced\" mode ..engine off but key on is saying P0100-mass air flow sensor --I Don't Have One.\nWhen engine is in run mode an enhanced mode I get coola...\n99 NISSAN FRONTIER CODE P0100, and loss of RPM\nI am working on my 1999 NISSAN FRONTIER XE, check engine code reads P0100: Mass Air Flow (MAF) Circuit Malfunction. Checked the wiring, no frays, or loss connections. Air filter is only a few months old.\nWhen sitting at idle, the engine is ok, once you start to decrease the RPM's. when you get down...\nNissan Altima GXE P0100 and stops at idle...does it\nI have Nissan GXE 2001 bought on 2005 w/ 75000 miles on it at that time. It has almost 110000 miles now. I have regular oil change every 3 months. Regular tire rotation and wheel balance. I just had coolant pump replaced two months before. I am having new problem now. At one point, at a turning (at ...\nChevy K1500 P0300 & P0100\nI have a 97 Chevy K1500 and I am in a bind. I started to get a rough idle and I replaced the spark plugs, that didn't help, the check engine light would come on and off intermittenly. I then discovered I had a bad converter, I replaced that. I didn't have a rough idle but it was hard to start. I not...\nToyota Land Cruiser P0100;P0110;P1115;P1405\nHi all,\nMaybe someone can help me with this combination of fault codes.\nCar: Toyota Land Cruiser 3.0 D4-D (2001)\nDriving symptoms:\n- small flutuation on acelaration when motor is with almost no charge (i.e. cruising at a constant speed 50 mph on leveled road). This is more noticeable when the en...",
],
"Need more help with a p0100 code?": [
"If you still need help regarding the P0100 trouble code, please\npost\nyour question in our FREE car repair forums\n.",
"NOTE: This information is presented for information purposes only.\nIt is not intended as repair advice and we are not responsible for any actions\nyou take on any vehicle. All information on this site is copyright protected.",
],
}
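To try the same idea without hitting the network, here is a minimal sketch that runs against the flat markup from the question. The helper name group_by_previous_h2 and the sample html string are made up for illustration, and html.parser is used only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the flat structure in the question (an assumption,
# not the real page).
html = """
<h2>Heading text</h2>
<p>Text I want to get</p>
<p>Text I want to get</p>
<h2>Heading text 2</h2>
<p>Text 2 I want to get</p>
"""


def group_by_previous_h2(markup):
    """Map each h2's text to the list of paragraph texts that follow it."""
    soup = BeautifulSoup(markup, "html.parser")
    grouped = {}
    for p in soup.find_all("p"):
        # find_previous walks backwards through the document and returns
        # the closest <h2> that appears before this paragraph.
        heading = p.find_previous("h2")
        if heading is None:
            # Paragraph appears before any heading: skip it.
            continue
        grouped.setdefault(heading.get_text(strip=True), []).append(
            p.get_text(strip=True)
        )
    return grouped


result = group_by_previous_h2(html)
print(result)
```

Note that find_previous looks at everything earlier in the document, not only at siblings, so on a real page a paragraph in a sidebar or footer would be attributed to whichever h2 happens to precede it. Restricting the selection to the main content container, as the answer does with .main, avoids that.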
CodePudding user response:
The second approach looks quite close to a solution, but I think you can make it a bit simpler. Select your <h2> elements and iterate over them. While iterating, call find_next_siblings() on each one and check whether tag.name is p; otherwise break out of the loop:
...
for e in soup.find_all('h2'):
    print('# ' + e.text)
    for s in e.find_next_siblings():
        if s.name == 'p':
            print(s.get_text(strip=True))
        else:
            print('-----------------')
            break
...
Note: if you would like to scrape additional information, you have to adjust this a bit and check for the tags you need. You could also invert the condition:
...
for e in soup.find_all('h2'):
    print('# ' + e.text)
    for s in e.find_next_siblings():
        if s.name == 'h2':
            print('-----------------')
            break
        else:
            print(s.get_text(strip=True))
...
Example
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.obd-codes.com/p0100").text
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(page, 'html.parser')

for e in soup.find_all('h2'):
    print('# ' + e.text)
    for s in e.find_next_siblings():
        if s.name == 'p':
            print(s.get_text(strip=True))
        else:
            print('-----------------')
            break
Output
# Technical Description
Mass or Volume Air Flow (MAF) Circuit Malfunction
-----------------
# What does that mean?
This diagnostic trouble code (DTC) is a generic powertrain code, which means that it applies to OBD-II equipped vehicles that have a mass airflow sensor. Brands include but are not limited to Toyota, Nissan, Vauxhall, Mercedes Benz, Mitsubishi, VW, Saturn, Ford, Jeep, Jaguar, Chevy, Infiniti, etc. Although generic, the specific repair steps may vary depending on make/model.
The MAF (mass air flow) sensor is a sensor mounted in a vehicle's engine air intake tract downstream from the air filter, and is used to measure the volume and density of air being drawn into the engine. The MAF sensor itself only measures a portion of the air entering and that value is used to calculate the total volume and density of air being ingested.
...
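The loops above print as they go. If you want the "array object of h2 / paragraphs" from the question instead, a minimal sketch could collect the results into a list of (heading, paragraphs) pairs. The function name paragraphs_by_heading and the sample html string are made up for illustration; the markup is the flat structure from the question plus one <ul> to show that non-paragraph siblings are skipped rather than ending the section.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question's structure (an assumption, not the
# real page). The <ul> is added to show how other tags are handled.
html = """
<h2>Heading text</h2>
<p>Text I want to get</p>
<ul><li>Not a paragraph</li></ul>
<p>Text I want to get</p>
<h2>Heading text 2</h2>
<p>Text 2 I want to get</p>
"""


def paragraphs_by_heading(markup):
    """Return a list of (h2 text, [paragraph texts]) pairs, in page order."""
    soup = BeautifulSoup(markup, "html.parser")
    sections = []
    for h2 in soup.find_all("h2"):
        paragraphs = []
        # With no arguments, find_next_siblings() yields only Tag siblings.
        for sibling in h2.find_next_siblings():
            if sibling.name == "h2":
                # The next heading starts a new section.
                break
            if sibling.name == "p":
                paragraphs.append(sibling.get_text(strip=True))
        sections.append((h2.get_text(strip=True), paragraphs))
    return sections


sections = paragraphs_by_heading(html)
for heading, paragraphs in sections:
    print(heading, paragraphs)
```

A list of pairs keeps the page order and tolerates duplicate heading text; if headings are unique, dict(sections) gives a heading-to-paragraphs mapping instead.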