I'm new to this community and hope you can help me with my problem. In my current project I want to scrape a page that lists gas stations, each with several pieces of information. At the moment, all the information from all the gas stations is stored in a single variable. However, I want each gas station to have its own row so that I end up with one large data frame. Each gas station has an id, and the ids are stored in the variable ids.
import base64
import json

import pandas as pd
from bs4 import BeautifulSoup, NavigableString

# results, s (presumably a requests session) and headers are defined earlier in the script
ids = results["objectID"].tolist()
id_details = []
for i, id in enumerate(ids):
    input_dict = {
        'diff_time_zone': -1,
        'objectID': id,
        'poiposition': '50.5397219 8.7328552',
        'stateAll': '2',
        'category': 1,
        'language': 'de',
        'prognosis_offset': -1,
        'windowSize': 305
    }
    # The JSON payload is base64-encoded and sent in the "post" form field
    encoded_input_string = json.dumps(input_dict, indent=2).encode('utf-8')
    encoded_input_string = base64.b64encode(encoded_input_string).decode("utf-8")
    r = s.post("https://example.me/getObject_detail.php", headers=headers, data="post=" + encoded_input_string)
    soup = BeautifulSoup(r.text, "lxml")
    lists = soup.find('div', class_='inside')
    rs = lists.find_all("p")
    final = []
    for lists in rs:
        txt = lists if type(lists) == NavigableString else lists.text
        id_details.append(txt)
# Every <p> of every station ends up as its own row in a single column
df = pd.DataFrame(id_details, columns=['place'])
CodePudding user response:
Well, personally I would use a database rather than a data frame in this case, and probably would not save the data as a file. As far as I can see, the data is dictionary-based, so it could easily be stored in Elasticsearch, for example.
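To illustrate the database idea, here is a minimal sketch of storing one record per station. It uses sqlite3 from the Python standard library as a stand-in, not Elasticsearch, so that it runs without a server. The table name `stations`, the helpers `open_store` and `save_station`, and the sample values are all made up for illustration.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stations ("
        "object_id TEXT PRIMARY KEY, details TEXT NOT NULL)"
    )
    return conn


def save_station(conn, object_id, details):
    """Store one station as one row; details is a list of scraped strings."""
    conn.execute(
        "INSERT OR REPLACE INTO stations (object_id, details) VALUES (?, ?)",
        (str(object_id), json.dumps(details, ensure_ascii=False)),
    )


# Hypothetical sample data standing in for the scraped <p> texts.
conn = open_store()
save_station(conn, "1001", ["Station A", "Main Street 1"])
save_station(conn, "1002", ["Station B"])
conn.commit()

rows = conn.execute(
    "SELECT object_id, details FROM stations ORDER BY object_id"
).fetchall()
```

In the scraping loop you would call `save_station(conn, id, texts)` once per id instead of appending to a list, so nothing has to be kept in memory between requests.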
If something forces you to use a DataFrame instead of a database, opening a file and appending to the end of it works fine. Make each chunk as large as you can, because opening and writing to the file is the bottleneck here. You still have to work in chunks, though, because RAM is not unlimited.
--- Update: an example of the second approach was requested.
Some parts of your code are missing, but you will get the idea.
file_name = 'My_File.csv'
cols = ['place']  # e.g. create an empty CSV with only one column, place, using pandas
data = dict(zip(cols, [[] for i in range(len(cols))]))
df = pd.DataFrame(data)  # empty DataFrame with the target columns
df.to_csv(file_name, mode='w', index=False, header=True)  # write the header row once
id_details = {'place': []}
for i, id in enumerate(ids):
    # Some algo... (request and parsing as in the question, producing rs)
    for lists in rs:
        # txt is extracted from lists as in the question
        id_details['place'].append(txt)
    # Flush every 100 ids; pick a chunk size that fits your memory
    if i % 100 == 0:
        df_main = pd.DataFrame(id_details)
        df_main.to_csv(file_name, mode='a', index=False, header=False)
        id_details['place'] = []
# Write whatever is left in the buffer after the loop
df_main = pd.DataFrame(id_details)
df_main.to_csv(file_name, mode='a', index=False, header=False)
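Note that the code above still produces one row per paragraph, not one row per gas station. To get one row per station, you could build one dict per id and create the DataFrame from the list of dicts. This is only a sketch: the helper `station_row`, the column names `detail_1`, `detail_2`, ... and the sample texts are made up, since I don't know what the paragraphs on the real page contain.

```python
import pandas as pd


def station_row(object_id, paragraphs):
    """Build one dict (one DataFrame row) from the texts of a single station."""
    row = {"objectID": object_id}
    for n, text in enumerate(paragraphs, start=1):
        row[f"detail_{n}"] = text.strip()
    return row


# Hypothetical sample input: id -> list of <p> texts scraped for that id.
scraped = {
    "1001": ["Station A", "Main Street 1", "Open 24h"],
    "1002": ["Station B", "Side Road 7"],
}

rows = [station_row(oid, texts) for oid, texts in scraped.items()]
df = pd.DataFrame(rows)  # one row per station, missing details become NaN
```

In your loop, that means collecting the `txt` values for the current id into a list and appending `station_row(id, texts)` to `rows` once per id, instead of appending every `txt` to `id_details`.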