separating list items python

Hi, please can anyone help me with this list? I want to separate the data into three parts. The whole block below sits at a single index of the list, and every index of the list holds data of this kind:

[website='https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine' 
page_url='https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine?answertab=active#tab-top'
data={'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']}
]

so that I can fetch each piece of data independently, like this:

website='https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine' 
page_url='https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine?answertab=active#tab-top'
data={'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']}

In my script I've been able to convert the list item to a string and split it again, but I'm not getting the right result:

from extract_emails import DefaultFilterAndEmailAndLinkedinFactory as Factory
from extract_emails import DefaultWorker
from extract_emails.browsers.requests_browser import RequestsBrowser as Browser

browser = Browser()
print('Scraping.....')
# url = 'https://en.wikipedia.org/'
url = 'https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine'
factory = Factory(website_url=url, browser=browser, depth = 1, max_links_from_page=5)
worker = DefaultWorker(factory)
data = worker.get_data()
# ------------convert the data to a string----------#
part1 = str(data[3])
print(part1)
#-convert string to a list------#
list1 = list(part1.split())
print(list1)
#-------------------------#
value1 = list1[0]
value2 = list1[1]
value3 = list1[2]
print(value1)
print(value2)
print(value3)

But after implementing the logic above I get the result below, which cuts off the email part that I need:

website='https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine'
page_url='https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine?answertab=active#tab-top'
data={'email':

CodePudding user response:

Start with splitlines() to split the original data into lines, then split each line once at the first =.

Use ast.literal_eval() to parse the strings and dictionaries into Python objects.

import ast

# part1 is the string form of a single result, as built in the question
lines = part1.splitlines()
# split each line once at the first '=' and evaluate the right-hand side
website = ast.literal_eval(lines[0].split('=', 1)[1])
page_url = ast.literal_eval(lines[1].split('=', 1)[1])
data = ast.literal_eval(lines[2].split('=', 1)[1])
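As a side note: if the items returned by worker.get_data() are objects rather than plain strings, the string parsing may not be needed at all. The repr shown in the question (website=... page_url=... data=...) suggests each item exposes those fields as attributes, but that is an assumption about the extract_emails library, so check it before relying on it:

# assumption: each result object exposes website, page_url and data
# as attributes (suggested by the repr shown in the question)
results = worker.get_data()      # same list the question stores in `data`
item = results[3]
print(item.website)              # the start URL
print(item.page_url)             # the page the emails were found on
print(item.data['email'])        # the list of extracted addresses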

CodePudding user response:

Splitting based on the website, page_url and data keywords:

    from extract_emails import DefaultFilterAndEmailAndLinkedinFactory as Factory
    from extract_emails import DefaultWorker
    from extract_emails.browsers.requests_browser import RequestsBrowser as Browser
    
    browser = Browser()
    print('Scraping.....')
    # url = 'https://en.wikipedia.org/'
    url = 'https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine'
    factory = Factory(website_url=url, browser=browser, depth = 1, max_links_from_page=5)
    worker = DefaultWorker(factory)
    data = worker.get_data()
    # ------------convert the data to a string----------#
    part1 = str(data[3])
    # collect the three parts in a list
    strs_list = []
    # everything before "page_url" is the website part
    web = part1.split("page_url")[0]
    strs_list.append(web)
    part1 = part1.replace(web, "")
    # everything before "data" in the remainder is the page_url part
    page_url = part1.split("data")[0]
    strs_list.append(page_url)
    part1 = part1.replace(page_url, "")
    # whatever is left is the data part
    strs_list.append(part1)
    for i in strs_list:
        print(i)

***Output:***
Scraping.....
website='https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine' 
page_url='https://stackoverflow.com/questions/20084356/python-3-email-extracting-search-engine?answertab=active#tab-top' 
data={'email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']}
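For completeness, here is a compact variant of the same idea that also turns the three pieces into real Python values with ast.literal_eval. It is only a sketch: it assumes the string always keeps the website, page_url, data order shown above and that the URLs themselves don't contain those keywords.

import ast

def parse_item(text):
    # cut the string at the field names, then evaluate each value
    web_part, _, rest = text.partition('page_url')
    page_part, _, data_part = rest.partition('data')
    fields = {}
    for chunk in (web_part, 'page_url' + page_part, 'data' + data_part):
        key, _, value = chunk.partition('=')
        fields[key.strip()] = ast.literal_eval(value.strip())
    return fields

parsed = parse_item(str(data[3]))
print(parsed['website'])
print(parsed['page_url'])
print(parsed['data']['email'])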