How to split scraped text and create a dataframe?

Below is my code.

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
r = requests.get("https://www.gutenberg.org/browse/scores/top")
soup = BeautifulSoup(r.content, "lxml")
List1 = soup.find_all('ol')
List1

for ol in List1:
    for li in ol.find_all('li'):
        for link in li.find_all('a'):
            # Print the text of every link found inside the list item
            print(link.get_text())

My output is one string per link, for example:

Middlemarch by George Eliot (34900)

I want to convert the output into a list of lists:

[['A Room with a View by E. M.  Forster (37480)'], ['Middlemarch by George Eliot (34900)'],['Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (31929)']]

Then split each entry into two parts:

[["A Room with a View by E. M.  Forster", "37480"], ["Middlemarch by George Eliot", "34900"],["Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott", "31929"]]

Finally, load the data into a DataFrame.

CodePudding user response：

Simplify your code by selecting the elements more specifically:

for e in soup.select('ol a'):
    # Split on the last '(' so parentheses inside a title are left alone
    data.append({
        'Ebook': e.text.rsplit('(', 1)[0].strip(),
        'Code': e.text.rsplit('(', 1)[-1].strip(')')
    })

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get("https://www.gutenberg.org/browse/scores/top")
soup = BeautifulSoup(r.content, "lxml")

data = []

# Every link inside an ordered list is one entry
for e in soup.select('ol a'):
    # Split on the last '(' so parentheses inside a title are left alone
    data.append({
        'Ebook': e.text.rsplit('(', 1)[0].strip(),
        'Code': e.text.rsplit('(', 1)[-1].strip(')')
    })
pd.DataFrame(data)

Output

	Ebook	Code
0	A Room with a View by E. M. Forster	37480
1	Middlemarch by George Eliot	34900
2	Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott	31929
3	The Enchanted April by Elizabeth Von Arnim	31648
4	The Blue Castle: a novel by L. M. Montgomery	30646
5	Moby Dick; Or, The Whale by Herman Melville	30426
6	The Complete Works of William Shakespeare by William Shakespeare	30266

...
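To check the splitting logic without requesting the page, here is a minimal sketch that runs on the three sample strings from the question. The names samples and split_entry are illustrative and not part of the original code, and the sketch assumes every entry ends with a number in parentheses.

```python
import pandas as pd

# Sample strings copied from the question (stand-ins for the scraped link texts).
samples = [
    "A Room with a View by E. M.  Forster (37480)",
    "Middlemarch by George Eliot (34900)",
    "Little Women; Or, Meg, Jo, Beth, and Amy by Louisa May Alcott (31929)",
]


def split_entry(text):
    """Split 'Title by Author (12345)' into ['Title by Author', '12345'].

    Hypothetical helper: it splits on the last '(' so that parentheses
    inside a title are left alone. Assumes the text ends with '(digits)'.
    """
    title, _, code = text.rpartition("(")
    return [title.strip(), code.strip().rstrip(")")]


# Intermediate list of lists, as asked for in the question.
rows = [split_entry(s) for s in samples]

df = pd.DataFrame(rows, columns=["Ebook", "Code"])
print(df)
```

With the live page, the same helper could be applied to e.text for each element returned by soup.select('ol a').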

CodePudding user response：

You can do it in one step with a short regex and str.extract:

df = (pd.Series([e.text for e in soup.select('ol a')])
        .str.extract(r'(.*) \((\d+)\)$')
        .set_axis(['Ebooks', 'Code'], axis=1)
     )

If you need the intermediate list of lists:

import re

L = [list(m.groups()) for e in soup.select('ol a')
     if (m:=re.search(r'(.*) \((\d+)\)$', e.text))]

df = pd.DataFrame(L, columns=['Ebooks', 'Code'])

output:

                                                Ebooks   Code
0                 A Room with a View by E. M.  Forster  37480
1                          Middlemarch by George Eliot  34900
2    Little Women; Or, Meg, Jo, Beth, and Amy by Lo...  31929
3           The Enchanted April by Elizabeth Von Arnim  31648
4        The Blue Castle: a novel by L. M.  Montgomery  30646
..                                                 ...    ...
395                           Hapgood, Isabel Florence  12240
396                                  Mill, John Stuart  12223
397                               Marlowe, Christopher  11760
398                                     Wharton, Edith  11728
399                           Burnett, Frances Hodgson  11630

[400 rows x 2 columns]
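Note that the Code column holds strings after extraction. If numeric sorting or arithmetic is needed, a conversion along these lines may help. This is only a sketch on sample strings rather than the live page; the texts list is an assumption for illustration, and astype(int) assumes every row matched the pattern (no missing values).

```python
import pandas as pd

# Illustrative stand-ins for the link texts scraped from the page.
texts = [
    "A Room with a View by E. M.  Forster (37480)",
    "Middlemarch by George Eliot (34900)",
    "Wharton, Edith (11728)",
]

# Two capture groups: everything before the final " (digits)" and the digits.
df = pd.Series(texts).str.extract(r"(.*) \((\d+)\)$")
df.columns = ["Ebooks", "Code"]

# str.extract returns strings; convert the numbers to integers.
df["Code"] = df["Code"].astype(int)

# Sorting now orders the rows numerically rather than lexically.
print(df.sort_values("Code"))
```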