Convert strings in a list to dataframe - Python

Time:11-15

I have scraped the necessary items from a PDF to convert them to a dataframe, but I'm having a hard time organizing the rows and columns correctly.

import PyPDF2

# open the PDF in binary mode and read it with PyPDF2
pdfFileObj = open('/FuentesDeDatos/AltaVista-Datos/lista3-pes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 

# create page object and extract text
pageObj = pdfReader.getPage(0)
page1 = pageObj.extractText()

# strip away page header and footer
page1 = page1[295:]
page1 = page1[:-61]

# Replace commas with periods, then turn newlines into commas
page1 = page1.replace(',','.')
page1 = page1.replace('\n',',')

#Split categories
page1 = page1.split(', Código Descripción Precio final,')
page1

Output:

['001-12 Discos rígidos',
'1779 HD 1 TB SATA 3 WD BLUE 64MB WD10EZEX,$ 9.041.78,1860 HD 2 TB SATA 3 WD BLUE 64MB WD20EZAZ,$ 10.467.19,1986 HD 2 TB SATA 6 SEAGATE BARRACUDA,$ 11.588.09,3119 HD 1 TB SATA 6 SEAGATE BARRACUDA 64MB,$ 8.254.13,3121 HD 2 TB SATA 6 SEAGATE SKYHAWK 64MB SURVEILLANCE,$ 12.739.61,32173 HD 2 TB SATA 3 WD PURPURA 64MB WD22PURZ,$ 13.900.90,32176 HD 1 TB SATA 3 WD PURPURA 64MB WD10PURZ,$ 10.942.26,001-13 Microprocesadores',
'1013 MICRO AMD RYZEN 7 5700G 4.6 GHZ 8 CORE AM4,$ 65.544.62,1014 MICRO AMD RYZEN 5-5600G 4.4 GHZ 8 CORE AM4,$ 42.564.93,1017 MICRO AMD RYZEN 5-4600G 4.2GHZ/11MB/AM4,$ 34.605.73,11052 MICRO INTEL CORE I7-9700  3GHZ/12MB/LGA1151,$ 61.534.02,1122 MICRO INTEL CORE I3-10100F/3.6GHZ/6MB/SOC 1200,$ 16.502.81,1126 MICRO INTEL CELERON G5925 SOC 1200/3.6GHZ/4MB,$ 10.344.88,1130 MICRO INTEL CORE I7 10700 2.9GHZ LGA 1200,$ 68.196.69,1139 MICRO INTEL CORE I5-10400/2.9GHZ/12 MB/LGA 1200,$ 41.014.79,11820 MICRO AMD RYZEN 7 PRO 4750G OEM FAN/3.6GHZ/8MB/AM4,$ 66.166.41,11821 MICRO INTEL CORE I9 12900/30MB/LGA 1700,$ 129.068.54,11822 MICRO INTEL CORE I3-12100F/12MB/SOC 1700,$ 26.501.69,1683 MICRO INTEL CORE I5-10400F/2.9GHZ/12MB/LGA 1200,$ 32.806.31,1684 MICRO INTEL PENTIUM GOLD G6400/4GHZ/4MB/LGA 1200,$ 13.297.60,3555 MICRO INTEL CORE I7-11700/2.5GHZ/16MB/SOC 1200,$ 73.711.21,3557 MICRO INTEL CORE I3-10105/3.7GHZ/6MB/SOC 1200,$ 30.940.77,3559 MICRO INTEL CORE I3-10105F/3.7GHZ/6MB/SOC 1200,$ 16.500.74,3572 MICRO INTEL CELERON G6900 ALDERLAKE/4MB/LGA1700,$ 13.017.25,3573 MICRO INTEL CORE I7-11700F/2.5GHZ/16MB/SOC 1200,$ 67.528.69,7772 MICRO AMD RYZEN 5-4500 PRO OEM FAN/4.0GHZ/4MB/AM4,$ 30.123.06,9214 MICRO INTEL CORE I7 10700F 2.9GHZ LGA 1200,$ 61.649.55,001-14 Memorias',
'12801 SODIMM DDR3 MEMOX 1GB 1333 MHZ DDR3,$ 2.069.36,142941 DDR4 HIKVISION U1 4GB 2666 MHZ 1.2V CL19  COMBO,$ 4.289.93,1434 DDR4 CRUCIAL 16GB 2666 MHZ CB16GU2666 COMBO,$ 11.069.53,14340 DDR4 CRUCIAL BY MICRON 8GB 2666MHZ CB8GU2666 COMBO,$ 6.477.18,14342 DDR4 CRUCIAL BY MICRON 8GB 3200 MHZ CT8G4DFRA32A,$ 6.586.94,14397 DDR5 CRUCIAL 8GB 4800MHZ CT8G48C40U5 1.1V,$ 11.447.20,14398 DDR5 CRUCIAL 16GB 4800MHZ CT16G48C40U5 1.1V,$ 22.436.33,9971 DDR2 GENERICA 2GB 800 MHZ,$ 1.814.44,001-16 Motherboards',
'12255 MOTHER  BIOSTAR E1-6010 DDR3 C/RADEON A68N-2100K,$ 15.109.34,12323 MOTHER GIGABYTE H410M-H V2 SOC LGA1200,$ 16.680.33,12327 MOTHER MSI H310M PRO-VDH SOC 1151/8º GEN/DDR4,$ 11.658.20,12329 MOTHER BIOSTAR TZ590-BTC DUO SOC1200 PCIE 8 1/10 S,$ 72.498.30,1233 MOTHER GIGABYTE  B560M DS3H V2  SOC 1200,$ 22.518.52,12330 MOTHER ASROCK H510 PRO BTC  SOC1200/6 PCI-E/10ºGEN,$ 69.298.24,12331 MOTHER GIGABYTE Z590 UD AC SOC 1200/ATX,$ 41.845.47,12334 MOTHER ASUS H510M-E PRIME SOC 1200/DDR4/MICRO ATX,$ 18.708.36,12335 MOTHER GIGABYTE H610M-S2H DDR4 SOC1700/MICRO ATX,$ 20.250.42'] 

The main idea is to convert this list of strings to a dataframe with three columns: column 1 = category (repeated for each item), column 2 = item, and column 3 = price. Note that the category for each string appears at the end of the previous string.

Wanted/Expected result

I have tried split(",") and converting the result to a dataframe, but it results in a mess.

I would really appreciate some help or guidance on how to do this. Many thanks in advance.

Answer:

You can try:

import re
import pandas as pd
from itertools import groupby

page1 = [
    "001-12 Discos rígidos",
    "1779 HD 1 TB SATA 3 WD BLUE 64MB WD10EZEX,$ 9.041.78,1860 HD 2 TB SATA 3 WD BLUE 64MB WD20EZAZ,$ 10.467.19,1986 HD 2 TB SATA 6 SEAGATE BARRACUDA,$ 11.588.09,3119 HD 1 TB SATA 6 SEAGATE BARRACUDA 64MB,$ 8.254.13,3121 HD 2 TB SATA 6 SEAGATE SKYHAWK 64MB SURVEILLANCE,$ 12.739.61,32173 HD 2 TB SATA 3 WD PURPURA 64MB WD22PURZ,$ 13.900.90,32176 HD 1 TB SATA 3 WD PURPURA 64MB WD10PURZ,$ 10.942.26,001-13 Microprocesadores",
    "1013 MICRO AMD RYZEN 7 5700G 4.6 GHZ 8 CORE AM4,$ 65.544.62,1014 MICRO AMD RYZEN 5-5600G 4.4 GHZ 8 CORE AM4,$ 42.564.93,1017 MICRO AMD RYZEN 5-4600G 4.2GHZ/11MB/AM4,$ 34.605.73,11052 MICRO INTEL CORE I7-9700  3GHZ/12MB/LGA1151,$ 61.534.02,1122 MICRO INTEL CORE I3-10100F/3.6GHZ/6MB/SOC 1200,$ 16.502.81,1126 MICRO INTEL CELERON G5925 SOC 1200/3.6GHZ/4MB,$ 10.344.88,1130 MICRO INTEL CORE I7 10700 2.9GHZ LGA 1200,$ 68.196.69,1139 MICRO INTEL CORE I5-10400/2.9GHZ/12 MB/LGA 1200,$ 41.014.79,11820 MICRO AMD RYZEN 7 PRO 4750G OEM FAN/3.6GHZ/8MB/AM4,$ 66.166.41,11821 MICRO INTEL CORE I9 12900/30MB/LGA 1700,$ 129.068.54,11822 MICRO INTEL CORE I3-12100F/12MB/SOC 1700,$ 26.501.69,1683 MICRO INTEL CORE I5-10400F/2.9GHZ/12MB/LGA 1200,$ 32.806.31,1684 MICRO INTEL PENTIUM GOLD G6400/4GHZ/4MB/LGA 1200,$ 13.297.60,3555 MICRO INTEL CORE I7-11700/2.5GHZ/16MB/SOC 1200,$ 73.711.21,3557 MICRO INTEL CORE I3-10105/3.7GHZ/6MB/SOC 1200,$ 30.940.77,3559 MICRO INTEL CORE I3-10105F/3.7GHZ/6MB/SOC 1200,$ 16.500.74,3572 MICRO INTEL CELERON G6900 ALDERLAKE/4MB/LGA1700,$ 13.017.25,3573 MICRO INTEL CORE I7-11700F/2.5GHZ/16MB/SOC 1200,$ 67.528.69,7772 MICRO AMD RYZEN 5-4500 PRO OEM FAN/4.0GHZ/4MB/AM4,$ 30.123.06,9214 MICRO INTEL CORE I7 10700F 2.9GHZ LGA 1200,$ 61.649.55,001-14 Memorias",
    "12801 SODIMM DDR3 MEMOX 1GB 1333 MHZ DDR3,$ 2.069.36,142941 DDR4 HIKVISION U1 4GB 2666 MHZ 1.2V CL19  COMBO,$ 4.289.93,1434 DDR4 CRUCIAL 16GB 2666 MHZ CB16GU2666 COMBO,$ 11.069.53,14340 DDR4 CRUCIAL BY MICRON 8GB 2666MHZ CB8GU2666 COMBO,$ 6.477.18,14342 DDR4 CRUCIAL BY MICRON 8GB 3200 MHZ CT8G4DFRA32A,$ 6.586.94,14397 DDR5 CRUCIAL 8GB 4800MHZ CT8G48C40U5 1.1V,$ 11.447.20,14398 DDR5 CRUCIAL 16GB 4800MHZ CT16G48C40U5 1.1V,$ 22.436.33,9971 DDR2 GENERICA 2GB 800 MHZ,$ 1.814.44,001-16 Motherboards",
    "12255 MOTHER  BIOSTAR E1-6010 DDR3 C/RADEON A68N-2100K,$ 15.109.34,12323 MOTHER GIGABYTE H410M-H V2 SOC LGA1200,$ 16.680.33,12327 MOTHER MSI H310M PRO-VDH SOC 1151/8º GEN/DDR4,$ 11.658.20,12329 MOTHER BIOSTAR TZ590-BTC DUO SOC1200 PCIE 8 1/10 S,$ 72.498.30,1233 MOTHER GIGABYTE  B560M DS3H V2  SOC 1200,$ 22.518.52,12330 MOTHER ASROCK H510 PRO BTC  SOC1200/6 PCI-E/10ºGEN,$ 69.298.24,12331 MOTHER GIGABYTE Z590 UD AC SOC 1200/ATX,$ 41.845.47,12334 MOTHER ASUS H510M-E PRIME SOC 1200/DDR4/MICRO ATX,$ 18.708.36,12335 MOTHER GIGABYTE H610M-S2H DDR4 SOC1700/MICRO ATX,$ 20.250.42",
]

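# Flatten everything into one list: category headers, item names, and prices.
page1 = ",".join(page1).split(",")
# Category headers look like "001-12 Discos rígidos": three digits, a dash,
# one or more digits, then a space.
pat = re.compile(r"\d{3}-\d+ ")

groups, last_group = {}, None
for k, g in groupby(page1, pat.match):
    if k:
        # A category header starts a new group.
        last_group = next(g)
        groups[last_group] = []
    else:
        # g is a single iterator, so zip(g, g) pairs each name with the price after it.
        groups[last_group].extend(zip(g, g))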

df = pd.DataFrame(
    [
        {"Col1": k, "Col2": name, "Col3": price}
        for k, v in groups.items()
        for name, price in v
    ]
)
print(df)

Prints:

                        Col1                                                      Col2          Col3
0      001-12 Discos rígidos                 1779 HD 1 TB SATA 3 WD BLUE 64MB WD10EZEX    $ 9.041.78
1      001-12 Discos rígidos                 1860 HD 2 TB SATA 3 WD BLUE 64MB WD20EZAZ   $ 10.467.19
2      001-12 Discos rígidos                     1986 HD 2 TB SATA 6 SEAGATE BARRACUDA   $ 11.588.09
3      001-12 Discos rígidos                3119 HD 1 TB SATA 6 SEAGATE BARRACUDA 64MB    $ 8.254.13
4      001-12 Discos rígidos     3121 HD 2 TB SATA 6 SEAGATE SKYHAWK 64MB SURVEILLANCE   $ 12.739.61
5      001-12 Discos rígidos             32173 HD 2 TB SATA 3 WD PURPURA 64MB WD22PURZ   $ 13.900.90
6      001-12 Discos rígidos             32176 HD 1 TB SATA 3 WD PURPURA 64MB WD10PURZ   $ 10.942.26
7   001-13 Microprocesadores           1013 MICRO AMD RYZEN 7 5700G 4.6 GHZ 8 CORE AM4   $ 65.544.62
8   001-13 Microprocesadores           1014 MICRO AMD RYZEN 5-5600G 4.4 GHZ 8 CORE AM4   $ 42.564.93
9   001-13 Microprocesadores              1017 MICRO AMD RYZEN 5-4600G 4.2GHZ/11MB/AM4   $ 34.605.73
10  001-13 Microprocesadores         11052 MICRO INTEL CORE I7-9700  3GHZ/12MB/LGA1151   $ 61.534.02
11  001-13 Microprocesadores       1122 MICRO INTEL CORE I3-10100F/3.6GHZ/6MB/SOC 1200   $ 16.502.81
12  001-13 Microprocesadores        1126 MICRO INTEL CELERON G5925 SOC 1200/3.6GHZ/4MB   $ 10.344.88
13  001-13 Microprocesadores            1130 MICRO INTEL CORE I7 10700 2.9GHZ LGA 1200   $ 68.196.69
14  001-13 Microprocesadores      1139 MICRO INTEL CORE I5-10400/2.9GHZ/12 MB/LGA 1200   $ 41.014.79
15  001-13 Microprocesadores  11820 MICRO AMD RYZEN 7 PRO 4750G OEM FAN/3.6GHZ/8MB/AM4   $ 66.166.41
16  001-13 Microprocesadores             11821 MICRO INTEL CORE I9 12900/30MB/LGA 1700  $ 129.068.54
17  001-13 Microprocesadores            11822 MICRO INTEL CORE I3-12100F/12MB/SOC 1700   $ 26.501.69
18  001-13 Microprocesadores      1683 MICRO INTEL CORE I5-10400F/2.9GHZ/12MB/LGA 1200   $ 32.806.31
19  001-13 Microprocesadores     1684 MICRO INTEL PENTIUM GOLD G6400/4GHZ/4MB/LGA 1200   $ 13.297.60
20  001-13 Microprocesadores       3555 MICRO INTEL CORE I7-11700/2.5GHZ/16MB/SOC 1200   $ 73.711.21
21  001-13 Microprocesadores        3557 MICRO INTEL CORE I3-10105/3.7GHZ/6MB/SOC 1200   $ 30.940.77
22  001-13 Microprocesadores       3559 MICRO INTEL CORE I3-10105F/3.7GHZ/6MB/SOC 1200   $ 16.500.74
23  001-13 Microprocesadores      3572 MICRO INTEL CELERON G6900 ALDERLAKE/4MB/LGA1700   $ 13.017.25
24  001-13 Microprocesadores      3573 MICRO INTEL CORE I7-11700F/2.5GHZ/16MB/SOC 1200   $ 67.528.69
25  001-13 Microprocesadores    7772 MICRO AMD RYZEN 5-4500 PRO OEM FAN/4.0GHZ/4MB/AM4   $ 30.123.06
26  001-13 Microprocesadores           9214 MICRO INTEL CORE I7 10700F 2.9GHZ LGA 1200   $ 61.649.55
27           001-14 Memorias                 12801 SODIMM DDR3 MEMOX 1GB 1333 MHZ DDR3    $ 2.069.36
28           001-14 Memorias    142941 DDR4 HIKVISION U1 4GB 2666 MHZ 1.2V CL19  COMBO    $ 4.289.93
29           001-14 Memorias          1434 DDR4 CRUCIAL 16GB 2666 MHZ CB16GU2666 COMBO   $ 11.069.53
30           001-14 Memorias  14340 DDR4 CRUCIAL BY MICRON 8GB 2666MHZ CB8GU2666 COMBO    $ 6.477.18
31           001-14 Memorias    14342 DDR4 CRUCIAL BY MICRON 8GB 3200 MHZ CT8G4DFRA32A    $ 6.586.94
32           001-14 Memorias           14397 DDR5 CRUCIAL 8GB 4800MHZ CT8G48C40U5 1.1V   $ 11.447.20
33           001-14 Memorias         14398 DDR5 CRUCIAL 16GB 4800MHZ CT16G48C40U5 1.1V   $ 22.436.33
34           001-14 Memorias                            9971 DDR2 GENERICA 2GB 800 MHZ    $ 1.814.44
35       001-16 Motherboards    12255 MOTHER  BIOSTAR E1-6010 DDR3 C/RADEON A68N-2100K   $ 15.109.34
36       001-16 Motherboards              12323 MOTHER GIGABYTE H410M-H V2 SOC LGA1200   $ 16.680.33
37       001-16 Motherboards       12327 MOTHER MSI H310M PRO-VDH SOC 1151/8º GEN/DDR4   $ 11.658.20
38       001-16 Motherboards  12329 MOTHER BIOSTAR TZ590-BTC DUO SOC1200 PCIE 8 1/10 S   $ 72.498.30
39       001-16 Motherboards             1233 MOTHER GIGABYTE  B560M DS3H V2  SOC 1200   $ 22.518.52
40       001-16 Motherboards  12330 MOTHER ASROCK H510 PRO BTC  SOC1200/6 PCI-E/10ºGEN   $ 69.298.24
41       001-16 Motherboards             12331 MOTHER GIGABYTE Z590 UD AC SOC 1200/ATX   $ 41.845.47
42       001-16 Motherboards   12334 MOTHER ASUS H510M-E PRIME SOC 1200/DDR4/MICRO ATX   $ 18.708.36
43       001-16 Motherboards    12335 MOTHER GIGABYTE H610M-S2H DDR4 SOC1700/MICRO ATX   $ 20.250.42
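
The values in Col3 are still strings such as "$ 9.041.78", because the preprocessing in the question replaced every comma with a period. If numeric prices are needed, one possible follow-up is sketched below. It assumes that the last period in each price is the decimal separator and that every earlier period is a thousands separator. The helper name `parse_price` is hypothetical and not part of the answer above.

```python
def parse_price(text: str) -> float:
    """Convert a string like '$ 9.041.78' to the float 9041.78.

    Assumption: the last period is the decimal separator and any
    earlier period is a thousands separator.
    """
    digits = text.replace("$", "").strip()
    whole, sep, cents = digits.rpartition(".")
    if not sep:
        # No period at all: treat the value as a whole number.
        return float(digits)
    return float(whole.replace(".", "") + "." + cents)


# A few price strings taken from the output above.
samples = ["$ 9.041.78", "$ 129.068.54", "$ 1.814.44"]
prices = [parse_price(s) for s in samples]
print(prices)
```

With the dataframe from the answer, the same helper could be applied to the whole column with `df["Col3"].map(parse_price)`.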