I have scraped the necessary items from a PDF to convert them into a DataFrame, but I'm having a hard time organizing the rows and columns correctly.
import PyPDF2

# Open the PDF and read it with PyPDF2.
# PdfFileReader, getPage and extractText are the pre-3.0 PyPDF2 names.
pdfFileObj = open('/FuentesDeDatos/AltaVista-Datos/lista3-pes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# create page object and extract text
pageObj = pdfReader.getPage(0)
page1 = pageObj.extractText()
# strip away page header and footer
page1 = page1[295:]
page1 = page1[:-61]
#Replace Characters
page1 = page1.replace(',','.')
page1 = page1.replace('\n',',')
#Split categories
page1 = page1.split(', Código Descripción Precio final,')
page1
['001-12 Discos rígidos',
'1779 HD 1 TB SATA 3 WD BLUE 64MB WD10EZEX,$ 9.041.78,1860 HD 2 TB SATA 3 WD BLUE 64MB WD20EZAZ,$ 10.467.19,1986 HD 2 TB SATA 6 SEAGATE BARRACUDA,$ 11.588.09,3119 HD 1 TB SATA 6 SEAGATE BARRACUDA 64MB,$ 8.254.13,3121 HD 2 TB SATA 6 SEAGATE SKYHAWK 64MB SURVEILLANCE,$ 12.739.61,32173 HD 2 TB SATA 3 WD PURPURA 64MB WD22PURZ,$ 13.900.90,32176 HD 1 TB SATA 3 WD PURPURA 64MB WD10PURZ,$ 10.942.26,001-13 Microprocesadores',
'1013 MICRO AMD RYZEN 7 5700G 4.6 GHZ 8 CORE AM4,$ 65.544.62,1014 MICRO AMD RYZEN 5-5600G 4.4 GHZ 8 CORE AM4,$ 42.564.93,1017 MICRO AMD RYZEN 5-4600G 4.2GHZ/11MB/AM4,$ 34.605.73,11052 MICRO INTEL CORE I7-9700 3GHZ/12MB/LGA1151,$ 61.534.02,1122 MICRO INTEL CORE I3-10100F/3.6GHZ/6MB/SOC 1200,$ 16.502.81,1126 MICRO INTEL CELERON G5925 SOC 1200/3.6GHZ/4MB,$ 10.344.88,1130 MICRO INTEL CORE I7 10700 2.9GHZ LGA 1200,$ 68.196.69,1139 MICRO INTEL CORE I5-10400/2.9GHZ/12 MB/LGA 1200,$ 41.014.79,11820 MICRO AMD RYZEN 7 PRO 4750G OEM FAN/3.6GHZ/8MB/AM4,$ 66.166.41,11821 MICRO INTEL CORE I9 12900/30MB/LGA 1700,$ 129.068.54,11822 MICRO INTEL CORE I3-12100F/12MB/SOC 1700,$ 26.501.69,1683 MICRO INTEL CORE I5-10400F/2.9GHZ/12MB/LGA 1200,$ 32.806.31,1684 MICRO INTEL PENTIUM GOLD G6400/4GHZ/4MB/LGA 1200,$ 13.297.60,3555 MICRO INTEL CORE I7-11700/2.5GHZ/16MB/SOC 1200,$ 73.711.21,3557 MICRO INTEL CORE I3-10105/3.7GHZ/6MB/SOC 1200,$ 30.940.77,3559 MICRO INTEL CORE I3-10105F/3.7GHZ/6MB/SOC 1200,$ 16.500.74,3572 MICRO INTEL CELERON G6900 ALDERLAKE/4MB/LGA1700,$ 13.017.25,3573 MICRO INTEL CORE I7-11700F/2.5GHZ/16MB/SOC 1200,$ 67.528.69,7772 MICRO AMD RYZEN 5-4500 PRO OEM FAN/4.0GHZ/4MB/AM4,$ 30.123.06,9214 MICRO INTEL CORE I7 10700F 2.9GHZ LGA 1200,$ 61.649.55,001-14 Memorias',
'12801 SODIMM DDR3 MEMOX 1GB 1333 MHZ DDR3,$ 2.069.36,142941 DDR4 HIKVISION U1 4GB 2666 MHZ 1.2V CL19 COMBO,$ 4.289.93,1434 DDR4 CRUCIAL 16GB 2666 MHZ CB16GU2666 COMBO,$ 11.069.53,14340 DDR4 CRUCIAL BY MICRON 8GB 2666MHZ CB8GU2666 COMBO,$ 6.477.18,14342 DDR4 CRUCIAL BY MICRON 8GB 3200 MHZ CT8G4DFRA32A,$ 6.586.94,14397 DDR5 CRUCIAL 8GB 4800MHZ CT8G48C40U5 1.1V,$ 11.447.20,14398 DDR5 CRUCIAL 16GB 4800MHZ CT16G48C40U5 1.1V,$ 22.436.33,9971 DDR2 GENERICA 2GB 800 MHZ,$ 1.814.44,001-16 Motherboards',
'12255 MOTHER BIOSTAR E1-6010 DDR3 C/RADEON A68N-2100K,$ 15.109.34,12323 MOTHER GIGABYTE H410M-H V2 SOC LGA1200,$ 16.680.33,12327 MOTHER MSI H310M PRO-VDH SOC 1151/8º GEN/DDR4,$ 11.658.20,12329 MOTHER BIOSTAR TZ590-BTC DUO SOC1200 PCIE 8 1/10 S,$ 72.498.30,1233 MOTHER GIGABYTE B560M DS3H V2 SOC 1200,$ 22.518.52,12330 MOTHER ASROCK H510 PRO BTC SOC1200/6 PCI-E/10ºGEN,$ 69.298.24,12331 MOTHER GIGABYTE Z590 UD AC SOC 1200/ATX,$ 41.845.47,12334 MOTHER ASUS H510M-E PRIME SOC 1200/DDR4/MICRO ATX,$ 18.708.36,12335 MOTHER GIGABYTE H610M-S2H DDR4 SOC1700/MICRO ATX,$ 20.250.42']
The goal is to convert this list of strings into a DataFrame with the following structure: column 1 = category (repeated for every item), column 2 = item, and column 3 = price. Note that the category for each string appears at the end of the previous string.
I have tried split(",") and converting the result to a DataFrame, but it results in a mess.
I would really appreciate some help or guidance on how to do this. Many thanks in advance.
Answer:
You can try:
import re
import pandas as pd
from itertools import groupby
page1 = [
"001-12 Discos rígidos",
"1779 HD 1 TB SATA 3 WD BLUE 64MB WD10EZEX,$ 9.041.78,1860 HD 2 TB SATA 3 WD BLUE 64MB WD20EZAZ,$ 10.467.19,1986 HD 2 TB SATA 6 SEAGATE BARRACUDA,$ 11.588.09,3119 HD 1 TB SATA 6 SEAGATE BARRACUDA 64MB,$ 8.254.13,3121 HD 2 TB SATA 6 SEAGATE SKYHAWK 64MB SURVEILLANCE,$ 12.739.61,32173 HD 2 TB SATA 3 WD PURPURA 64MB WD22PURZ,$ 13.900.90,32176 HD 1 TB SATA 3 WD PURPURA 64MB WD10PURZ,$ 10.942.26,001-13 Microprocesadores",
"1013 MICRO AMD RYZEN 7 5700G 4.6 GHZ 8 CORE AM4,$ 65.544.62,1014 MICRO AMD RYZEN 5-5600G 4.4 GHZ 8 CORE AM4,$ 42.564.93,1017 MICRO AMD RYZEN 5-4600G 4.2GHZ/11MB/AM4,$ 34.605.73,11052 MICRO INTEL CORE I7-9700 3GHZ/12MB/LGA1151,$ 61.534.02,1122 MICRO INTEL CORE I3-10100F/3.6GHZ/6MB/SOC 1200,$ 16.502.81,1126 MICRO INTEL CELERON G5925 SOC 1200/3.6GHZ/4MB,$ 10.344.88,1130 MICRO INTEL CORE I7 10700 2.9GHZ LGA 1200,$ 68.196.69,1139 MICRO INTEL CORE I5-10400/2.9GHZ/12 MB/LGA 1200,$ 41.014.79,11820 MICRO AMD RYZEN 7 PRO 4750G OEM FAN/3.6GHZ/8MB/AM4,$ 66.166.41,11821 MICRO INTEL CORE I9 12900/30MB/LGA 1700,$ 129.068.54,11822 MICRO INTEL CORE I3-12100F/12MB/SOC 1700,$ 26.501.69,1683 MICRO INTEL CORE I5-10400F/2.9GHZ/12MB/LGA 1200,$ 32.806.31,1684 MICRO INTEL PENTIUM GOLD G6400/4GHZ/4MB/LGA 1200,$ 13.297.60,3555 MICRO INTEL CORE I7-11700/2.5GHZ/16MB/SOC 1200,$ 73.711.21,3557 MICRO INTEL CORE I3-10105/3.7GHZ/6MB/SOC 1200,$ 30.940.77,3559 MICRO INTEL CORE I3-10105F/3.7GHZ/6MB/SOC 1200,$ 16.500.74,3572 MICRO INTEL CELERON G6900 ALDERLAKE/4MB/LGA1700,$ 13.017.25,3573 MICRO INTEL CORE I7-11700F/2.5GHZ/16MB/SOC 1200,$ 67.528.69,7772 MICRO AMD RYZEN 5-4500 PRO OEM FAN/4.0GHZ/4MB/AM4,$ 30.123.06,9214 MICRO INTEL CORE I7 10700F 2.9GHZ LGA 1200,$ 61.649.55,001-14 Memorias",
"12801 SODIMM DDR3 MEMOX 1GB 1333 MHZ DDR3,$ 2.069.36,142941 DDR4 HIKVISION U1 4GB 2666 MHZ 1.2V CL19 COMBO,$ 4.289.93,1434 DDR4 CRUCIAL 16GB 2666 MHZ CB16GU2666 COMBO,$ 11.069.53,14340 DDR4 CRUCIAL BY MICRON 8GB 2666MHZ CB8GU2666 COMBO,$ 6.477.18,14342 DDR4 CRUCIAL BY MICRON 8GB 3200 MHZ CT8G4DFRA32A,$ 6.586.94,14397 DDR5 CRUCIAL 8GB 4800MHZ CT8G48C40U5 1.1V,$ 11.447.20,14398 DDR5 CRUCIAL 16GB 4800MHZ CT16G48C40U5 1.1V,$ 22.436.33,9971 DDR2 GENERICA 2GB 800 MHZ,$ 1.814.44,001-16 Motherboards",
"12255 MOTHER BIOSTAR E1-6010 DDR3 C/RADEON A68N-2100K,$ 15.109.34,12323 MOTHER GIGABYTE H410M-H V2 SOC LGA1200,$ 16.680.33,12327 MOTHER MSI H310M PRO-VDH SOC 1151/8º GEN/DDR4,$ 11.658.20,12329 MOTHER BIOSTAR TZ590-BTC DUO SOC1200 PCIE 8 1/10 S,$ 72.498.30,1233 MOTHER GIGABYTE B560M DS3H V2 SOC 1200,$ 22.518.52,12330 MOTHER ASROCK H510 PRO BTC SOC1200/6 PCI-E/10ºGEN,$ 69.298.24,12331 MOTHER GIGABYTE Z590 UD AC SOC 1200/ATX,$ 41.845.47,12334 MOTHER ASUS H510M-E PRIME SOC 1200/DDR4/MICRO ATX,$ 18.708.36,12335 MOTHER GIGABYTE H610M-S2H DDR4 SOC1700/MICRO ATX,$ 20.250.42",
]
page1 = ",".join(page1).split(",")
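# Category headers look like "001-12 Discos rígidos": three digits, a dash, one or more digits, a space.
pat = re.compile(r"\d{3}-\d+ ")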
groups, last_group = {}, None
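# groupby yields a category header (match object) followed by its run of item/price strings (None).
for k, g in groupby(page1, pat.match):
    if k:
        # Category header: start a new, empty group for it.
        last_group = next(g)
        groups[last_group] = []
    else:
        # zip(g, g) consumes the same iterator twice, pairing each item with its price.
        groups[last_group].extend(zip(g, g))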
df = pd.DataFrame(
    [
        {"Col1": k, "Col2": name, "Col3": price}
        for k, v in groups.items()
        for name, price in v
    ]
)
print(df)
Prints:
Col1 Col2 Col3
0 001-12 Discos rígidos 1779 HD 1 TB SATA 3 WD BLUE 64MB WD10EZEX $ 9.041.78
1 001-12 Discos rígidos 1860 HD 2 TB SATA 3 WD BLUE 64MB WD20EZAZ $ 10.467.19
2 001-12 Discos rígidos 1986 HD 2 TB SATA 6 SEAGATE BARRACUDA $ 11.588.09
3 001-12 Discos rígidos 3119 HD 1 TB SATA 6 SEAGATE BARRACUDA 64MB $ 8.254.13
4 001-12 Discos rígidos 3121 HD 2 TB SATA 6 SEAGATE SKYHAWK 64MB SURVEILLANCE $ 12.739.61
5 001-12 Discos rígidos 32173 HD 2 TB SATA 3 WD PURPURA 64MB WD22PURZ $ 13.900.90
6 001-12 Discos rígidos 32176 HD 1 TB SATA 3 WD PURPURA 64MB WD10PURZ $ 10.942.26
7 001-13 Microprocesadores 1013 MICRO AMD RYZEN 7 5700G 4.6 GHZ 8 CORE AM4 $ 65.544.62
8 001-13 Microprocesadores 1014 MICRO AMD RYZEN 5-5600G 4.4 GHZ 8 CORE AM4 $ 42.564.93
9 001-13 Microprocesadores 1017 MICRO AMD RYZEN 5-4600G 4.2GHZ/11MB/AM4 $ 34.605.73
10 001-13 Microprocesadores 11052 MICRO INTEL CORE I7-9700 3GHZ/12MB/LGA1151 $ 61.534.02
11 001-13 Microprocesadores 1122 MICRO INTEL CORE I3-10100F/3.6GHZ/6MB/SOC 1200 $ 16.502.81
12 001-13 Microprocesadores 1126 MICRO INTEL CELERON G5925 SOC 1200/3.6GHZ/4MB $ 10.344.88
13 001-13 Microprocesadores 1130 MICRO INTEL CORE I7 10700 2.9GHZ LGA 1200 $ 68.196.69
14 001-13 Microprocesadores 1139 MICRO INTEL CORE I5-10400/2.9GHZ/12 MB/LGA 1200 $ 41.014.79
15 001-13 Microprocesadores 11820 MICRO AMD RYZEN 7 PRO 4750G OEM FAN/3.6GHZ/8MB/AM4 $ 66.166.41
16 001-13 Microprocesadores 11821 MICRO INTEL CORE I9 12900/30MB/LGA 1700 $ 129.068.54
17 001-13 Microprocesadores 11822 MICRO INTEL CORE I3-12100F/12MB/SOC 1700 $ 26.501.69
18 001-13 Microprocesadores 1683 MICRO INTEL CORE I5-10400F/2.9GHZ/12MB/LGA 1200 $ 32.806.31
19 001-13 Microprocesadores 1684 MICRO INTEL PENTIUM GOLD G6400/4GHZ/4MB/LGA 1200 $ 13.297.60
20 001-13 Microprocesadores 3555 MICRO INTEL CORE I7-11700/2.5GHZ/16MB/SOC 1200 $ 73.711.21
21 001-13 Microprocesadores 3557 MICRO INTEL CORE I3-10105/3.7GHZ/6MB/SOC 1200 $ 30.940.77
22 001-13 Microprocesadores 3559 MICRO INTEL CORE I3-10105F/3.7GHZ/6MB/SOC 1200 $ 16.500.74
23 001-13 Microprocesadores 3572 MICRO INTEL CELERON G6900 ALDERLAKE/4MB/LGA1700 $ 13.017.25
24 001-13 Microprocesadores 3573 MICRO INTEL CORE I7-11700F/2.5GHZ/16MB/SOC 1200 $ 67.528.69
25 001-13 Microprocesadores 7772 MICRO AMD RYZEN 5-4500 PRO OEM FAN/4.0GHZ/4MB/AM4 $ 30.123.06
26 001-13 Microprocesadores 9214 MICRO INTEL CORE I7 10700F 2.9GHZ LGA 1200 $ 61.649.55
27 001-14 Memorias 12801 SODIMM DDR3 MEMOX 1GB 1333 MHZ DDR3 $ 2.069.36
28 001-14 Memorias 142941 DDR4 HIKVISION U1 4GB 2666 MHZ 1.2V CL19 COMBO $ 4.289.93
29 001-14 Memorias 1434 DDR4 CRUCIAL 16GB 2666 MHZ CB16GU2666 COMBO $ 11.069.53
30 001-14 Memorias 14340 DDR4 CRUCIAL BY MICRON 8GB 2666MHZ CB8GU2666 COMBO $ 6.477.18
31 001-14 Memorias 14342 DDR4 CRUCIAL BY MICRON 8GB 3200 MHZ CT8G4DFRA32A $ 6.586.94
32 001-14 Memorias 14397 DDR5 CRUCIAL 8GB 4800MHZ CT8G48C40U5 1.1V $ 11.447.20
33 001-14 Memorias 14398 DDR5 CRUCIAL 16GB 4800MHZ CT16G48C40U5 1.1V $ 22.436.33
34 001-14 Memorias 9971 DDR2 GENERICA 2GB 800 MHZ $ 1.814.44
35 001-16 Motherboards 12255 MOTHER BIOSTAR E1-6010 DDR3 C/RADEON A68N-2100K $ 15.109.34
36 001-16 Motherboards 12323 MOTHER GIGABYTE H410M-H V2 SOC LGA1200 $ 16.680.33
37 001-16 Motherboards 12327 MOTHER MSI H310M PRO-VDH SOC 1151/8º GEN/DDR4 $ 11.658.20
38 001-16 Motherboards 12329 MOTHER BIOSTAR TZ590-BTC DUO SOC1200 PCIE 8 1/10 S $ 72.498.30
39 001-16 Motherboards 1233 MOTHER GIGABYTE B560M DS3H V2 SOC 1200 $ 22.518.52
40 001-16 Motherboards 12330 MOTHER ASROCK H510 PRO BTC SOC1200/6 PCI-E/10ºGEN $ 69.298.24
41 001-16 Motherboards 12331 MOTHER GIGABYTE Z590 UD AC SOC 1200/ATX $ 41.845.47
42 001-16 Motherboards 12334 MOTHER ASUS H510M-E PRIME SOC 1200/DDR4/MICRO ATX $ 18.708.36
43 001-16 Motherboards 12335 MOTHER GIGABYTE H610M-S2H DDR4 SOC1700/MICRO ATX $ 20.250.42
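If you also want numeric prices and a separate product code, here is a minimal sketch of a possible follow-up step. It is not part of the solution above: the column names code, description and price and the helper parse_price are hypothetical, and it assumes the last dot in each price is the decimal separator (which holds here because the question replaced the decimal comma with a dot). The two sample rows are copied from the output above.

```python
import pandas as pd

# Two rows in the same shape as the printed frame (Col1 / Col2 / Col3).
df = pd.DataFrame(
    {
        "Col1": ["001-12 Discos rígidos", "001-13 Microprocesadores"],
        "Col2": [
            "1779 HD 1 TB SATA 3 WD BLUE 64MB WD10EZEX",
            "11821 MICRO INTEL CORE I9 12900/30MB/LGA 1700",
        ],
        "Col3": ["$ 9.041.78", "$ 129.068.54"],
    }
)

# Split the leading numeric code from the rest of the item text.
# Named groups become the column names of the extracted frame.
parts = df["Col2"].str.extract(r"^(?P<code>\d+)\s+(?P<description>.*)$")
df = df.join(parts)


def parse_price(text: str) -> float:
    """Turn a string like '$ 9.041.78' into 9041.78.

    Assumption: the last dot is the decimal separator and any
    earlier dots are thousands separators.
    """
    digits = text.replace("$", "").strip()
    whole, sep, cents = digits.rpartition(".")
    if not sep:
        # No dot at all: treat the value as a whole number.
        return float(digits)
    return float(whole.replace(".", "") + "." + cents)


df["price"] = df["Col3"].map(parse_price)
print(df[["Col1", "code", "description", "price"]])
```

Keeping the code as a string (rather than converting it to an integer) avoids losing any leading zeros, should the price list ever contain them.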