How to complete for loop with pdfplumber?

Time:09-27

Problem

I was following this tutorial: https://www.youtube.com/watch?v=eTz3VZmNPSE&list=PLxEus0qxF0wciRWRHIRck51EJRiQyiwZT&index=16

when my code returned the error shown below.

Goal

I need to scrape a pdf that looks like this (I wanted to attach the pdf but I do not know how):

170001WO01 
English (US) into Arabic (DZ) 
Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95 
TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50
TM - Exact Match 353,00 Words 0,100 35,30

Approach

I am following the aforementioned tutorial, using pdfplumber.

import re
import pdfplumber 
import PyPDF2
import pandas as pd
from collections import namedtuple 
ap = open('test.pdf', 'rb')

I name the columns of the dataframe that I want as the final product.

Serv = namedtuple('Serv', 'case_number language num_trans num_fuzzy num_exact')

Issues

I have 5 different lines compared to the tutorial example which has 2.

case_li = re.compile(r'(\d{6}\w{2}\d{2})')
language_li = re.compile(r'(nglish \(US\) into )(.*)')
trans_li = re.compile(r'(Trans./Edit/Proof.              )(\d{2}\.\d{3})')
fuzzy_li = re.compile(r'(TM - Fuzzy Match                )(\d{1}\.\d{3})')
exact_li = re.compile(r'(M - Exact Match                )(\d{3})')

Issue

When I introduce the third line in the code, I get an error that I do not understand. I have modified the code as 2e0byo suggested, but I still get an error.

This is the new code:

line_items = []
with pdfplumber.open(ap) as pdf:
    page = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            
            line = case_li.search(line)
            if line:
                case_number = line
            
            line = language_li.search(line)
            if line:
                language = line.group(2)
            
            line = trans_li.search(line)
            if line:
                num_trans = line.group(2)
            
            line = fuzzy_li.search(line)
            if line:
                num_fuzzy = line.group(2)
            
            line = exact_li.search(line)
            if line:
                num_exact = line.group(2)
                
            line_items.append(Serv(case_number, language, num_trans, num_fuzzy, num_exact))

and this is the new error:

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13992/1572426536.py in <module>
     10                 case_number = line
     11 
---> 12             line = language_li.search(line)
     13             if line:
     14                 language = line.group(2)

TypeError: expected string or bytes-like object
Goal

The goal is to append each parsed record to line_items and eventually build a dataframe:

df = pd.DataFrame(line_items)

CodePudding user response:

You have reassigned line, here:

for line in text.split("\n"):
    # line is a str (the line)
    line = language_li.search(line)
    # line is no longer a str, but the result of a re.search

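so line is no longer the text line, but the result of that search: a re.Match object, or None if nothing matched. The next pattern's search(line) therefore receives something that is not a string, which is exactly what the TypeError in your traceback is complaining about.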
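To see the failure in isolation, here is a minimal sketch that needs no PDF. The pattern is a simplified stand-in for the one in the question (an assumption on my part), and the sample line is copied from the question.

```python
import re

# Simplified stand-in for the question's language_li pattern (assumption)
language_li = re.compile(r"English \(US\) into (.*)")

line = "English (US) into Arabic (DZ)"
line = language_li.search(line)    # line is now a re.Match, not a str
result_type = type(line).__name__
print(result_type)

try:
    language_li.search(line)       # passing a Match (or None) instead of a str
except TypeError as exc:
    error_message = str(exc)
    print(error_message)           # begins "expected string or bytes-like object"
```

The same TypeError is raised when the first search finds nothing, because None is not a string either.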

To fix your code, adopt a consistent pattern:

for line in text.split("\n"):
    match = language_li.search(line)
    # line is still a str (the line)
    # match is the result of re.search
    if match:
        do_something(match.groups())
        ...

    # line is *still* a str, so the next pattern can search it too
    match = trans_li.search(line)
    if match:
        ...
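Applied to the sample lines from the question, that pattern might look like the sketch below. Several things here are my assumptions rather than anything from the original code: the patterns use \s+ instead of a fixed run of spaces, the numbers are captured whole in the 22.117,00 format, the PATTERNS dict and parse_lines helper are hypothetical names, and a record is treated as complete once all five fields have been seen. A real PDF with a different layout would need different rules.

```python
import re
from collections import namedtuple

Serv = namedtuple("Serv", "case_number language num_trans num_fuzzy num_exact")

# Assumed number format, e.g. 22.117,00 (dot for thousands, comma for decimals)
NUM = r"([\d.]+,\d+)"

# Hypothetical patterns, one per field, each with a single capture group
PATTERNS = {
    "case_number": re.compile(r"^(\d{6}\w{2}\d{2})\s*$"),
    "language": re.compile(r"English \(US\) into (.*)"),
    "num_trans": re.compile(r"Trans\./Edit/Proof\.\s+" + NUM),
    "num_fuzzy": re.compile(r"TM - Fuzzy Match\s+" + NUM),
    "num_exact": re.compile(r"TM - Exact Match\s+" + NUM),
}


def parse_lines(text):
    """Collect one Serv per group of five matching lines (assumed layout)."""
    line_items = []
    record = {}
    for line in text.split("\n"):
        for field, pattern in PATTERNS.items():
            match = pattern.search(line)   # line stays a str throughout
            if match:
                record[field] = match.group(1).strip()
                break
        # Assumption: a record is complete once every field has been seen
        if len(record) == len(PATTERNS):
            line_items.append(Serv(**record))
            record = {}
    return line_items


# Sample lines copied from the question, standing in for page.extract_text()
sample_text = "\n".join([
    "170001WO01 ",
    "English (US) into Arabic (DZ) ",
    "Trans./Edit/Proof. 22.117,00 Words 1,350 29.857,95 ",
    "TM - Fuzzy Match 2.941,00 Words 0,500 1.470,50",
    "TM - Exact Match 353,00 Words 0,100 35,30",
])

items = parse_lines(sample_text)
print(items)
```

Appending only when a record is complete also avoids a NameError in the original loop, which calls Serv(...) on every line, including lines read before all five variables have been assigned.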

For completeness' sake, with the dreaded walrus operator you can now write this:

if (match := language_li.search(line)) is not None:
    do_something(match.groups())

Which I briefly thought was neater, but now think ugly. I fully expect to get downvoted just for mentioning the walrus operator. (If you look at the edit history of this post you will see that I have even forgotten how to use it and wrote it backwards first.)
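One detail worth checking if you do use the walrus operator (Python 3.8 or later) is precedence: the comparison binds more tightly than :=, so the assignment needs its own parentheses. A small sketch, again with a simplified stand-in pattern (an assumption) and a sample line from the question:

```python
import re

# Simplified stand-in for the question's language_li pattern (assumption)
language_li = re.compile(r"English \(US\) into (.*)")
line = "English (US) into Arabic (DZ)"

# Without parentheses, `is not None` is evaluated first,
# so the name is bound to a bool rather than to the match
if wrong := language_li.search(line) is not None:
    print(type(wrong).__name__)

# With parentheses, the name is bound to the match itself
if (match := language_li.search(line)) is not None:
    language = match.group(1)
    print(language)
```

Since a successful match is always truthy, writing just if (match := language_li.search(line)): behaves the same way here.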

PS: you may wish to read up on variable scope in Python, although scope would not have saved you here: rebinding a loop variable inside the loop body is perfectly legal, and the new value simply replaces the old one for the rest of that iteration. Incidentally, doing this kind of thing by mistake is why, by convention, we avoid similarly named variables (like line and Line) and go with things like line and match instead.
