extracting words from text , ValueError: invalid literal for int() with base 10: ''

Time:11-09

I am trying to extract words from a text. I have this text:

"[' \n\na)\n\n \n\nFactuur\nVerdi Import Schoolfruit\nFactuur nr. : 71201 Koopliedenweg 33\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 10-12-21\nAantal Omschrijving Prijs Bedrag\nOrder number : 77553 Loading date : 09-12-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK50\nD.C. Schoolfruit\n16 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 123,20\n360 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 2.772,00\n6 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,/0 € 46,20\n75  Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 577,50\n9 Watermeloenen Quetzali 16kg 4 IMPERIAL BR I € 7,70 € 69,30\n688 Appels Royal Gala 13kg 60/65 Generica PL I € 5,07 € 3.488,16\n22  Sinaasappels Valencias 15kg 105 Elara ZAI € 6,25 € 137,50\n80 Sinaasappels Valencias 15kg 105 Elara ZAI € 6,25 € 500,00\n160 Sinaasappels Valencias 15kg 105 FVC ZAI € 6,25 € 1.000,00\n320 Sinaasappels Valencias 15kg 105 Generica ZAI € 6,25 € 2.000,00\n160 Sinaasappels Valencias 15kg 105 Noordhoek ZA I € 6,25 € 1.000,00\n61  Sinaasappels Valencias 15kg 105 Noordhoek ZA I € 6,25 € 381,25\nTotaal Colli Totaal Netto Btw Btw Bedrag Totaal Bedrag\n€ 12.095,11 1.088,56\nBetaling binnen 30 dagen\nAchterstand wordt gemeld bij de kredietverzekeringsmaatschappij\nVerDi Import BV ING Bank NV. Rotterdam IBAN number: NL17INGB0006959173 ~~\n\n \n\nKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands SWIFT/BIC: INGBNL2A, VAT number: NL851703884B01 i\nTel,  31 (0}1 80 61 88 11, Fax  31 (0)1 8061 88 25 Chamber of Commerce Rotterdam no. 55424309 VerDi\n\nE-mail: [email protected], www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.\n\nrut ard wegetables\n\x0c']"

and I have this method:

import re

def total_fruit_per_sort():
    number_found = re.findall(total_amount_fruit_regex(), verdi47)
    print(number_found)
    fruit_dict = {}
    for n, f in number_found:
        # Add this line's amount to the running total for the fruit
        fruit_dict[f] = fruit_dict.get(f, 0) + int(n)
    # Note: this returns a mapping of total -> fruit name
    return {value: key for key, value in fruit_dict.items()}

def total_amount_fruit_regex(format_=re.escape):

    return r"(\d*(?:\.\d+)*)\s*(" + '|'.join(format_(word)
                                             for word in fruit_words) + ')'

and the fruit_words:

fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
               'Tomaten Cherry', 'Sinaasappels',
               'Watermeloenen', 'Rettich', 'Peren', 'Peen', 'Mandarijnen', 'Meloenen', 'Grapefruit']

and then the print returns this:

[('16', 'Watermeloenen'), ('360', 'Watermeloenen'), ('6', 'Watermeloenen'), ('75', 'Watermeloenen'), ('9', 'Watermeloenen'), ('688', 'Appels'), ('22', 'Sinaasappels'), ('80', 'Sinaasappels'), ('160', 'Sinaasappels'), ('320', 'Sinaasappels'), ('160', 'Sinaasappels'), ('61', 'Sinaasappels')]

So this is correct.

But then I have this text:

"['a= (>)\n\nFactuur\nVerdi Import Schoolfruit\nFactuur nr; %: 70273 Koopliedenweg 38\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 19-11-21\nAantal Omschrijving Prijs Bedrag\nOrder number : 76372 Loading date : 15-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK46\nVerdi Import Schoolfruit\n566 Ananas Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00\n706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40\nOrder number : 76462 Loading date : 18-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK47\nD.C, Schoolfruit\n176 Sinaasappels Valencias 15kg 125 Generica UY Klasse I € 6,25 € 1.100,00\n179 Peen Waspeen 14x1kg 200-400 Generica BE Klasse I € 6,30 € 1.127,70\n222 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 € 1.398,60\n270 Peen Waspeen 14x1ikg 200-400 Generica BE Klasse I € 6,30 € 1.701,00\nZuid\n176 sinaas\n222 wortel\nmidden\n270 wortel\nNoord\n179 wortel\nOrder number : 75674 Loading date : 18-11-21 Incoterm: : FRA\nYour ref. : SCHOOLFRUIT Delivery date :\nWK47\nD.C. Schoolfruit\n400 Rettich Klein x20 10kg 20 GENER DE Klasse I € 4,70 € 1.880,00\n129 Rettich Klein x20 10kg 20 GENER DE Klasse I € 4,70 € 606,30\n48 Rettich Klein x20 10kg 20 GENER IT Klasse I € 4,70 € 225,60\n104 = Rettich Klein x20 10kg 20 GENER IT Klasse I € 4,70 € 488,80\n22 =Rettich Klein x20 10kg 20 Viva IT Klasse I € 4,70 € 103,40\n107 ~=Rettich Klein x20 10kg 20 Viva IT Klasse I € 4,70 € 502,90\n160 Sinaasappels Valencias 15kg 125 ALG ZA Klasse I € 7,50 € 1.200,00\n6 Sinaasappels Valencias 15kg 125 ALG ZA Klasse I € 7,50 € 45,00\n320 Sinaasappels Valencias 15kg 125 FVC ZA Klasse I € 7,50 € 2.400,00\nREGIO\nSINAAS\nMIDDEN: 219\nNOORD: 267\nVerDi Import BV ING Bank N.V. 
Rotterdam IBAN number: NL17INGB0006959173 aoethe\nKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands SWIFT/BIC: INGBNL2A, VAT number: NL851703884B01\n\na\nTel.  31 (0)1 80 61 88 11, Fax  31 (0)1 80 61 88 25 Chamber of Commerce Rotterdam no. 55424309 VerDi\nE-mail: [email protected], www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction.\n\nfrult and wegetadles\n\n \n\x0c', 'a> >)\n\nFactuur\nVerdi Import Schoolfruit\nFactuur nr. : 70273 Koopliedenweg 38\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum ; 19-11-21\nAantal Omschrijving Prijs Bedrag\nRETTICH:\nZUID: 216\nNOORD: 328\nMIDDEN: 266\nTotaal Colli Totaal Netto Btw Btw Bedrag Totaal Bedrag\n\n     \n \n\n€ 23.812,78 € 25.955,93\n\n   \n\nBetaling binnen 30 dagen\nAchterstand wordt gemeld bij de kredietverzekeringsmaatschappij\n\nVerDi Import BV ING Bank N.V. Rotterdam IBAN number: NL17INGBO006959173 =\nKoopliedenweg 38, 2991 LN Barendrecht, The Netherlands SWIFT/BIC: INGBNL2A, VAT number: NL851703884B01 7\nTel.  31 (0)1 80 61 88 11, Fax  31 (0)1 80 61 88 25 Chamber of Commerce Rotterdam no. 55424309 VerD\nE-mail: [email protected], www.verdiimport.nl Dutch law shall apply. The Rotterdam District Court shall have exclusive jurisdiction. l\n\nfrutt and vegetables:\n\n \n\x0c']"

and it returns this:

[('566', 'Ananas'), ('706', 'Appels'), ('598', 'Peen Waspeen'), ('176', 'Sinaasappels'), ('179', 'Peen Waspeen'), ('222', 'Peen Waspeen'), ('270', 'Peen Waspeen'), ('400', 'Rettich'), ('129', 'Rettich'), ('48', 'Rettich'), ('', 'Rettich'), ('', 'Rettich'), ('', 'Rettich'), ('160', 'Sinaasappels'), ('6', 'Sinaasappels'), ('320', 'Sinaasappels')]

So Rettich has several empty values, and calling int() on an empty string raises the ValueError from the title.
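A minimal reproduction of where the empty strings come from. This is a sketch that assumes a pattern reduced to a single fruit name, and the two sample lines are trimmed copies of the invoice lines above. Because `\d*` may match zero digits, the regex can still match when a stray OCR character such as `=` sits between the number and the fruit name. It simply starts the match after that character and captures an empty number.

```python
import re

# Reduced pattern with one fruit name (assumption: the real pattern joins all fruit_words)
pattern = r"(\d*(?:\.\d+)*)\s*(Rettich)"

clean_line = "48 Rettich Klein x20 10kg"
noisy_line = "104 = Rettich Klein x20 10kg"  # stray '=' introduced by OCR

clean_match = re.findall(pattern, clean_line)
noisy_match = re.findall(pattern, noisy_line)
print(clean_match)  # the number is captured
print(noisy_match)  # the number group is empty

# Converting the empty capture is what raises the ValueError
try:
    int(noisy_match[0][0])
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```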

Question: how can I improve this so that all the values are also extracted from the second text?

CodePudding user response:

You need to change the regex to allow an optional = or ~= between the number and the fruit name.

def total_amount_fruit_regex(format_=re.escape):
    return r"(\d*(?:\.\d+)*)\s*(?:=|~=)?\s*(" + '|'.join(
        format_(word) for word in fruit_words) + ')'
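The following is a self-contained sketch of how that pattern could be combined with the summing step. It is one possible arrangement, not necessarily the asker's final code. The assumptions: the text is passed in as a `text` parameter (hypothetical; the original reads a global), `\d+` is used instead of `\d*` so a match always carries at least one digit, dots inside a number are treated as thousands separators and removed before `int()`, and the sample lines are trimmed copies of the invoice lines.

```python
import re

fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
               'Tomaten Cherry', 'Sinaasappels',
               'Watermeloenen', 'Rettich', 'Peren', 'Peen',
               'Mandarijnen', 'Meloenen', 'Grapefruit']


def total_amount_fruit_regex(format_=re.escape):
    names = '|'.join(format_(word) for word in fruit_words)
    # \d+ requires a digit; (?:=|~=)? tolerates the stray OCR characters
    return r"(\d+(?:\.\d+)*)\s*(?:=|~=)?\s*(" + names + ")"


def total_fruit_per_sort(text):
    totals = {}
    for number, fruit in re.findall(total_amount_fruit_regex(), text):
        # Assumption: '.' is a thousands separator, so drop it before converting
        totals[fruit] = totals.get(fruit, 0) + int(number.replace('.', ''))
    return totals


sample = (
    "48 Rettich Klein x20 10kg 20 GENER IT Klasse I\n"
    "104 = Rettich Klein x20 10kg 20 GENER IT Klasse I\n"
    "22 =Rettich Klein x20 10kg 20 Viva IT Klasse I\n"
    "107 ~=Rettich Klein x20 10kg 20 Viva IT Klasse I\n"
    "179 Peen Waspeen 14x1kg 200-400 Generica BE Klasse I\n"
)

totals = total_fruit_per_sort(sample)
print(totals)
```

Note that 'Peen Waspeen' is listed before 'Peen' in `fruit_words`, so the alternation tries the longer name first. If other stray characters show up in later scans, the optional group would need to be widened accordingly.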