Regular expressions python - get only the description

Time:10-10

I am new to Python, and I am trying to use regular expressions to turn some PDFs into a DataFrame.

So far I have a list with this information:

list = ['9076968 ADT 10mg 60comp 22CN014A T E1 059366 5 2,72 1,97 1,56 0,0 0,01 6 1,57 7,85',
 '9076943 ADT 25mg 60comp 22CN061A T E1 059366 10 3,91 3,09 2,60 0,0 0,01 6 2,61 26,10',
 '3506888 Aerius 5mg 20comp W010992 T E1 094546 5 4,99 4,11 3,53 10,0 0,02 6 3,20 16,00',
 '9046755 Aldactone 25mg 60comp B28191 G E1 084399 10 5,42 4,51 3,90 22,0 0,02 6 3,06 30,60',
 '5282132 Aranka MG 3mg 0,03mg 63comp T21521A G E2 087961 5 8,22 6,51 5,47 12,5 0,03 6 4,82 24,10',
 '6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35',
 '5260542 Atorvastatina Azevedos MG 10mg 56comp 11400 T E1 094546 10 3,76 2,94 2,46 55,0 0,01 6 1,12 11,20',
 '5260559 Atorvastatina Azevedos MG 20mg 28comp 20515 T E1 059366 20 3,57 2,76 2,29 55,0 0,01 6 1,04 20,80',
 '5260575 Atorvastatina Azevedos MG 40mg 28comp 20516 T E1 059366 10 4,46 3,61 3,07 55,0 0,02 6 1,40 14,00',
 '5629506 Atozet 10mg 20mg 30comp W016401 N E5 093541 5 41,63 34,59 29,72 0,0 0,16 6 29,88 149,40',
 '7377390 Atyflor 10saq 124011 G NETT 087961 5 8,25 14,3 0,00 23 7,07 35,35',
 '2003093 Bebegel Gel Retal 6un 2206EA M NETT 024839 5 4,00 0,0 0,00 6 4,00 20,00',
 '8435701 Betadine Solucao Cutanea 125ml 326893 M NETT 084780 10 4,20 0,0 0,00 6 4,20 42,00',
 '2869584 Betamox Plus 875mg 125mg 16comp R017905R T E1 093541 30 6,34 5,39 4,71 60,0 0,02 6 1,90 57,00',
 '8184812 Betnovate 1mg/g Pomada 30g S63C N E1 022851 5 3,46 2,66 2,20 0,0 0,01 6 2,21 11,05',
 '2184992 Biloban 40mg 60comp rev R002315R T E2 059366 10 9,57 7,44 6,32 10,0 0,04 6 5,73 57,30',
 '5065487 Bisoprolol Sandoz MG 5mg 56comp LX8098 N E1 022851 5 5,01 4,13 3,55 0,0 0,02 6 3,57 17,85',
 '5138276 Buprenorfina Azevedos MG 8mg 7comp (P) 22E16 T E3 087485 30 11,15 8,83 7,42 5,0 0,04 6 7,09 212,70',
 '3126489 Calcitab 1500mg 60comp EQ22502 N E1 054786 5 6,29 5,34 4,66 0,0 0,02 6 4,68 23,40',
 '9729509 Cartia 100mg 28comp 20015 G E1 022851 30 5,41 4,13 3,55 45,0 0,02 6 1,97 59,10',
 '5037288 Ciprofloxacina Azevedos MG 500mg 16comp 11496 T E3 054786 5 10,87 8,57 7,18 70,0 0,04 6 2,19 10,95',
 '5273487 Co-Diovan Forte 160mg 25mg 28comp TRM93 N E2 022851 5 8,10 6,40 5,37 0,0 0,03 6 5,40 27,00',
 '8287607 Cordarone 200 mg x 60 Comprimidos 2R362 N E3 022851 5 11,36 9,03 7,61 0,0 0,04 6 7,65 38,25',
 '5440284 Coversyl 5mg 30comp rev 711191 N E1 022851 10 6,47 5,52 4,83 0,0 0,02 6 4,85 48,50',
 '5627781 Cozaar Plus 100mg   12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95']

I want to grab the description from every line. It starts at index 8, after the 7 digits and one space, and stops at the space before the single letter that can be T, N, G, or M.

Example: 5627781 Cozaar Plus 100mg 12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95

  • Cozaar Plus 100mg 12,5mg 28comp W020945, or better, Cozaar Plus 100mg 12,5mg 28comp

-> W020945 is the lot information, but it does not follow a standard format on every line.

I tried something like this, but it doesn't work:

description_re = re.compile(r'\d{7}\s[A-Za-z]+\s[TNGM]$')

Thanks

CodePudding user response:

Positive lookbehinds and lookaheads will help you out:

(?<=\d{7} ).*?(?= \w+ [TNGM] )

regex101
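
For reference, here is a minimal sketch of how that lookaround pattern could be applied in Python. It assumes the lookahead reads `\w+` (one or more word characters for the lot); the names `LOOKAROUND_RE`, `describe`, and `results` are made up for the illustration. Note that a lot containing a hyphen, such as S-02, is not matched by `\w+`, so that line yields no match.

```python
import re

# Assumed form of the pattern: the lot is one or more word characters (\w+).
LOOKAROUND_RE = re.compile(r'(?<=\d{7} ).*?(?= \w+ [TNGM] )')

sample_lines = [
    '9076968 ADT 10mg 60comp 22CN014A T E1 059366 5 2,72 1,97 1,56 0,0 0,01 6 1,57 7,85',
    '6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35',
]

def describe(line):
    """Return the text between the 7-digit code and the lot, or None if no match."""
    match = LOOKAROUND_RE.search(line)
    return match.group(0) if match else None

results = [describe(line) for line in sample_lines]
print(results)  # ['ADT 10mg 60comp', None]
```

The `None` for the second line is why a looser lot pattern may be needed for this data.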

CodePudding user response:

You can use a capture group and word boundaries:

\b\d{7}\s(.*?)\s[TNGM]\b

Regex demo
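
A short sketch of using that capture group from Python, in case it helps. The names `CAPTURE_RE`, `sample_lines`, and `descriptions` are only illustrative. With this pattern, group 1 still includes the lot, because the match stops at the first standalone T, N, G, or M.

```python
import re

# Group 1 is everything between the 7-digit code and the standalone flag letter.
CAPTURE_RE = re.compile(r'\b\d{7}\s(.*?)\s[TNGM]\b')

sample_lines = [
    '5627781 Cozaar Plus 100mg 12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95',
    '5260542 Atorvastatina Azevedos MG 10mg 56comp 11400 T E1 094546 10 3,76 2,94 2,46 55,0 0,01 6 1,12 11,20',
]

descriptions = []
for line in sample_lines:
    match = CAPTURE_RE.search(line)
    # Keep None for lines that do not follow the expected layout.
    descriptions.append(match.group(1) if match else None)

print(descriptions)
```

The word boundary after `[TNGM]` is what keeps "MG" in the second line from being mistaken for the flag letter.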

CodePudding user response:

I modified the regex from answer 1. Here is the result:

(?<=\d{7}\s).*(?=\s\S+\s[TNGM]\s)

And demo
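
A rough sketch of this variant in Python, assuming the lookahead uses `\S+` so the lot can be any run of non-space characters (for example S-02). The helper `extract_description` and the whitespace cleanup are my own additions for illustration, not part of the answer's pattern.

```python
import re

# Greedy match up to the last " <lot> <flag> " sequence; the lot is any non-space run.
GREEDY_RE = re.compile(r'(?<=\d{7}\s).*(?=\s\S+\s[TNGM]\s)')

def extract_description(line):
    """Return the description without the lot, or None if the line does not match."""
    match = GREEDY_RE.search(line)
    if match is None:
        return None
    # Collapse repeated spaces left over from the PDF extraction (an added cleanup step).
    return ' '.join(match.group(0).split())

sample_lines = [
    '6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35',
    '5138276 Buprenorfina Azevedos MG 8mg 7comp (P) 22E16 T E3 087485 30 11,15 8,83 7,42 5,0 0,04 6 7,09 212,70',
    '5627781 Cozaar Plus 100mg   12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95',
]

descriptions = [extract_description(line) for line in sample_lines]
print(descriptions)
```

On these three lines the lot is dropped in each case, including the hyphenated S-02.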
