Capturing salary from a string using capture groups (python, regex)

Time:11-05

My goal is to extract the salary from a string and to detect whether that salary comes from a collective agreement. I have come up with the following regex:

pattern = "([Kk]ollektivvertragliche[sn]|Kollektivvertrag|[Cc]ollective [Aa]greement|[Kk]ollektivvertr|KV) .* ([0-9]{1,4}[.,][0-9]{2,3}[,]*[0-9]*) .* ([0-9]{1,4}[.,][0-9]{2,3}[,]*[0-9]*)"
  • ([Kk]ollektivvertragliche[sn]|Kollektivvertrag|[Cc]ollective [Aa]greement|[Kk]ollektivvertr|KV) - first group, that captures if salary is defined according to collective agreement.
  • ([0-9]{1,4}[.,][0-9]{2,3}[,]*[0-9]*) - is a salary of the form xx.yyy,zz/xx.yyy,zz/x.yyy/xxx (for example 21.950,13/1.859,20/1.700/700 $)
  • .* between the collective agreement and the salary matches any run of characters.
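
As a quick check of the salary sub-pattern described above, the sketch below runs it with re.fullmatch against the four example forms. The names NUMBER, samples and results are made up for illustration. One thing it shows is that a bare 700 is not matched, because the sub-pattern always requires a '.' or ',' separator.

```python
import re

# The salary sub-pattern from the question, unchanged.
NUMBER = re.compile(r"[0-9]{1,4}[.,][0-9]{2,3}[,]*[0-9]*")

samples = ["21.950,13", "1.859,20", "1.700", "700"]

# fullmatch succeeds only if the whole string fits the sub-pattern.
results = {s: bool(NUMBER.fullmatch(s)) for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {'match' if ok else 'no match'}")
```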

I have tested it, and it works well when all groups are present:

import re

t = 'Entlohnung nach Caritas Kollektivvertrag : __ [lineBreak] Mindestgehalt in Gehaltsstufe 1 - Verwendungsgruppe VI , dzt. EUR 1.849,90 bis 1.900,32 '
r = re.search(pattern, t)
print(r.groups())

But if one of the groups (for example the collective agreement or a salary) is missing, the pattern doesn't match at all.
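
A short sketch of why this happens: the pattern treats all three groups as mandatory, joined by literal spaces and .*, so re.search returns None as soon as one part is absent. The variable names full, no_agreement and one_salary are illustrative, and the pattern is only split across lines for readability.

```python
import re

# Same pattern as in the question, split over three lines.
pattern = (
    r"([Kk]ollektivvertragliche[sn]|Kollektivvertrag|[Cc]ollective [Aa]greement|[Kk]ollektivvertr|KV)"
    r" .* ([0-9]{1,4}[.,][0-9]{2,3}[,]*[0-9]*)"
    r" .* ([0-9]{1,4}[.,][0-9]{2,3}[,]*[0-9]*)"
)

full = ("Entlohnung nach Caritas Kollektivvertrag : __ [lineBreak] Mindestgehalt in "
        "Gehaltsstufe 1 - Verwendungsgruppe VI , dzt. EUR 1.849,90 bis 1.900,32 ")
no_agreement = "EUR 35.362,00 Jahresbrutto"                                # no group 1
one_salary = "laut Kollektivvertrag beträgt € 1.597,72 brutto pro Monat"   # no group 3

for text in (full, no_agreement, one_salary):
    m = re.search(pattern, text)
    # Prints the groups for the complete string and None for the other two.
    print(m.groups() if m else None)
```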

string examples:
#t = 'EUR 35.362,00 Jahresbrutto'
#t = '2.800 brutto/Monat'
#t = 'laut Kollektivvertrag beträgt € 1.597,72 brutto pro Monat auf Basis Vollzeitbeschäftigung.'
#t = 'KV-Mindestgehalt von monatlich € 1.671,00'
#t = 'kollektivvertragliches Mindestgehalt von € 2.026,88 brutto pro Monat'
#t = 'Bruttojahreseinkommen ab € 50.000,'
#t = 'ein KV-Mindestlohn von EUR 1.277,00 brutto pro Monat'
#t = 'beträgt jedoch mindestens € 25.480'
#t = 'Gehalt lt. BAGS-KV €\xa02.100,78 brutto'
#t = 'kollektivvertraglicher Mindestgehalt EUR 33.000 Brutto/Jahr'
#t = '\nLohn/Gehalt ab EUR 2500,00 brutto monatlich,'
#t = 'KV IT __EUR 2.302'
#t = '25 Wochenstunden EUR [lineBreak] 1.641,91 bis EUR 1.859,21 brutto'
#t = 'Entlohnung nach Caritas Kollektivvertrag: __ [lineBreak] Mindestgehalt in Gehaltsstufe 1 - Verwendungsgruppe VI , dzt. EUR 1.849,90'
#t = 'The position is remunerated according to the Kollektivvertrag for Austrian Universities, i.e., the salary amounts to at least 38.230EUR/year before taxes'
#t = 'Erfahrung bieten wir ein Bruttojahresgehalt ab EUR 36.400.'

I have tried to implement optional groups with the help of ? and ?: (as in the post "python regex optional capture group"), but that didn't work either.

Desired output: (group_1_result,group_2_result,group_3_result).

If a group is missing, I would like to have None in place of "group_n_result".

CodePudding user response:

It seems capturing several groups matching the same pattern requires finditer. The best I could achieve is this (with some tweaking you should be able to fit it to your needs):

import re

pattern1 = re.compile(r"([Kk]ollektivvertragliche[snr]|Kollektivvertrag|[Cc]ollective [Aa]greement|[Kk]ollektivvertr|KV)")
# salary: 1-4 digits, '.' or ',', 2-3 digits, then any trailing commas/digits
pattern2 = re.compile(r"([0-9]{1,4}[.,][0-9]{2,3}[,0-9]*)")


t = ''''Entlohnung nach Caritas Kollektivvertrag : __ [lineBreak] Mindestgehalt in Gehaltsstufe 1 - Verwendungsgruppe VI , dzt. EUR 1.849,90 bis 1.900,32 '
#t = 'EUR 35.362,00 Jahresbrutto'
#t = '2.800 brutto/Monat'
#t = 'laut Kollektivvertrag beträgt € 1.597,72 brutto pro Monat auf Basis Vollzeitbeschäftigung.'
#t = 'KV-Mindestgehalt von monatlich € 1.671,00'
#t = 'kollektivvertragliches Mindestgehalt von € 2.026,88 brutto pro Monat'
#t = 'Bruttojahreseinkommen ab € 50.000,'
#t = 'ein KV-Mindestlohn von EUR 1.277,00 brutto pro Monat'
#t = 'beträgt jedoch mindestens € 25.480'
#t = 'Gehalt lt. BAGS-KV €\xa02.100,78 brutto'
#t = 'kollektivvertraglicher Mindestgehalt EUR 33.000 Brutto/Jahr'
#t = '\nLohn/Gehalt ab EUR 2500,00 brutto monatlich,'
#t = 'KV IT __EUR 2.302'
#t = '25 Wochenstunden EUR [lineBreak] 1.641,91 bis EUR 1.859,21 brutto'
#t = 'Entlohnung nach Caritas Kollektivvertrag: __ [lineBreak] Mindestgehalt in Gehaltsstufe 1 - Verwendungsgruppe VI , dzt. EUR 1.849,90'
#t = 'The position is remunerated according to the Kollektivvertrag for Austrian Universities, i.e., the salary amounts to at least 38.230EUR/year before taxes'
#t = 'Erfahrung bieten wir ein Bruttojahresgehalt ab EUR 36.400.'
'''
for line in t.splitlines():
    found = []
    r = pattern1.search(line)
    found.append(r[0] if r else None)  # None when no agreement keyword is present
    
    for match in pattern2.finditer(line):
        found.append(match[0])

    print(found)

# ['Kollektivvertrag', '1.849,90', '1.900,32']
# [None, '35.362,00']
# [None, '2.800']
# ['Kollektivvertrag', '1.597,72']
# ['KV', '1.671,00']
# ['kollektivvertragliches', '2.026,88']
# [None, '50.000,']
# ['KV', '1.277,00']
# [None, '25.480']
# ['KV', '2.100,78']
# ['kollektivvertraglicher', '33.000']
# [None] (this one comes from \n ; such empty lines can be filtered out)
# [None, '2500,00']
# ['KV', '2.302']
# [None, '1.641,91', '1.859,21']
# ['Kollektivvertrag', '1.849,90']
# ['Kollektivvertrag', '38.230']
# [None, '36.400']
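
To get the fixed-length tuple asked for in the question, the list built above can be padded with None. This is a minimal sketch, assuming at most two salaries per string are of interest. The function name extract_salary and the slightly simplified number pattern SALARY are illustrative choices, not part of the answer above.

```python
import re

AGREEMENT = re.compile(
    r"[Kk]ollektivvertragliche[snr]|Kollektivvertrag|[Cc]ollective [Aa]greement|[Kk]ollektivvertr|KV"
)
# Assumed number pattern: 1-4 digits, '.' or ',', 2-3 digits, optional ',dd' decimals.
SALARY = re.compile(r"[0-9]{1,4}[.,][0-9]{2,3}(?:,[0-9]+)?")


def extract_salary(text):
    """Return (agreement, first_salary, second_salary), with None where missing."""
    agreement = AGREEMENT.search(text)
    salaries = SALARY.findall(text)[:2]       # keep at most two amounts
    salaries += [None] * (2 - len(salaries))  # pad missing amounts with None
    return (agreement[0] if agreement else None, *salaries)


print(extract_salary("dzt. Kollektivvertrag EUR 1.849,90 bis 1.900,32"))
print(extract_salary("EUR 35.362,00 Jahresbrutto"))
print(extract_salary("KV IT __EUR 2.302"))
```

Every call returns exactly three items, so the result can be unpacked directly, for example agreement, low, high = extract_salary(line).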

CodePudding user response:

Not sure if #t = ' is part of the string that you want in your return values, but using 3 capture groups where the first 2 are optional:

^.*?(?:\b([Kk]ollektivvertragliche[sn]|Kollektivvertrag|[Cc]ollective [Aa]greement|[Kk]ollektivvertr|KV)\b(.*?))?([0-9]{1,4}([.,])\d{2,3}(?:(?!\4)[.,]\d+)?)


import re

pattern = r"^.*?(?:\b([Kk]ollektivvertragliche[sn]|Kollektivvertrag|[Cc]ollective [Aa]greement|[Kk]ollektivvertr|KV)\b(.*?))?([0-9]{1,4}([.,])\d{2,3}(?:(?!\4)[.,]\d+)?)"
s = ("#t = 'EUR 35.362,00 Jahresbrutto'\n"
     "#t = '2.800 brutto/Monat'\n"
     "#t = 'laut Kollektivvertrag beträgt € 1.597,72 brutto pro Monat auf Basis Vollzeitbeschäftigung.'\n"
     "#t = 'KV-Mindestgehalt von monatlich € 1.671,00'\n"
     "#t = 'kollektivvertragliches Mindestgehalt von € 2.026,88 brutto pro Monat'\n"
     "#t = 'Bruttojahreseinkommen ab € 50.000,'\n"
     "#t = 'ein KV-Mindestlohn von EUR 1.277,00 brutto pro Monat'\n"
     "#t = 'beträgt jedoch mindestens € 25.480'\n"
     "#t = 'Gehalt lt. BAGS-KV €\\xa02.100,78 brutto'\n"
     "#t = 'kollektivvertraglicher Mindestgehalt EUR 33.000 Brutto/Jahr'\n"
     "#t = '\\nLohn/Gehalt ab EUR 2500,00 brutto monatlich,'\n"
     "#t = 'KV IT __EUR 2.302'\n"
     "#t = '25 Wochenstunden EUR [lineBreak] 1.641,91 bis EUR 1.859,21 brutto'\n"
     "#t = 'Entlohnung nach Caritas Kollektivvertrag: __ [lineBreak] Mindestgehalt in Gehaltsstufe 1 - Verwendungsgruppe VI , dzt. EUR 1.849,90'\n"
     "#t = 'The position is remunerated according to the Kollektivvertrag for Austrian Universities, i.e., the salary amounts to at least 38.230EUR/year before taxes'\n"
     "#t = 'Erfahrung bieten wir ein Bruttojahresgehalt ab EUR 36.400.")
matches = re.finditer(pattern, s, re.MULTILINE)

for m in matches:
    print((m.group(1), m.group(2), m.group(3)))

Output

(None, None, '35.362,00')
(None, None, '2.800')
('Kollektivvertrag', ' beträgt € ', '1.597,72')
('KV', '-Mindestgehalt von monatlich € ', '1.671,00')
('kollektivvertragliches', ' Mindestgehalt von € ', '2.026,88')
(None, None, '50.000')
('KV', '-Mindestlohn von EUR ', '1.277,00')
(None, None, '25.480')
('KV', ' €\\xa', '02.100,78')
(None, None, '33.000')
(None, None, '2500,00')
('KV', ' IT __EUR ', '2.302')
(None, None, '1.641,91')
('Kollektivvertrag', ': __ [lineBreak] Mindestgehalt in Gehaltsstufe 1 - Verwendungsgruppe VI , dzt. EUR ', '1.849,90')
('Kollektivvertrag', ' for Austrian Universities, i.e., the salary amounts to at least ', '38.230')
(None, None, '36.400')
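
The ([.,]) group together with (?!\4) is what stops a second separator from being consumed when it is the same character as the first one. Below is a small sketch of just that piece in isolation, renumbered so that the separator is group 2. The name NUMBER and the test strings are illustrative.

```python
import re

# Number part of the pattern above on its own. Group 2 remembers the first
# separator; (?!\2) only lets a *different* separator introduce the decimals.
NUMBER = re.compile(r"([0-9]{1,4}([.,])\d{2,3}(?:(?!\2)[.,]\d+)?)")

for text in ("1.849,90", "50.000,", "2500,00", "1.234.567"):
    m = NUMBER.search(text)
    print(text, "->", m.group(1))
```

In this sketch a trailing comma with no digits after it is left out of the match, and a repeated separator such as the second dot in 1.234.567 ends the match early.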