Remove feminine endings in German using regex in Python

In German, the feminine endings are ['/innen','/in','/Innen','/In','Innen','In','innen']. I want to remove them from the strings in a list.

I have come up with the following:

import re

rm_gender = ['/innen','/in','/Innen','/In','Innen','In','innen']
test_list = ['Softwareentwickler',
 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Softwareentwickler',
 'Softwareentwickler',
 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Softwareentwickler',
 'Softwareentwickler',
 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Hard-Softwareentwickler',
 'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
 'Hard-Softwareentwickler',
 'Hard-Softwareentwickler',
 'Hard-Softwareentwickler']

result = [vac if any([substring in vac for substring in ['-In',' In']]) else re.sub('|'.join(rm_gender),'',vac) if vac[:2] not in 'In' else 'In' + re.sub('|'.join(rm_gender),'',vac) for vac in test_list]

But it doesn't work, because there is a space in front of words like 'SoftwareentwicklerInnen'. How can I do this correctly with regex?

Important: I want to keep the format of each string as it is. I just need to remove the feminine endings, i.e. return the corrected list of strings.

CodePudding user response：

Try this one:

import re

test_list = test_list[1].split(";")  # index 1 is the first entry that contains several titles
test_list.append("Informatikerin") # adding one ending with "in" - I don't know if this is a correct word!

pattern = re.compile("in(?:nen)?$", re.IGNORECASE)

[re.sub(pattern, "", x) for x in test_list]

OUTPUT

['Data Scientists', ' DWH-BI Consultants', ' Softwareentwickler', ' Informatiker', ' Statistiker', 'Informatiker']

FOLLOW UP

If you want to rebuild the string as it was, just rejoin with ";":

";".join([re.sub(pattern, "", x) for x in test_list])

OUTPUT

'Data Scientists; DWH-BI Consultants; Softwareentwickler; Informatiker; Statistiker;Informatiker'

If the idea is to match all the words in each line:

pattern = re.compile(r"in(?:nen)?(?=[;.,: ]|$)", re.IGNORECASE)

re.sub(pattern, "", "You are a Softwareentwicklerinnen: that is as nice as Informatikerin")
re.sub(pattern, "", "You are a Softwareentwicklerinnen; that is as nice as Informatikerin")

OUTPUT

'You are a Softwareentwickler: that is as nice as Informatiker'
'You are a Softwareentwickler; that is as nice as Informatiker'
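
Since the question asks to keep each string's format, the lookahead version can be applied to every list entry directly, without splitting and rejoining. The following is only a sketch: the helper name strip_endings and the shortened sample list are made up for illustration, and the lookahead is written as a raw string with a character class, which accepts the same separators.

```python
import re

# Same separators as the lookahead above, written as a character class.
pattern = re.compile(r"in(?:nen)?(?=[;.,: ]|$)", re.IGNORECASE)

def strip_endings(strings):
    """Return a new list with the endings removed; separators and spacing stay as they are."""
    return [pattern.sub("", s) for s in strings]

sample = [
    "Softwareentwickler",
    "Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker",
    "Hard-Softwareentwickler",
]
for line in strip_endings(sample):
    print(line)
```

Note that this pattern does not check what comes before the ending, so the standalone word "in" is removed as well ("Entwickler in Wien" loses its "in").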

CodePudding user response：

You could convert matches of the following regular expression to empty strings:

\/?[Ii](?:nnen|n)\b


This regex can be broken down as follows.

\/?         # optionally match '/'
[Ii]        # match 'I' or 'i'
(?:nnen|n)  # match 'nnen' or 'n' (in that order)
\b          # match a word boundary

The word boundary prevents matches inside words such as 'innenantenne'.
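
In Python, the expression above could be used roughly like this. It is a minimal sketch: the names feminine_ending and remove_endings and the sample strings are chosen here for illustration only.

```python
import re

# Optional '/', then 'I' or 'i', then 'nnen' or 'n', ending at a word boundary.
feminine_ending = re.compile(r'/?[Ii](?:nnen|n)\b')

def remove_endings(strings):
    """Replace every match with the empty string, leaving the rest of each string alone."""
    return [feminine_ending.sub('', s) for s in strings]

samples = [
    'Softwareentwickler',
    'SoftwareentwicklerInnen; InformatikerInnen; Statistiker',
    'Entwickler/in',
    'Innenantenne',
]
for cleaned in remove_endings(samples):
    print(cleaned)
```

One thing to keep in mind: nothing in this expression requires a letter before the ending, so the standalone word "in" (or "In") is matched and removed too.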

CodePudding user response：

You can use

rm_gender_regex = re.compile( r'(?:\b/|\B)i(?:nne)?n\b', re.I )
result = [rm_gender_regex.sub('', vac) for vac in test_list]

Details:

(?:\b/|\B) - either a / that is preceded by a word char, or a position that is preceded by a word char (so the ending cannot start a word)
i - the letter i (I also matches because of re.I)
(?:nne)? - an optional nne substring
n - an n char
\b - a word boundary.
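
To see what the leading (?:\b/|\B) part buys you, here is a small sketch that runs the same pattern over a few edge cases. The inputs are invented for illustration and are not from the question.

```python
import re

rm_gender_regex = re.compile(r'(?:\b/|\B)i(?:nne)?n\b', re.I)

# input -> result after substitution
checks = {
    'SoftwareentwicklerInnen': 'Softwareentwickler',  # ending glued to the word
    'Entwickler/innen': 'Entwickler',                 # ending after a slash
    'Informatikerin': 'Informatiker',                 # leading "In" kept, trailing "in" removed
    'in Wien': 'in Wien',                             # standalone "in" is left alone
    'Innenantenne': 'Innenantenne',                   # "Innen" at the start of a word is left alone
}
for text, expected in checks.items():
    result = rm_gender_regex.sub('', text)
    print(f'{text!r} -> {result!r}')
    assert result == expected
```

The pattern only looks at the letters, so an ordinary word that happens to end in "in" is shortened as well: 'Berlin' becomes 'Berl'. If such words can occur in your data, you would need an extra check, for example a list of exceptions.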

Full Python example:

import re
test_list = ['Softwareentwickler', 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Softwareentwickler', 'Softwareentwickler', 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Softwareentwickler', 'Softwareentwickler', 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker',  'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Hard-Softwareentwickler', 'Data Scientist; DWH-BI Consultant; SoftwareentwicklerInnen; InformatikerInnen; Statistiker', 'Hard-Softwareentwickler', 'Hard-Softwareentwickler', 'Hard-Softwareentwickler']
rm_gender_regex = re.compile( r'(?:\b/|\B)i(?:nne)?n\b', re.I )
result = [rm_gender_regex.sub('', vac) for vac in test_list]
for x in result:
    print(x)

Output:

Softwareentwickler
Data Scientists; DWH-BI Consultants; Softwareentwickler; Informatiker; Statistiker
Data Scientists; DWH-BI Consultants; Softwareentwickler; Informatiker; Statistiker
Data Scientists; DWH-BI Consultants; Softwareentwickler; Informatiker; Statistiker
Softwareentwickler
Softwareentwickler
Data Scientists; DWH-BI Consultants; Softwareentwickler; Informatiker; Statistiker
Data Scientists; DWH-BI Consultants; Softwareentwickler; Informatiker; Statistiker
Softwareentwickler
Softwareentwickler
Data Scientists; DWH-BI Consultants; Softwareentwickler; Informatiker; Statistiker
Data Scientists; DWH-BI Consultants; Softwareentwickler; Informatiker; Statistiker
Data Scientists; DWH-BI Consultants; Softwareentwickler; Informatiker; Statistiker
Data Scientist; DWH-BI Consultant; Softwareentwickler; Informatiker; Statistiker
Data Scientist; DWH-BI Consultant; Softwareentwickler; Informatiker; Statistiker
Data Scientist; DWH-BI Consultant; Softwareentwickler; Informatiker; Statistiker
Data Scientist; DWH-BI Consultant; Softwareentwickler; Informatiker; Statistiker
Hard-Softwareentwickler
Data Scientist; DWH-BI Consultant; Softwareentwickler; Informatiker; Statistiker
Hard-Softwareentwickler
Hard-Softwareentwickler
Hard-Softwareentwickler
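
If you would rather keep the rm_gender list from the question as the single source of endings, the pattern can also be built from it. This is just a sketch under a few assumptions: the endings are matched case-sensitively exactly as listed (so a plain lowercase "in" is not covered, because it is not in the list), an ending must follow a word character, and the names suffix_pattern and strip_feminine are made up here.

```python
import re

rm_gender = ['/innen', '/in', '/Innen', '/In', 'Innen', 'In', 'innen']

# Escape each ending and try longer ones first, so '/innen' wins over '/in'.
alternatives = sorted(map(re.escape, rm_gender), key=len, reverse=True)

# (?<=\w) requires a word character before the ending; \b requires the ending to close the word.
suffix_pattern = re.compile(r'(?<=\w)(?:' + '|'.join(alternatives) + r')\b')

def strip_feminine(text):
    """Remove a listed ending wherever it closes a word; everything else is kept as is."""
    return suffix_pattern.sub('', text)

sample = 'Data Scientists; DWH-BI Consultants; SoftwareentwicklerInnen; InformatikerInnen; Statistiker'
print(strip_feminine(sample))
```

The lookbehind is what keeps the "In" at the start of 'InformatikerInnen' in place, which is the case the original list comprehension tried to handle by hand.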