I have a list names.
import re

names = ['Dr. Augsten, BÜNDNIS 90/DIE GRÜNEN', 'Dirk Adams, GRÜNE', 'Blechschmidt, DIE LINKE', 'Steffen Harzer, LINKE', 'Gerd Schuchardt, Minister für Wissenschaft, Forschung und Kultur', 'David-Christian Eckardt, SPD', 'Christine Ursula Klaus, SPD', 'Klaus von der Krone, CDU', 'Antje Ehrlich-Strathausen, SPD', 'Benno Lemke, PDS']
# Normalise the party names
names = [re.sub(r'(?<!DIE)\sLINKE', ' DIE LINKE', line) for line in names]
names = [re.sub(r'(?<!DIE)\sGRÜNE', ' BÜNDNIS 90/DIE GRÜNEN', line) for line in names]
names = [re.sub(r'Die Linke', 'DIE LINKE', line) for line in names]
names = [re.sub(r'PDS', 'DIE LINKE', line) for line in names]
# Drop the academic title
names = [re.sub(r'Dr\.\s', '', line) for line in names]
# Drop the first word (intended: the given name)
actual_names = [re.sub(r'((?:^|(?:[.!?]\s))(\w+)\s)', '', line) for line in names]
print(actual_names)
actual_names = ['Augsten, BÜNDNIS 90/DIE GRÜNEN', 'Adams, BÜNDNIS 90/DIE GRÜNEN', 'Blechschmidt, DIE LINKE', 'Harzer, DIE LINKE', 'Schuchardt, Minister für Wissenschaft, Forschung und Kultur', 'David-Christian Eckardt, SPD', 'Ursula Klaus, SPD', 'von der Krone, CDU', 'Ehrlich-Strathausen, SPD', 'Lemke, DIE LINKE']
Questions:
- How do I need to change the regex to account for names that contain a hyphen (-), e.g. 'David-Christian Eckardt, SPD'?
- How do I need to change the code to keep the original elements?
desired_names = ['Augsten, BÜNDNIS 90/DIE GRÜNEN', 'Adams, BÜNDNIS 90/DIE GRÜNEN', 'Adams, GRÜNE', 'Blechschmidt, DIE LINKE', 'Harzer, DIE LINKE', 'Harzer, LINKE', 'Schuchardt, Minister für Wissenschaft, Forschung und Kultur', 'Eckardt, SPD', 'Klaus, SPD', 'von der Krone, CDU', 'Ehrlich-Strathausen, SPD', 'Lemke, PDS', 'Lemke, DIE LINKE']
The order within the list does not matter.
CodePudding user response:
Is a regex necessary in this case? You can use str.split with the maxsplit=1 parameter:
names = [
"Dr. Augsten, BÜNDNIS 90/DIE GRÜNEN",
"Dirk Adams, GRÜNE",
"Blechschmidt, DIE LINKE",
"Steffen Harzer, LINKE",
"Gerd Schuchardt, Minister für Wissenschaft, Forschung und Kultur",
"David-Christian Eckardt, SPD",
"Christine Ursula Klaus, SPD",
"Klaus von der Krone, CDU",
"Antje Ehrlich-Strathausen, SPD",
"Benno Lemke, PDS",
]
m = {"LINKE": "DIE LINKE", "GRÜNE": "BÜNDNIS 90/DIE GRÜNEN", "PDS": "DIE LINKE"}
out = [n.split(", ", maxsplit=1) for n in names]
out = [", ".join([a.split()[-1], m.get(b, b)]) for a, b in out]
print(out)
Prints:
[
"Augsten, BÜNDNIS 90/DIE GRÜNEN",
"Adams, BÜNDNIS 90/DIE GRÜNEN",
"Blechschmidt, DIE LINKE",
"Harzer, DIE LINKE",
"Schuchardt, Minister für Wissenschaft, Forschung und Kultur",
"Eckardt, SPD",
"Klaus, SPD",
"Krone, CDU",
"Ehrlich-Strathausen, SPD",
"Lemke, DIE LINKE",
]
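Note that the output above differs from desired_names in two ways: "von der Krone" is shortened to "Krone", and the original party labels (GRÜNE, LINKE, PDS) are replaced rather than kept. A possible extension of the same split-based idea is sketched below. The helper names surname and expand, and the dictionary name PARTY_ALIASES, are made up for this illustration. The sketch assumes that lowercase particles such as "von der" belong to the surname and that every entry contains ", " after a non-empty name.

```python
# Hypothetical mapping from short party labels to the canonical ones.
PARTY_ALIASES = {
    "LINKE": "DIE LINKE",
    "GRÜNE": "BÜNDNIS 90/DIE GRÜNEN",
    "PDS": "DIE LINKE",
}

names = [
    "Dr. Augsten, BÜNDNIS 90/DIE GRÜNEN",
    "Dirk Adams, GRÜNE",
    "Blechschmidt, DIE LINKE",
    "Steffen Harzer, LINKE",
    "Gerd Schuchardt, Minister für Wissenschaft, Forschung und Kultur",
    "David-Christian Eckardt, SPD",
    "Christine Ursula Klaus, SPD",
    "Klaus von der Krone, CDU",
    "Antje Ehrlich-Strathausen, SPD",
    "Benno Lemke, PDS",
]


def surname(full_name):
    """Return the surname, keeping lowercase particles such as 'von der'."""
    tokens = full_name.split()
    for i, token in enumerate(tokens):
        # Assumption: the first lowercase token starts the surname.
        if token[0].islower():
            return " ".join(tokens[i:])
    # Otherwise the last token is taken as the surname.
    return tokens[-1]


def expand(entries):
    """Shorten each name; add a second entry when the party has an alias."""
    result = []
    for entry in entries:
        # Split at the first ", " only, so roles containing commas survive.
        person, _, role = entry.partition(", ")
        last = surname(person)
        result.append(f"{last}, {role}")  # keeps the original party label
        canonical = PARTY_ALIASES.get(role)
        if canonical is not None:
            result.append(f"{last}, {canonical}")
    return result


print(expand(names))
```

Hyphenated given names need no special handling here because the name is split on whitespace only. If you prefer to stay with a regex, a character class such as [\w-]+ matches hyphenated given names like David-Christian, which \w alone does not.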