I have a list_a
list_a = ['some phrase austria', 'another slovakia', 'three paris', 'four united arab emirates']
Then I have a list of countries
countries = ['aruba', 'afghanistan', 'angola', 'austria', 'åland islands', 'albania', 'andorra', 'united arab emirates', 'argentina', 'armenia', 'american samoa', 'antarctica', 'slovakia', 'antigua and barbuda', 'australia']
I want to check whether any item from the countries list occurs in each string of list_a, and if so remove that substring (the country name), i.e. replace it with ''.
So the correct output should look like this:
final_list = ['some phrase', 'another', 'three paris', 'four'] # since paris is not in the countries list
I tried with the following:
matching = [s for s in list_a if any(xs in s for xs in countries)]
print(matching)
So now I have a list of the three matching items:
['some phrase austria', 'another slovakia', 'four united arab emirates']
And I am stuck here but I guess the solution is close.
CodePudding user response:
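Nearly there. For each string, remove the first country that occurs in it, and keep the string unchanged if none does:
final_list = [next((s.replace(c, '').strip() for c in countries if c in s), s) for s in list_a]
print(final_list)
['some phrase', 'another', 'three paris', 'four']
Note that `c in s` is a plain substring test, so a country name that happens to appear inside a longer word would be removed as well.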
CodePudding user response:
If the strings are just space-separated words with no punctuation, it would be easiest to split() those strings, filter out all words that are in the countries list, and then " ".join them back together. This compares one word at a time, so a multi-word name such as 'united arab emirates' is not removed.
>>> [' '.join(w for w in s.split() if w not in countries) for s in list_a]
['some phrase', 'another', 'three paris', 'four united arab emirates']
If performance is an issue, you might want to convert countries to a set first to speed up the lookup.
If there is punctuation, you might instead construct a regex from the countries, in the form \b(c1|c2|...|cn)\b and replace those with the empty string.
>>> import re
>>> p = r"\b(" + "|".join(map(re.escape, countries)) + r")\b"
>>> [re.sub(p, "", w) for w in list_a]
['some phrase ', 'another ', 'three paris', 'four ']
This method might leave some trailing or double spaces in the strings, though, that would need replacing in a second step.
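Putting the regex substitution and the whitespace cleanup together might look like the sketch below. The function name strip_countries is made up for illustration, and sorting the names longest-first is an extra precaution I am assuming is wanted, so that a longer name is tried before a shorter one it contains.

```python
import re

def strip_countries(strings, countries):
    """Remove whole-word country names from each string, then tidy the spaces."""
    # Longest names first, so a longer name is tried before a shorter one it contains.
    names = sorted(countries, key=len, reverse=True)
    # re.escape guards against names that contain regex metacharacters.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
    cleaned = []
    for s in strings:
        without = pattern.sub("", s)
        # split() followed by join collapses runs of whitespace and trims both ends.
        cleaned.append(" ".join(without.split()))
    return cleaned

list_a = ['some phrase austria', 'another slovakia', 'three paris',
          'four united arab emirates']
countries = ['aruba', 'afghanistan', 'angola', 'austria', 'åland islands',
             'albania', 'andorra', 'united arab emirates', 'argentina',
             'armenia', 'american samoa', 'antarctica', 'slovakia',
             'antigua and barbuda', 'australia']

print(strip_countries(list_a, countries))
# ['some phrase', 'another', 'three paris', 'four']
```

Compiling the pattern once avoids rebuilding it for every string, which matters more as the country list grows.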