Existing Dataframe:
Id status countries
01 pass ['xyx','Indonesia','brazil']
02 fail ['PQ','XT','sri lanka']
03 pass ['spain', 'india','xtx']
Expected Dataframe :
Id status countries filtered_countries_name
01 pass ['xyx','Indonesia','brazil'] 'Indonesia','brazil'
02 fail ['PQ','XT','sri lanka'] 'sri lanka'
03 pass ['spain', 'india','xtx'] 'spain', 'india'
i do have master list of specific countries(those i want to check) from where i am comparing the existing list in countries column.
my approach :
countries_list = ['china', 'india', 'united states', 'indonesia', 'brazil', 'pakistan', 'nigeria', 'bangladesh', 'russia', 'japan', 'mexico', 'philippines', 'vietnam', 'ethiopia', 'egypt', 'germany', 'iran', 'turkey', 'democratic republic of the congo', 'thailand', 'france', 'united kingdom', 'italy', 'burma', 'south africa', 'south korea', 'colombia', 'spain', 'ukraine', 'tanzania', 'kenya', 'argentina', 'algeria', 'poland', 'sudan', 'uganda','Indonesia','brazil','spain','sri lanka']
import re
countries_re = '|'.join(str(v) for v in countries_list )
df['filtered_countries_name'] = df['countries'].str.extractall(countries_re)
but unable to fetch with the same with this error
TypeError: incompatible index of inserted column with frame index
any leads..??
CodePudding user response:
If you have lists, use a list comprehension with a set
as reference for efficiency:
S = set(countries_list)
df['filtered_countries_name'] = [[c for c in l if c.lower() in S]
for l in df['countries']]
Output:
Id status countries filtered_countries_name
0 1 pass [xyx, Indonesia, brazil] [Indonesia, brazil]
1 2 fail [PQ, XT, sri lanka] [sri lanka]
2 3 pass [spain, india, xtx] [spain, india]
CodePudding user response:
Using set intersections:
df = pd.DataFrame({'Id': {0: 1, 1: 2, 2: 3},
'status': {0: 'pass', 1: 'fail', 2: 'pass'},
'countries': {0: ['xyx', 'Indonesia', 'brazil'],
1: ['PQ', 'XT', 'sri lanka'],
2: ['spain', 'india', 'xtx']}})
countries_list = ['china', 'india', 'united states', 'indonesia', 'brazil', 'pakistan', 'nigeria', 'bangladesh', 'russia', 'japan', 'mexico', 'philippines', 'vietnam', 'ethiopia', 'egypt', 'germany', 'iran', 'turkey', 'democratic republic of the congo', 'thailand', 'france', 'united kingdom', 'italy', 'burma', 'south africa', 'south korea', 'colombia', 'spain', 'ukraine', 'tanzania', 'kenya', 'argentina', 'algeria', 'poland', 'sudan', 'uganda','Indonesia','brazil','spain','sri lanka']
df["filtered_names"] = df["countries"].apply(lambda x: list(set(x) & set(countries_list)))
df
# Id status countries filtered_names
# 0 1 pass [xyx, Indonesia, brazil] [Indonesia, brazil]
# 1 2 fail [PQ, XT, sri lanka] [sri lanka]
# 2 3 pass [spain, india, xtx] [india, spain]