Home > Software design >  checking the data in the master list and comparing it with the column within dataframe
checking the data in the master list and comparing it with the column within dataframe

Time:01-18

Existing Dataframe:

Id      status            countries
01      pass        ['xyx','Indonesia','brazil']
02      fail        ['PQ','XT','sri lanka']
03      pass        ['spain', 'india','xtx']

Expected Dataframe :

Id      status            countries                      filtered_countries_name
01      pass        ['xyx','Indonesia','brazil']           'Indonesia','brazil'
02      fail        ['PQ','XT','sri lanka']                    'sri lanka'
03      pass        ['spain', 'india','xtx']                'spain', 'india'

i do have master list of specific countries(those i want to check) from where i am comparing the existing list in countries column.

my approach :

countries_list = ['china', 'india', 'united states', 'indonesia', 'brazil', 'pakistan', 'nigeria', 'bangladesh', 'russia', 'japan', 'mexico', 'philippines', 'vietnam', 'ethiopia', 'egypt', 'germany', 'iran', 'turkey', 'democratic republic of the congo', 'thailand', 'france', 'united kingdom', 'italy', 'burma', 'south africa', 'south korea', 'colombia', 'spain', 'ukraine', 'tanzania', 'kenya', 'argentina', 'algeria', 'poland', 'sudan', 'uganda','Indonesia','brazil','spain','sri lanka']

import re
countries_re = '|'.join(str(v) for v in countries_list )
df['filtered_countries_name'] = df['countries'].str.extractall(countries_re)

but unable to fetch with the same with this error

TypeError: incompatible index of inserted column with frame index

any leads..??

CodePudding user response:

If you have lists, use a list comprehension with a set as reference for efficiency:

S = set(countries_list)

df['filtered_countries_name'] = [[c for c in l if c.lower() in S]
                                 for l in df['countries']]

Output:

   Id status                 countries filtered_countries_name
0   1   pass  [xyx, Indonesia, brazil]     [Indonesia, brazil]
1   2   fail       [PQ, XT, sri lanka]             [sri lanka]
2   3   pass       [spain, india, xtx]          [spain, india]

CodePudding user response:

Using set intersections:

df = pd.DataFrame({'Id': {0: 1, 1: 2, 2: 3},
 'status': {0: 'pass', 1: 'fail', 2: 'pass'},
 'countries': {0: ['xyx', 'Indonesia', 'brazil'],
  1: ['PQ', 'XT', 'sri lanka'],
  2: ['spain', 'india', 'xtx']}})

countries_list = ['china', 'india', 'united states', 'indonesia', 'brazil', 'pakistan', 'nigeria', 'bangladesh', 'russia', 'japan', 'mexico', 'philippines', 'vietnam', 'ethiopia', 'egypt', 'germany', 'iran', 'turkey', 'democratic republic of the congo', 'thailand', 'france', 'united kingdom', 'italy', 'burma', 'south africa', 'south korea', 'colombia', 'spain', 'ukraine', 'tanzania', 'kenya', 'argentina', 'algeria', 'poland', 'sudan', 'uganda','Indonesia','brazil','spain','sri lanka']

df["filtered_names"] = df["countries"].apply(lambda x: list(set(x) & set(countries_list)))
df

# Id  status    countries                   filtered_names
# 0 1   pass    [xyx, Indonesia, brazil]    [Indonesia, brazil]
# 1 2   fail    [PQ, XT, sri lanka]         [sri lanka]
# 2 3   pass    [spain, india, xtx]         [india, spain]
  • Related