Searching for multiple strings over multiple columns in Python-CodePudding

I am trying to find multiple patterns (the first 3 elements of a string) over multiple columns. Up till now, I am able to find one pattern over multiple columns with the following code:

df['C07_location'] = df[colnames_locations].applymap(lambda x: 'C07' in x).any(1).astype(int)

In this case, it looks for the string C07 in all the columns with locations. However I have 30 of these locations I want to look for, which looks something like this:

unique_locations = ['C07', 'C08', 'C11', 'C14']

This is an example of what the original dataset looks like:

    location_1     location_2      location_3   ...
0         C110           C072             NaN 
1          NaN            NaN             NaN
2         C147           C144            C112
3         C082           C079             NaN
4         C071           C110            C145
...        ...            ...             ...

I would like to create a new column for each unique location, with the end result looking like this:

    location_1     location_2      location_3   C07_location  C08_location  C11_location  ...
0         C110           C072             NaN              1             0             1 
1          NaN            NaN             NaN              0             0             0
2         C147           C144            C112              0             0             1
3         C082           C079             NaN              1             1             0
4         C071           C110            C145              1             0             1
...        ...            ...             ...            ...           ...           ...

Any guidance in the right direction is much appreciated!

CodePudding user response：

Use Series.str.extract for get columnsnames by list and then pass to get_dummies:

unique_locations = ['C07', 'C08', 'C11', 'C14']

f = lambda x: x.str.extract(f'({"|".join(unique_locations)})', expand=False)
df = (df.join(pd.get_dummies(df.apply(f), prefix='', prefix_sep='')
                .groupby(axis=1, level=0)
                .max()
                .add_suffix('_location')))


print (df)
  location_1 location_2 location_3  C07_location  C08_location  C11_location  \
0       C110       C072        NaN             1             0             1   
1        NaN        NaN        NaN             0             0             0   
2       C147       C144       C112             0             0             1   
3       C082       C079        NaN             1             1             0   
4       C071       C110       C145             1             0             1   

   C14_location  
0             0  
1             0  
2             1  
3             0  
4             1

CodePudding user response：

If i understood your question correctly, you want to create separate columns for unique_locations. If that is the case then you can simply use a for loop

for loc in unique_locations:
   df[loc '_location']=(df[colnames_locations].values == loc).any(1).astype(int)