I am trying to find multiple patterns (the first 3 elements of a string) over multiple columns. Up till now, I am able to find one pattern over multiple columns with the following code:
df['C07_location'] = df[colnames_locations].applymap(lambda x: 'C07' in x).any(1).astype(int)
In this case, it looks for the string C07 in all the columns with locations. However I have 30 of these locations I want to look for, which looks something like this:
unique_locations = ['C07', 'C08', 'C11', 'C14']
This is an example of what the original dataset looks like:
location_1 location_2 location_3 ...
0 C110 C072 NaN
1 NaN NaN NaN
2 C147 C144 C112
3 C082 C079 NaN
4 C071 C110 C145
... ... ... ...
I would like to create a new column for each unique location, with the end result looking like this:
location_1 location_2 location_3 C07_location C08_location C11_location ...
0 C110 C072 NaN 1 0 1
1 NaN NaN NaN 0 0 0
2 C147 C144 C112 0 0 1
3 C082 C079 NaN 1 1 0
4 C071 C110 C145 1 0 1
... ... ... ... ... ... ...
Any guidance in the right direction is much appreciated!
CodePudding user response:
Use Series.str.extract
for get columnsnames by list and then pass to get_dummies
:
unique_locations = ['C07', 'C08', 'C11', 'C14']
f = lambda x: x.str.extract(f'({"|".join(unique_locations)})', expand=False)
df = (df.join(pd.get_dummies(df.apply(f), prefix='', prefix_sep='')
.groupby(axis=1, level=0)
.max()
.add_suffix('_location')))
print (df)
location_1 location_2 location_3 C07_location C08_location C11_location \
0 C110 C072 NaN 1 0 1
1 NaN NaN NaN 0 0 0
2 C147 C144 C112 0 0 1
3 C082 C079 NaN 1 1 0
4 C071 C110 C145 1 0 1
C14_location
0 0
1 0
2 1
3 0
4 1
CodePudding user response:
If i understood your question correctly, you want to create separate columns for unique_locations
.
If that is the case then you can simply use a for loop
for loc in unique_locations:
df[loc '_location']=(df[colnames_locations].values == loc).any(1).astype(int)