Currently facing an issue (data cleaning) of looking at multiple columns in a df that contain the word "food" and then creating a new column named 'food' with the found value
I can search for each specific column however running into issues when searching through the entire dataset (looking at every column and not overriding the previous entry)
s = df['tag_1'].str.contains('food')
df['food'] = df.tag_1.where(s)
Current:
tag_1 | tag_2 | tag_3 |
---|---|---|
placeholder | placeholder | French food |
British food | placeholder | placeholder |
placeholder | German food | placeholder |
Ideal:
tag_1 | tag_2 | tag_3 | food |
---|---|---|---|
placeholder | placeholder | French food | French food |
British food | placeholder | placeholder | British food |
placeholder | German food | placeholder | German food |
CodePudding user response:
You can mask and bfill
to get the first match:
mask = df.apply(lambda c: c.str.contains('food'))
df['food'] = df.where(mask).bfill(axis=1).iloc[:, 0]
output:
tag_1 tag_2 tag_3 food
0 placeholder placeholder French food French food
1 British food placeholder placeholder British food
2 placeholder German food placeholder German food
CodePudding user response:
Here are some additional ways by first stacking the data.
s = df.stack()
df.assign(food = s.loc[s.str.contains('food').loc[lambda x: x].index].to_numpy())
or
df.assign(food = df.stack().where(lambda x: x.str.contains('food')).groupby(level=0).first())
Output:
tag_1 tag_2 tag_3 food
0 placeholder placeholder French food French food
1 British food placeholder placeholder British food
2 placeholder German food placeholder German food