I'm trying to determine the value of a column based on the value of 3 other columns.
The logic is, if ColC
contains the string value 'Deck', populate the new column (ColD
) with the value 'Deck'. If not, check ColB
for the word 'Deck'. If it's not there, check ColA
. If not there, just set ColD
to empty.
However, I then want to repeat the same process, but for the word 'Beam'. Then again for the word 'Wall', then one last time with the word 'Grating'.
Here's a snippet of the data:
ColA ColB ColC
Deck DECK PENETRATION
DECK
Deck
WALL PENETRATION
PRIMARY GRATING
Beam PRIMARY BEAM PENE
This is what the above should end up looking like:
ColA ColB ColC ColD
Deck DECK PENETRATION Deck
DECK Deck
Deck Deck
WALL PENETRATION Wall
PRIMARY GRATING Grating
Beam PRIMARY BEAM PENE Beam
There may be other words I'd like to check against ColA-ColC, so ideally the solution would involve passing a list of words to a function of some sort. The list right now would look like ['Deck', 'Beam', 'Wall', 'Grating']
Although unlikely, if the code finds that one column (e.g. ColA
) value has the word 'Grating' and another column (e.g. ColB
) has the word 'Deck' - it would just fill ColD
with the word 'Conflict'.
I think I can do this using np.where
but I'm struggling to set something up that works. For example, if the code is checking for the word 'Grating' and doesn't find it, I don't want it to overwrite ColD
with nothing, where it may have already been populated with the word Beam
, Deck
, etc.
Edit:
So something like this almost works if I use in conjunction with df.apply
setting axis = 1
.
However, in the new column, it only contains values 'Deck' and 'Unset' after running it. Something is off with my logic and understanding of how apply
works.
def my_function(row):
li = ['Deck', 'Grating', 'Wall', 'Beam']
for x in li:
if x.lower() in row.A.lower():
return x
elif x.lower() in row.B.lower():
return x
elif x.lower() in row.C.lower():
return x
else:
return 'unset'
df['ColD'] = df.apply(my_function, axis = 1)
CodePudding user response:
Here's one approach. I changed your input so that the first row would lead to a conflict
import pandas as pd
import io
df = pd.read_csv(io.StringIO(
"""
ColA,ColB,ColC
Deck,DECK PENETRATION, BEAM
,DECK,
Deck,,
,WALL PENETRATION,
,PRIMARY GRATING,
Beam,,PRIMARY BEAM PENE
"""))
df['ColD'] = None
words = ['Deck', 'Beam', 'Wall', 'Grating']
#join the rows together as strings to search through
row_text = df.fillna('').sum(axis=1).str.lower()
for word in words:
contains_inds = row_text.str.contains(word.lower())
inds_to_fill = contains_inds & df['ColD'].isnull()
inds_to_conflict = contains_inds & df['ColD'].notnull()
df.loc[inds_to_fill,'ColD'] = word
df.loc[inds_to_conflict,'ColD'] = 'Conflict'
df