Here is mock data representing my dataframe:
Arkansa 4 McKissic Creek Rd HVAC/Plumbing
Florida 0 1130 Birdie Hills Rd Lawn Care
Missouri 3 1140 Birdie Hills Rd Interiors
Texas. 2 11625 Rainbow Ridge Electrical
Here is my script that is supposed to be selecting the entire row out of the original dataframe (bool_filtered_df) and appending into a list (which I later convert to its own dataframe) if the sub-vertical (column 4) is "Electrical", "Duct Cleaning", "Water Treatment", or "Appliance Repair" however it is not filtering out any that I can tell and I can't see why. Any help appreciated. Thanks!
for d in range(0, len(bool_filtered_df)):
try:
if "Electrical" or "Duct Cleaning" or "Water Treatment" or "Appliance Repair" in bool_filtered_df['Sub-Vertical'][d]:
Geneva_Trais.append(bool_filtered_df.iloc[d])
else:
print("Wrong Sub-Vert")
except TypeError as t:
print(t)
continue
CodePudding user response:
This statement:
if "Electrical" or "Duct Cleaning" or "Water Treatment" or "Appliance Repair" in bool_filtered_df['Sub-Vertical'][d]:
is interpreted as:
if ("Electrical") or ("Duct Cleaning") or ("Water Treatment") or ("Appliance Repair" in bool_filtered_df['Sub-Vertical'][d]):
Since "Electrical" is true, the if statement will always be true. You can try this:
if bool_filtered_df['Sub-Vertical'][d] in ('Electrical', 'Duct Cleaning', 'Water Treatment', 'Appliance Repair'):
CodePudding user response:
You can use built-in boolean filtering with Series.isin()
bool_filtered_df[bool_filtered_df['Sub-Vertical'].isin(['Electrical', 'Duct Cleaning', 'Water Treatment', 'Appliance Repair'])]
CodePudding user response:
Fixing the original problem
Tim Roberts described the problem.
This is sometimes called the "or
gotcha". It is tempting for beginners to forget that Python is a programming language with strict rules, so they guess incorrectly at syntax that resembles English writing. This is one such incorrect guess.
One correct translation of your code would be:
Geneva_Trais = []
for d in range(0, len(bool_filtered_df)):
sub_vert = bool_filtered_df['Sub-Vertical'].iloc[d]
if (
"Electrical" in sub_vert or
"Duct Cleaning" in sub_vert or
"Water Treatment" in sub_vert or
"Appliance Repair" in sub_vert
):
Geneva_Trais.append(bool_filtered_df.iloc[d])
else:
print(f"Wrong Sub-Vert: {idx}, {sub_vert}")
Note that you were mixing []
and .iloc[]
in your original code as well, which is a very bad idea. []
on a Series is equivalent to .loc[]
, which is only equivalent to .iloc[]
in very specific circusmtances. These circumstances happen to be the default, so newbies tend to gloss over the very important distinction.
Looping more efficiently
You can also loop over elements of a Series without using .iloc
at each iteration, which is both faster and tidier-looking:
Geneva_Trais = []
for idx, sub_vert in bool_filtered_df['Sub-Vertical'].items():
if (
("Electrical" in sub_vert) or
("Duct Cleaning" in sub_vert) or
("Water Treatment" in sub_vert) or
("Appliance Repair" in sub_vert)
):
Geneva_Trais.append(bool_filtered_df.loc[idx])
else:
print(f"Wrong Sub-Vert: {idx}, {sub_vert}")
G. Anderson's loop-less approximation
If it makes sense to compare these strings exactly, with ==
, then we can use the isin
method as described in G. Anderson's answer and avoid looping entirely:
sub_verts = [
'Electrical', 'Duct Cleaning', 'Water Treatment', 'Appliance Repair'
]
Geneva_Trais = bool_filtered_df.loc[
bool_filtered_df['Sub-Vertical'].isin(sub_verts)
]
Based on the sample data you provided, this might have been your intention.
This technique also returns a DataFrame, not a list of Series.
Note that I personally prefer using .loc[]
and not []
for Boolean subsetting of rows. I only use []
for selecting columns by name. This is to avoid selecting the wrong thing by mistake.
An exact loop-less translation
However, your original code does not have the same behavior as the .isin
technique. Your code checks if each of the 4 strings is a sub-string of the value in the Series. The correct loop-less translation of your code would be:
def check_sub_vert(sub_vert):
return (
"Electrical" in sub_vert or
"Duct Cleaning" in sub_vert or
"Water Treatment" in sub_vert or
"Appliance Repair" in sub_vert
)
is_ok = bool_filtered_df['Sub-Vertical'].apply(check_sub_vert)
Geneva_Trais = bool_filtered_df.loc[is_ok]
The 4 in
checks are still there, but we have moved them to a separate function. We can use .apply
to gather the results of calling this function on each element, without using a for
loop. This can be much faster than the loop in many cases, and in general is the preferred style of using Pandas. Sometimes this style is called "vectorized".