For loop isn't filtering out keywords in series of pandas dataframe-CodePudding

Here is mock data representing my dataframe:

Arkansa  4  McKissic Creek Rd       HVAC/Plumbing
Florida  0  1130 Birdie Hills Rd    Lawn Care
Missouri 3  1140 Birdie Hills Rd    Interiors
Texas.   2  11625 Rainbow Ridge     Electrical

Here is my script that is supposed to be selecting the entire row out of the original dataframe (bool_filtered_df) and appending into a list (which I later convert to its own dataframe) if the sub-vertical (column 4) is "Electrical", "Duct Cleaning", "Water Treatment", or "Appliance Repair" however it is not filtering out any that I can tell and I can't see why. Any help appreciated. Thanks!

for d in range(0, len(bool_filtered_df)):
    try:
        if "Electrical" or "Duct Cleaning" or "Water Treatment" or "Appliance Repair" in bool_filtered_df['Sub-Vertical'][d]:
            Geneva_Trais.append(bool_filtered_df.iloc[d])
        else:
            print("Wrong Sub-Vert")
    except TypeError as t:
        print(t)
        continue

CodePudding user response：

This statement:

        if "Electrical" or "Duct Cleaning" or "Water Treatment" or "Appliance Repair" in bool_filtered_df['Sub-Vertical'][d]:

is interpreted as:

        if ("Electrical") or ("Duct Cleaning") or ("Water Treatment") or ("Appliance Repair" in bool_filtered_df['Sub-Vertical'][d]):

Since "Electrical" is true, the if statement will always be true. You can try this:

    if bool_filtered_df['Sub-Vertical'][d] in ('Electrical', 'Duct Cleaning', 'Water Treatment', 'Appliance Repair'):

CodePudding user response：

You can use built-in boolean filtering with Series.isin()

bool_filtered_df[bool_filtered_df['Sub-Vertical'].isin(['Electrical', 'Duct Cleaning', 'Water Treatment', 'Appliance Repair'])]

CodePudding user response：

Fixing the original problem

Tim Roberts described the problem.

This is sometimes called the "or gotcha". It is tempting for beginners to forget that Python is a programming language with strict rules, so they guess incorrectly at syntax that resembles English writing. This is one such incorrect guess.

One correct translation of your code would be:

Geneva_Trais = []
for d in range(0, len(bool_filtered_df)):
    sub_vert = bool_filtered_df['Sub-Vertical'].iloc[d]
    if (
        "Electrical" in sub_vert or
        "Duct Cleaning" in sub_vert or
        "Water Treatment" in sub_vert or
        "Appliance Repair" in sub_vert
    ):
        Geneva_Trais.append(bool_filtered_df.iloc[d])
    else:
        print(f"Wrong Sub-Vert: {idx}, {sub_vert}")

Note that you were mixing [] and .iloc[] in your original code as well, which is a very bad idea. [] on a Series is equivalent to .loc[], which is only equivalent to .iloc[] in very specific circusmtances. These circumstances happen to be the default, so newbies tend to gloss over the very important distinction.

Looping more efficiently

You can also loop over elements of a Series without using .iloc at each iteration, which is both faster and tidier-looking:

Geneva_Trais = []
for idx, sub_vert in bool_filtered_df['Sub-Vertical'].items():
    if (
        ("Electrical" in sub_vert) or
        ("Duct Cleaning" in sub_vert) or
        ("Water Treatment" in sub_vert) or
        ("Appliance Repair" in sub_vert)
    ):
        Geneva_Trais.append(bool_filtered_df.loc[idx])
    else:
        print(f"Wrong Sub-Vert: {idx}, {sub_vert}")

G. Anderson's loop-less approximation

If it makes sense to compare these strings exactly, with ==, then we can use the isin method as described in G. Anderson's answer and avoid looping entirely:

sub_verts = [
    'Electrical', 'Duct Cleaning', 'Water Treatment', 'Appliance Repair'
]

Geneva_Trais = bool_filtered_df.loc[
    bool_filtered_df['Sub-Vertical'].isin(sub_verts)
]

Based on the sample data you provided, this might have been your intention.

This technique also returns a DataFrame, not a list of Series.

Note that I personally prefer using .loc[] and not [] for Boolean subsetting of rows. I only use [] for selecting columns by name. This is to avoid selecting the wrong thing by mistake.

An exact loop-less translation

However, your original code does not have the same behavior as the .isin technique. Your code checks if each of the 4 strings is a sub-string of the value in the Series. The correct loop-less translation of your code would be:

def check_sub_vert(sub_vert):
    return (
        "Electrical" in sub_vert or
        "Duct Cleaning" in sub_vert or
        "Water Treatment" in sub_vert or
        "Appliance Repair" in sub_vert
    )

is_ok = bool_filtered_df['Sub-Vertical'].apply(check_sub_vert)
Geneva_Trais = bool_filtered_df.loc[is_ok]

The 4 in checks are still there, but we have moved them to a separate function. We can use .apply to gather the results of calling this function on each element, without using a for loop. This can be much faster than the loop in many cases, and in general is the preferred style of using Pandas. Sometimes this style is called "vectorized".