Pandas remove every entry with a specific value-CodePudding

I would like to go through every row (entry) in my df and remove every entry that has the value of " " (which yes is an empty string).

So if my data set is:

Name Gender Age
Jack         5
Anna    F    6
Carl    M    7
Jake    M    7

Therefore Jack would be removed from the dataset.

On another note, I would also like to remove entries that has the value "Unspecified" and "Undetermined" as well.

Eg:

Name Gender Age    Address
Jack         5    *address*
Anna    F    6    *address* 
Carl    M    7   Undetermined
Jake    M    7   Unspecified

Now, Jack will be removed due to empty field. Carl will be removed due to the value Undetermined present in a column. Jake will be removed due to the value Unspecified present in a column.

For now, this has been my approach but I keep getting a TypeError.

list = []
for i in df.columns:
    if df[i] == "":
        # everytime there is an empty string, add 1 to list
        list.append(1)
# count list to see how many entries there are with empty string
len(list)

Please help me with this. I would prefer a for loop being used due to there being about 22 columns and 9000 rows in my actual dataset.

Note - I do understand that there are other questions asked like this, its just that none of them apply to my situation, meaning that most of them are only useful for a few columns and I do not wish to hardcode all 22 columns.

Edit - Thank you for all your feedbacks, you all have been incredibly helpful.

CodePudding user response：

Well, OP actually wants to delete any column with "empty" string.

df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.

If you want to delete specifically for address column, then you can just delete using

df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]

Or if any column with Undetermined or Unspecified, try similar as the first solution in my post, just by replacing the empty string with Undertermined or Unspecified.

df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]

CodePudding user response：

You can build masks and then filter the df according to it:

m1 = df.eq('').any(axis=1) 
# m1 is True if any cell in a row has an empty string

m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'

out = df[~m1 & ~m2] # invert both condition and get the desired output
print(out)

Output:

   Name Gender  Age    Address
1  Anna      F    6  *address*

Used Input:

df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
 'Gender': ['', 'F', 'M', 'M'],
 'Age': [5, 6, 7, 7],
 'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']}
)

CodePudding user response：

To delete a row based on a condition use the following:

df = df.drop(df[condition].index)

For example: df = df.drop(df[Age==5].index) , will drop the row where the Age is 5.

I've come across a post regarding the same dating back to 2017, it should help you understand it more clearer.

CodePudding user response：

Regarding question 2, here's how to remove rows with the specified values in a given column:

df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]

CodePudding user response：

Let's assume we have a Pandas DataFrame object df.

To remove every row given your conditions, simply do:

df = df[df.Gender == " " or df.df.Age == " " or df.Address in [" ", "Undetermined", "Unspecified"]]

If the unspecified fields are NaN, you can also do:

df = df.dropna(how="any", axis = 0)

CodePudding user response：

Answer from @ThatCSFresher or @Bence will help you out in removing rows based on single column... Which is great!

However, I think there are multiple condition in your query needed to check across multiple columns at once in a loop. So, probably apply-lambda can do the job; Try the following code;

df = pd.DataFrame({"Name":["Jack","Anna","Carl","Jake"],
                   "Gender":["","F","M","M"],
                   "Age":[5,6,7,7],
                   "Address":["address","address","Undetermined","Unspecified"]})

df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise",axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]

# Output of df;
    Name    Gender  Age Address   Noise_Tag
0   Jack        5   address       Noise
1   Anna    F   6   address       No Noise
2   Carl    M   7   Undetermined  Noise
3   Jake    M   7   Unspecified   Noise

# Output of df1;
    Name    Gender  Age Address
1   Anna    F   6   address

CodePudding user response：

using lambda fun

Code:

df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]

Output:

    Name    Gender  Age Address
1   Anna    F       6   *add