Home > Net >  Pandas remove every entry with a specific value
Pandas remove every entry with a specific value

Time:10-01

I would like to go through every row (entry) in my df and remove every entry that has the value of " " (which yes is an empty string).

So if my data set is:

Name Gender Age
Jack         5
Anna    F    6
Carl    M    7
Jake    M    7

Therefore Jack would be removed from the dataset.

On another note, I would also like to remove entries that has the value "Unspecified" and "Undetermined" as well.

Eg:

Name Gender Age    Address
Jack         5    *address*
Anna    F    6    *address* 
Carl    M    7   Undetermined
Jake    M    7   Unspecified

Now, Jack will be removed due to empty field. Carl will be removed due to the value Undetermined present in a column. Jake will be removed due to the value Unspecified present in a column.

For now, this has been my approach but I keep getting a TypeError.

list = []
for i in df.columns:
    if df[i] == "":
        # everytime there is an empty string, add 1 to list
        list.append(1)
# count list to see how many entries there are with empty string
len(list)

Please help me with this. I would prefer a for loop being used due to there being about 22 columns and 9000 rows in my actual dataset.

Note - I do understand that there are other questions asked like this, its just that none of them apply to my situation, meaning that most of them are only useful for a few columns and I do not wish to hardcode all 22 columns.

Edit - Thank you for all your feedbacks, you all have been incredibly helpful.

CodePudding user response:

Well, OP actually wants to delete any column with "empty" string.

df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.

If you want to delete specifically for address column, then you can just delete using

df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]

Or if any column with Undetermined or Unspecified, try similar as the first solution in my post, just by replacing the empty string with Undertermined or Unspecified.

df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]

CodePudding user response:

You can build masks and then filter the df according to it:

m1 = df.eq('').any(axis=1) 
# m1 is True if any cell in a row has an empty string

m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'

out = df[~m1 & ~m2] # invert both condition and get the desired output
print(out)

Output:

   Name Gender  Age    Address
1  Anna      F    6  *address*

Used Input:

df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
 'Gender': ['', 'F', 'M', 'M'],
 'Age': [5, 6, 7, 7],
 'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']}
)

CodePudding user response:

To delete a row based on a condition use the following:

df = df.drop(df[condition].index)

For example: df = df.drop(df[Age==5].index) , will drop the row where the Age is 5.

I've come across a post regarding the same dating back to 2017, it should help you understand it more clearer.

CodePudding user response:

Regarding question 2, here's how to remove rows with the specified values in a given column:

df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]

CodePudding user response:

Let's assume we have a Pandas DataFrame object df.

To remove every row given your conditions, simply do:

df = df[df.Gender == " " or df.df.Age == " " or df.Address in [" ", "Undetermined", "Unspecified"]]

If the unspecified fields are NaN, you can also do:

df = df.dropna(how="any", axis = 0)

CodePudding user response:

Answer from @ThatCSFresher or @Bence will help you out in removing rows based on single column... Which is great!

However, I think there are multiple condition in your query needed to check across multiple columns at once in a loop. So, probably apply-lambda can do the job; Try the following code;

df = pd.DataFrame({"Name":["Jack","Anna","Carl","Jake"],
                   "Gender":["","F","M","M"],
                   "Age":[5,6,7,7],
                   "Address":["address","address","Undetermined","Unspecified"]})

df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise",axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]

# Output of df;
    Name    Gender  Age Address   Noise_Tag
0   Jack        5   address       Noise
1   Anna    F   6   address       No Noise
2   Carl    M   7   Undetermined  Noise
3   Jake    M   7   Unspecified   Noise

# Output of df1;
    Name    Gender  Age Address
1   Anna    F   6   address

CodePudding user response:

using lambda fun

Code:

df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]

Output:

    Name    Gender  Age Address
1   Anna    F       6   *add
  • Related