Finding the rows that have mostly 0 values in a pandas DataFrame

Time:03-12

I am working on data preprocessing for a neural network regression model. In my raw data, some station IDs have more blank or NaN values than integers or real numbers. How should I deal with this? Should I simply delete those rows? If so, how do I find these station IDs and delete their rows?

station_Id  Avg_temp  Max_Temp  rel_hum  avg wind
105         0                            1.4
198                   0         1        8.4
788         122       7         4        47

The table above is just a small part of my dataset, which has 164040 rows × 12 columns. How can I find these rows?

CodePudding user response:

df.dropna(subset=["Avg_temp"], inplace=True)

will drop rows where Avg_temp is NaN.

df["Avg_temp"] = df["Avg_temp"].fillna(df["Avg_temp"].mean())

will fill NaN values in Avg_temp with the mean temperature instead. The same works for the median, etc.

CodePudding user response:

The data is incomplete, so ultimately it won't make sense to use it as input for a neural network. I suggest removing the incomplete rows with df.dropna().
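
Before dropping anything, it may help to check how many rows a plain df.dropna() would remove. The sketch below uses the three sample rows from the question and assumes the blank cells are NaN; the names rows_with_gaps, n_lost and cleaned are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; blank cells are assumed to be NaN.
df = pd.DataFrame({
    "station_Id": [105, 198, 788],
    "Avg_temp": [0, np.nan, 122],
    "Max_Temp": [np.nan, 0, 7],
    "rel_hum": [np.nan, 1, 4],
    "avg wind": [1.4, 8.4, 47],
})

# True for every row that has at least one missing value.
rows_with_gaps = df.isna().any(axis=1)
n_lost = int(rows_with_gaps.sum())
print(f"{n_lost} of {len(df)} rows contain at least one NaN")

# dropna() with no arguments removes exactly those rows.
cleaned = df.dropna()
print(cleaned)
```

On a dataset where most rows have at least one gap, this check shows how much data a blanket dropna() would discard.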

https://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/

CodePudding user response:

First, you should consider some form of feature engineering so that your fields give the model proper signals, in addition to other techniques such as dimensionality reduction or handling class imbalance. With only the data that has been shown to us, it is an empirical question.

About dropping rows with zeros, see the question "Drop rows with all zeros in pandas data frame".
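
As a rough sketch of the zero-based approach, the snippet below counts the zeros in each row and drops only the rows whose measurements are all zero. Station 999 is a hypothetical all-zero row added for illustration, and the blank cells from the question are assumed to be NaN.

```python
import numpy as np
import pandas as pd

# Sample rows from the question plus a hypothetical all-zero station (999).
df = pd.DataFrame({
    "station_Id": [105, 198, 788, 999],
    "Avg_temp": [0, np.nan, 122, 0],
    "Max_Temp": [np.nan, 0, 7, 0],
    "rel_hum": [np.nan, 1, 4, 0],
    "avg wind": [1.4, 8.4, 47, 0],
})

# Only the measurement columns are checked, not the station ID.
value_cols = df.columns.drop("station_Id")

# NaN == 0 evaluates to False, so missing cells are not counted as zeros.
zero_count = (df[value_cols] == 0).sum(axis=1)
print(zero_count)

# Keep every row that is not zero in all measurement columns.
all_zero = zero_count == len(value_cols)
df_without_all_zero = df[~all_zero]
print(df_without_all_zero)
```

Replacing the all_zero condition with something like zero_count > len(value_cols) / 2 would instead flag rows that are mostly zeros.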

CodePudding user response:

To drop rows with fewer than a certain number of real data values, use df.dropna with thresh. I added another column so that I could keep all rows with at least 3 data values.

import pandas as pd
import numpy as np

# Sample data from the question, plus one extra column
df = pd.DataFrame({"station_Id": [105, 198, 788], "Avg_temp": [0, np.nan, 122], "Max_Temp": [np.nan, 0, 7],
                   "rel_hum": [np.nan, 1, 4], "avg wind": [1.4, 8.4, 47], "another_column": [np.nan, np.nan, 5]})
df.set_index("station_Id", inplace=True)
# Require non-NaN values in at least half of the columns: ceil(5 / 2) = 3
my_threshold = int(np.ceil(df.shape[1] / 2))
print(df.shape[0])  # prints 3
df.dropna(thresh=my_threshold, inplace=True)
print(df.shape[0])  # prints 2
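
To see which station IDs fall below such a threshold before deleting them, one option is to count the non-missing values per row. This is a minimal sketch on the question's four measurement columns; min_values = 3 is an arbitrary cut-off chosen for illustration, and blank cells are assumed to be NaN.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, indexed by station ID.
df = pd.DataFrame({
    "station_Id": [105, 198, 788],
    "Avg_temp": [0, np.nan, 122],
    "Max_Temp": [np.nan, 0, 7],
    "rel_hum": [np.nan, 1, 4],
    "avg wind": [1.4, 8.4, 47],
}).set_index("station_Id")

# Number of non-NaN values in each row.
non_missing = df.notna().sum(axis=1)

# Hypothetical cut-off: a row needs at least this many real values to be kept.
min_values = 3
sparse_ids = non_missing[non_missing < min_values].index.tolist()
print("Stations with too few values:", sparse_ids)

# Dropping those IDs keeps the same rows as df.dropna(thresh=min_values).
kept = df.drop(index=sparse_ids)
print(kept)
```

Listing the IDs first makes it possible to review the affected stations before removing them.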

However, for machine learning you should try imputation to fill in the missing data. For example, you could fill missing values with the mean of the other observations. Talk with an expert in your domain about which method makes the most sense.
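
A minimal sketch of mean imputation on the sample rows might look like the following. It assumes the blank cells are NaN and fills each column with that column's own mean; whether the mean is an appropriate choice depends on the data.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, indexed by station ID.
df = pd.DataFrame({
    "station_Id": [105, 198, 788],
    "Avg_temp": [0, np.nan, 122],
    "Max_Temp": [np.nan, 0, 7],
    "rel_hum": [np.nan, 1, 4],
    "avg wind": [1.4, 8.4, 47],
}).set_index("station_Id")

# df.mean() skips NaN by default, so each column mean uses only observed values.
column_means = df.mean()

# fillna with a Series fills each column with the value matching its label.
imputed = df.fillna(column_means)
print(imputed)
```

When the data is later split into training and test sets, computing the means on the training portion only avoids leaking information from the test rows.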
