How to filter rows based on cell contents in a row-based expression

Time:12-30

I read some data from a file. The first column gets the object dtype because of the 'xxx' in the very first data row:

tips = pd.read_csv("tips.csv")
print(tips.head())
print(tips.info())

  total_bill   tip     sex smoker  day    time  size
0        xxx  1.01  Female     No  Sun  Dinner     2    
1      10.34  1.66    Male     No  Sun  Dinner     3    
2      21.01  3.50    Male     No  Sun  Dinner     3    
3      23.68  3.31    Male     No  Sun  Dinner     2    
4      24.59  3.61  Female     No  Sun  Dinner     4    
<class 'pandas.core.frame.DataFrame'>    
RangeIndex: 244 entries, 0 to 243    
Data columns (total 7 columns):    
 #   Column      Non-Null Count  Dtype      
---  ------      --------------  -----      
 0   total_bill  244 non-null    object     
 1   tip         244 non-null    float64    
 2   sex         244 non-null    object     
 3   smoker      244 non-null    object     
 4   day         244 non-null    object     
 5   time        244 non-null    object     
 6   size        244 non-null    int64 

So this fails because of that one 'xxx' in the first data row, where a number should be:

tips['tip_pct'] = tips['tip'] / (tips['total_bill'] - tips['tip'])

How do I rewrite the above line to filter out the bad row, without actually changing the contents of the DataFrame?

CodePudding user response:

You can wrap the column that contains the 'xxx' in pd.to_numeric with errors='coerce'. Any value that cannot be parsed as a number becomes NaN in the temporary result, so the arithmetic works and the total_bill column in your DataFrame stays unchanged.

tips['tip_pct'] = tips['tip'] / (pd.to_numeric(tips['total_bill'], errors='coerce') - tips['tip'])

  total_bill   tip     sex smoker  day    time  size   tip_pct
0        xxx  1.01  Female     No  Sun  Dinner     2       NaN
1      10.34  1.66    Male     No  Sun  Dinner     3  0.191244
2      21.01  3.50    Male     No  Sun  Dinner     3  0.199886
3      23.68  3.31    Male     No  Sun  Dinner     2  0.162494
4      24.59  3.61  Female     No  Sun  Dinner     4  0.172069
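
To try the coercion without the original file, here is a minimal sketch. The three-row DataFrame is a hypothetical stand-in for tips.csv, and the variable name bill is only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv: total_bill holds strings, one of them bad.
tips = pd.DataFrame({
    "total_bill": ["xxx", "10.34", "21.01"],
    "tip": [1.01, 1.66, 3.50],
})

# Unparseable entries such as 'xxx' become NaN in this temporary Series only.
bill = pd.to_numeric(tips["total_bill"], errors="coerce")
tips["tip_pct"] = tips["tip"] / (bill - tips["tip"])

print(tips)
print(tips["total_bill"].tolist())  # the source column still holds the original strings
```

Row 0 ends up with NaN in tip_pct because NaN propagates through the subtraction and the division.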

CodePudding user response:

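Another way: build a boolean mask that excludes the bad row, cast the remaining total_bill values to float, and compute on those rows only. Rows excluded by the mask get NaN when the result is assigned back.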

m = tips['total_bill'] != 'xxx'
tips['tip_pct'] = tips.loc[m, 'tip'] / (tips.loc[m, 'total_bill'].astype(float) - tips.loc[m, 'tip'])

   total_bill   tip     sex smoker  day    time  size   tip_pct
0        xxx  1.01  Female     No  Sun  Dinner     2       NaN
1      10.34  1.66    Male     No  Sun  Dinner     3  0.191244
2      21.01  3.50    Male     No  Sun  Dinner     3  0.199886
3      23.68  3.31    Male     No  Sun  Dinner     2  0.162494
4      24.59  3.61  Female     No  Sun  Dinner     4  0.172069
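
The NaN in row 0 comes from index alignment on assignment, which the following sketch makes visible. The sample DataFrame and the names good and pct are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for tips.csv.
tips = pd.DataFrame({
    "total_bill": ["xxx", "10.34", "21.01"],
    "tip": [1.01, 1.66, 3.50],
})

m = tips["total_bill"] != "xxx"   # True for rows that hold a number
good = tips.loc[m]                # rows 1 and 2 only
pct = good["tip"] / (good["total_bill"].astype(float) - good["tip"])

# Assignment aligns on the index: row 0 is absent from pct, so it gets NaN.
tips["tip_pct"] = pct
print(tips)
```

This approach only works when every bad value is known in advance and matched by the mask; any other non-numeric string would make astype(float) raise.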

CodePudding user response:

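Handle it in read_csv. Forcing the dtype alone raises an error on 'xxx', so also declare 'xxx' as a missing value with na_values. Note that, unlike the other answers, this stores NaN instead of 'xxx' in the DataFrame.

import numpy as np

# Treat 'xxx' as missing so total_bill can be parsed as float64
tips = pd.read_csv('tips.csv',
    na_values=['xxx'],
    dtype={'total_bill': np.float64})

tips['tip_pct'] = tips['tip'] / (tips['total_bill'] - tips['tip'])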
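
A read-time fix can be checked without the file by parsing an in-memory CSV. This is a sketch under assumptions: the sample rows and the csv_text and dtype_alone_failed names are hypothetical, and na_values is one possible way to make the read succeed.

```python
import io
import pandas as pd

# In-memory stand-in for tips.csv (hypothetical sample rows).
csv_text = "total_bill,tip\nxxx,1.01\n10.34,1.66\n21.01,3.50\n"

# Forcing float64 alone fails, because 'xxx' cannot be parsed as a number.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"total_bill": "float64"})
    dtype_alone_failed = False
except ValueError:
    dtype_alone_failed = True

# Declaring 'xxx' as a missing-value marker lets the column parse as float64.
tips = pd.read_csv(io.StringIO(csv_text), na_values=["xxx"])
tips["tip_pct"] = tips["tip"] / (tips["total_bill"] - tips["tip"])
print(tips)
```

The trade-off is that the original 'xxx' text is no longer available in the DataFrame, since it is replaced by NaN at load time.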