How to filter rows based on cell contents in a row-based expression

I read some data from a file. The first column is assigned the 'object' dtype because of the 'xxx' in the very first data row:

import pandas as pd

tips = pd.read_csv("tips.csv")
print(tips.head())
print(tips.info())

   total_bill   tip     sex smoker  day    time  size
0        xxx  1.01  Female     No  Sun  Dinner     2    
1      10.34  1.66    Male     No  Sun  Dinner     3    
2      21.01  3.50    Male     No  Sun  Dinner     3    
3      23.68  3.31    Male     No  Sun  Dinner     2    
4      24.59  3.61  Female     No  Sun  Dinner     4    
<class 'pandas.core.frame.DataFrame'>    
RangeIndex: 244 entries, 0 to 243    
Data columns (total 7 columns):    
 #   Column      Non-Null Count  Dtype      
---  ------      --------------  -----      
 0   total_bill  244 non-null    object     
 1   tip         244 non-null    float64    
 2   sex         244 non-null    object     
 3   smoker      244 non-null    object     
 4   day         244 non-null    object     
 5   time        244 non-null    object     
 6   size        244 non-null    int64

So this will fail because of that one 'xxx' in the first data row, where a number should be:

tips['tip_pct'] = tips['tip'] / (tips['total_bill'] - tips['tip'])

How do I rewrite the above line to filter out the bad row, without actually changing the contents of the DataFrame?
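
For anyone without tips.csv at hand, the failure can be reproduced with a small in-memory file. This is only a sketch: the CSV text below is a hypothetical stand-in for the first rows of the real file, and the exact error message may vary between pandas versions.

```python
import io

import pandas as pd

# Hypothetical stand-in for the first rows of tips.csv, including the bad 'xxx' cell.
csv_text = """total_bill,tip,sex,smoker,day,time,size
xxx,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
"""

tips = pd.read_csv(io.StringIO(csv_text))

# One unparseable cell makes the whole column non-numeric:
# every value in total_bill is read as a string, not just 'xxx'.
print(tips["total_bill"].dtype)

try:
    tips["tip"] / (tips["total_bill"] - tips["tip"])
    failed = False
except TypeError as exc:
    # Subtracting a float from a string is not defined.
    failed = True
    print("TypeError:", exc)
```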

CodePudding user response：

You can wrap the column that contains 'xxx' in pd.to_numeric with errors='coerce'. Any value that cannot be parsed as a number becomes NaN in the returned Series, so the calculation runs and the original total_bill column is left unchanged.

tips['tip_pct'] = tips['tip'] / (pd.to_numeric(tips['total_bill'], errors='coerce') - tips['tip'])

   total_bill   tip     sex smoker  day    time  size   tip_pct
0        xxx  1.01  Female     No  Sun  Dinner     2       NaN
1      10.34  1.66    Male     No  Sun  Dinner     3  0.191244
2      21.01  3.50    Male     No  Sun  Dinner     3  0.199886
3      23.68  3.31    Male     No  Sun  Dinner     2  0.162494
4      24.59  3.61  Female     No  Sun  Dinner     4  0.172069
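
To check that the coercion really leaves the original column alone, here is a minimal self-contained sketch. The small DataFrame is a hypothetical sample that mirrors the first rows from the question (with total_bill held as strings, as read_csv would produce), and the name `bill` is introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows shown in the question.
tips = pd.DataFrame({
    "total_bill": ["xxx", "10.34", "21.01"],
    "tip": [1.01, 1.66, 3.50],
})

# to_numeric returns a new Series; entries that cannot be parsed become NaN.
bill = pd.to_numeric(tips["total_bill"], errors="coerce")

# NaN propagates through the arithmetic, so the bad row gets NaN in tip_pct.
tips["tip_pct"] = tips["tip"] / (bill - tips["tip"])

print(tips)
```

The 'xxx' string is still present in tips['total_bill'] afterwards; only the new tip_pct column has been added.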

CodePudding user response：

Another way: build a boolean mask that excludes the bad row, cast total_bill to float on the remaining rows only, and compute:

m = tips['total_bill'] != 'xxx'
tips['tip_pct'] = tips.loc[m, 'tip'] / (tips.loc[m, 'total_bill'].astype(float) - tips.loc[m, 'tip'])

   total_bill   tip     sex smoker  day    time  size   tip_pct
0        xxx  1.01  Female     No  Sun  Dinner     2       NaN
1      10.34  1.66    Male     No  Sun  Dinner     3  0.191244
2      21.01  3.50    Male     No  Sun  Dinner     3  0.199886
3      23.68  3.31    Male     No  Sun  Dinner     2  0.162494
4      24.59  3.61  Female     No  Sun  Dinner     4  0.172069
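
Why the masked-out row ends up as NaN may not be obvious, so here is a short sketch of what happens step by step. The sample DataFrame and the names `pct` and `valid` are assumptions made for illustration. The same mask can also be used to get a filtered copy without touching the original.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows shown in the question.
tips = pd.DataFrame({
    "total_bill": ["xxx", "10.34", "21.01"],
    "tip": [1.01, 1.66, 3.50],
})

m = tips["total_bill"] != "xxx"

# Computed only on rows where the mask is True (index labels 1 and 2).
pct = tips.loc[m, "tip"] / (tips.loc[m, "total_bill"].astype(float) - tips.loc[m, "tip"])

# Column assignment aligns on index labels, so the absent label 0 is filled with NaN.
tips["tip_pct"] = pct

# A filtered copy without the bad row; tips itself keeps all of its rows.
valid = tips[m]

print(tips)
print(valid)
```

Note that this mask only knows about the literal 'xxx'. Any other non-numeric value in the column would still make astype(float) raise.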

CodePudding user response：

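From read_csv: declare 'xxx' as a missing-value marker with na_values so that total_bill can be parsed as float64. Forcing the dtype alone raises a ValueError when the parser reaches 'xxx'. Note that this approach stores NaN in place of 'xxx' in the DataFrame.

import numpy as np
import pandas as pd

tips = pd.read_csv('tips.csv',
                   na_values=['xxx'],
                   dtype={'total_bill': np.float64})

tips['tip_pct'] = tips['tip'] / (tips['total_bill'] - tips['tip'])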
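
A hedged sketch of handling the bad value at read time follows. It assumes 'xxx' is the only bad marker in the file and uses an in-memory CSV as a hypothetical stand-in for tips.csv. It relies on the na_values parameter of read_csv. Requesting a float dtype without it fails, which the first part demonstrates.

```python
import io

import pandas as pd

# Hypothetical stand-in for tips.csv, reduced to the two relevant columns.
csv_text = """total_bill,tip
xxx,1.01
10.34,1.66
21.01,3.50
"""

# Requesting a float dtype alone does not work: 'xxx' cannot be converted.
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"total_bill": float})
    dtype_only_failed = False
except ValueError as exc:
    dtype_only_failed = True
    print("ValueError:", exc)

# Declaring 'xxx' as a missing-value marker lets the column parse as float64.
tips = pd.read_csv(io.StringIO(csv_text), na_values=["xxx"])
tips["tip_pct"] = tips["tip"] / (tips["total_bill"] - tips["tip"])

print(tips.dtypes)
print(tips)
```

Unlike the other approaches, this changes what is stored: the 'xxx' cell is loaded as NaN, so it only fits if losing the original string is acceptable.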