Home > Software design >  Dropping rows based on a string in a table
Dropping rows based on a string in a table

Time:11-21

Code to drop rows based on a partial string is not working.

Very simple code, and it runs fine but doesn't drop the rows I want.

The original table in the pdf looks like this:

Chemical Value Unit Type
Fluoride 0.23 ug/L Lab
Mercury 0.15 ug/L Lab
Sum of Long Chained Polymers 0.33
Partialsum of Short Chained Polymers 0.40

What I did:

import csv 
import tabula

dfs = tabula.read _pdf("Test.pdf", pages= 'all')
file = "Test.pdf"
tables = tabula.read_pdf(file, pages=2, stream=True, multiple_tables=True)

table1 = tables[1]
table1.drop('Unit', axis=1, inplace=True) 
table1.drop('Type', axis=1, inplace=True)
discard = ['sum','Sum']
table1[~table1.Chemical.str.contains('|'.join(discard))]
print(table1)
table1.to_csv('test.csv')

The results are that it drops the 2 columns I don't want, so that's fine. But it did not delete the rows with the words "sum" or "Sum" in them. Any insights?

CodePudding user response:

You are close. You did drop the rows, but you didn't save the result.

import pandas as pd

example = {'Chemical': ['Fluoride', 'Mercury', 'Sum of Long Chained Polymers',
                'Partialsum of Short Chained Polymers'], 
            'Value': [0.23, 0.15, 0.33, 0.4], 
            'Unit': ['ug/L', 'ug/L', '', ''], 
            'Type': ['Lab', 'Lab', '', '']}

table1 = pd.DataFrame(example)
table1.drop('Unit', axis=1, inplace=True)
table1.drop('Type', axis=1, inplace=True)
discard = ['sum','Sum']
table1 = table1[~table1.Chemical.str.contains('|'.join(discard))]
print(table1)

CodePudding user response:

You can use pd.Series.str.contains with the argument case=False to ignore case:

Also, it's not law, but often considered poor practice to use inplace=True... because in part it leads to confusions like the one you're experiencing.

Given df:

                               Chemical  Value  Unit  Type
0                              Fluoride   0.23  ug/L   Lab
1                               Mercury   0.15  ug/L   Lab
2          Sum of Long Chained Polymers   0.33   NaN   NaN
3  Partialsum of Short Chained Polymers   0.40   NaN   NaN

Doing:

df = (df.drop(['Unit', 'Type'], axis=1)
        .loc[~df.Chemical.str.contains('sum', case=False)])

Output:

   Chemical  Value
0  Fluoride   0.23
1   Mercury   0.15
  • Related