What I'm doing:

I am reading a csv file into python using pandas. A valid entry will always start with a county.

The issue

The issue I'm having is that there is some issue with the csv file that's invalidating 1.3k lines. Some lines end with too many commas , some lines start with nonsense, some lines have additional BS, some lines have been returned halfway through the entry which causes the next line to be nonsense etc

My question

As I've previously stated, each valid line should start with one of the entries contained in counties. So I thought there might be some way to validate the first entry of each line and check if that entry 'is in counties'. If it is, add it. If not, ignore it.

Is it possible to use one of the arguments within the .read_csv function to overcome the issue? Is there some magic I can do with a lambda function to pull this off, (like I tried above)? Maybe you could suggest a better way?

Please do not suggest:

Modifying the CSV file by hand / directly. Yes I know I can do that but it's not what I'm asking & is not reasonable or viable for large datasets or if calling requests. My goal here is to learn & tackle the issue pythonicly

What I've tried:

I thought trying this would work:

counties = ["Bedfordshire", "Berkshire", "Bristol", ... , "Worcestershire"]

df = pd.read_csv('data/libraries.csv', skiprows=(lambda x: x not in counties))

but alas not.

I have also tried using the built in arguments to solve this issue, for eg lineterminator='\n', skipblanklines=True and a few other things to try to 'constrain' the read and prevent the error but this doesn't work

Small sample of csv:

Library service,Library name,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Notes,Weekly hours open,Weekly hours staffed
Barking and Dagenham,Barking,Yes,Yes,LAL,,,,,,72.0,72.0
Barking and Dagenham,Castle Green,Yes,No,,XL,Mar-13,,,Closed as a public library 31.3.2013. Secondary School library Jo Richardson School,,
Barking and Dagenham,Dagenham,No,Yes,LAL,,,Oct-10,Yes,Replaced Fanshawe and Rectory libraries,56.0,56.0
...
Surrey,Stanwell  ,yes,yes,CRL,,,2012,Yes,Rebuilt on site of old library. Dec 2011 - Jan 2012 Closed 4 weeks to move to new library.,,
Opening hours reduced from 30 to 27 from 5 Sep 2016. Now closed at lunch (1pm-2pm),27.0,yes,,,,,,,,,
Surrey,Stoneleigh ,yes,yes,CRL,,,,,CRL from Feb 2013,0.0,0.0
Surrey,Sunbury ,yes,yes,LAL,,,,,Opening hours reduced from 42 to 39.5 from 5 Sep 2016,39.5,39.5
Surrey,Tattenhams ,yes,yes,CRL,,,,,April 2010 closed for 5 weks for building work.,,
CRL from Nov 2012,0.0,no,,,,,,,,,
...
Worcestershire,Wythall Library,yes,no,,XLR,2016,,,,,
Worcestershire,Wythall Library,yes,yes,LAL,,,2016,"yes, replaced Wythall library",,31.5,31.5
York,Acomb,yes,yes,CL,,,,,,58.5,58.5
York,Bishopthorpe,yes,yes,CL,,,,,,21.5,21.5

In the middle section you can see that one line starts with "Opening hours reduced from..." another starts with "CRL from Nov...". These are the artefacts that I assume are causing massive issues

CodePudding user response：

You could use a wrapper to filter the lines before feeding to read_csv:

from io import StringIO

def filter_valid(fname, valid):
    def wrapper(fname, valid):
        first = True
        with open(fname) as f:
            for line in f:
                if first or line.split(',', 1)[0] in valid:
                    yield line
                first = False
    return StringIO(''.join(wrapper(fname, valid)))

df = pd.read_csv(filter_valid('libraries.csv', set(counties)), on_bad_lines='skip')

print(df)

Output:

  Library service     Library name In use 2010 In use 2016 Type of library Type of closed library  Closed  New building               Replace existing                                                                                       Notes  Weekly hours open  Weekly hours staffed
0          Surrey       Stanwell           yes         yes             CRL                    NaN     NaN        2012.0                            Yes  Rebuilt on site of old library. Dec 2011 - Jan 2012 Closed 4 weeks to move to new library.                NaN                   NaN
1          Surrey      Stoneleigh          yes         yes             CRL                    NaN     NaN           NaN                            NaN                                                                           CRL from Feb 2013                0.0                   0.0
2          Surrey         Sunbury          yes         yes             LAL                    NaN     NaN           NaN                            NaN                                       Opening hours reduced from 42 to 39.5 from 5 Sep 2016               39.5                  39.5
3          Surrey      Tattenhams          yes         yes             CRL                    NaN     NaN           NaN                            NaN                                             April 2010 closed for 5 weks for building work.                NaN                   NaN
4  Worcestershire  Wythall Library         yes          no             NaN                    XLR  2016.0           NaN                            NaN                                                                                         NaN                NaN                   NaN
5  Worcestershire  Wythall Library         yes         yes             LAL                    NaN     NaN        2016.0  yes, replaced Wythall library                                                                                         NaN               31.5                  31.5

CodePudding user response：

Thanks to @Barmar for the suggestion I can do:

# Create an empty list to store the valid rows
valid_rows = []

# Open the CSV file and read the rows
with open("data/libraries.csv", "r") as file:
    reader = csv.reader(file)
    header = True
    for row in reader:
        if header:
            columns = row
            header = False
        elif row[0] in counties:
            valid_rows.append(row)

df = pd.DataFrame(valid_rows, columns=columns)

Another thing I realised I could do was the following

# Define the custom function to apply to the first column
def validate_county(value):
    if value in counties:
        return value
    else:
        return pd.NaT
    
df = pd.read_csv('data/libraries.csv', converters={0: validate_county})

However, this only adds a NaT entry to the first column & doesn't prevent the entire entry from being entered into the dataframe therefore bypassing the errors caused by the csv. No error_bad_lines=False cannot be used and doesn't prevent lines from being skipped. Maybe someone has a suggestion on the latter I was working on before Barmers suggestion however the former works as intended