What I'm doing:
I am reading a csv file into python using pandas. A valid entry will always start with a county.
The issue
The issue I'm having is that there is some issue with the csv file that's invalidating 1.3k lines. Some lines end with too many commas , some lines start with nonsense, some lines have additional BS, some lines have been returned halfway through the entry which causes the next line to be nonsense etc
My question
As I've previously stated, each valid line should start with one of the entries contained in counties. So I thought there might be some way to validate the first entry of each line and check if that entry 'is in counties'. If it is, add it. If not, ignore it.
Is it possible to use one of the arguments within the .read_csv function to overcome the issue? Is there some magic I can do with a lambda function to pull this off, (like I tried above)? Maybe you could suggest a better way?
Please do not suggest:
Modifying the CSV file by hand / directly. Yes I know I can do that but it's not what I'm asking & is not reasonable or viable for large datasets or if calling requests. My goal here is to learn & tackle the issue pythonicly
What I've tried:
I thought trying this would work:
counties = ["Bedfordshire", "Berkshire", "Bristol", ... , "Worcestershire"]
df = pd.read_csv('data/libraries.csv', skiprows=(lambda x: x not in counties))
but alas not.
I have also tried using the built in arguments to solve this issue, for eg lineterminator='\n', skipblanklines=True and a few other things to try to 'constrain' the read and prevent the error but this doesn't work
Small sample of csv:
Library service,Library name,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Notes,Weekly hours open,Weekly hours staffed
Barking and Dagenham,Barking,Yes,Yes,LAL,,,,,,72.0,72.0
Barking and Dagenham,Castle Green,Yes,No,,XL,Mar-13,,,Closed as a public library 31.3.2013. Secondary School library Jo Richardson School,,
Barking and Dagenham,Dagenham,No,Yes,LAL,,,Oct-10,Yes,Replaced Fanshawe and Rectory libraries,56.0,56.0
...
Surrey,Stanwell ,yes,yes,CRL,,,2012,Yes,Rebuilt on site of old library. Dec 2011 - Jan 2012 Closed 4 weeks to move to new library.,,
Opening hours reduced from 30 to 27 from 5 Sep 2016. Now closed at lunch (1pm-2pm),27.0,yes,,,,,,,,,
Surrey,Stoneleigh ,yes,yes,CRL,,,,,CRL from Feb 2013,0.0,0.0
Surrey,Sunbury ,yes,yes,LAL,,,,,Opening hours reduced from 42 to 39.5 from 5 Sep 2016,39.5,39.5
Surrey,Tattenhams ,yes,yes,CRL,,,,,April 2010 closed for 5 weks for building work.,,
CRL from Nov 2012,0.0,no,,,,,,,,,
...
Worcestershire,Wythall Library,yes,no,,XLR,2016,,,,,
Worcestershire,Wythall Library,yes,yes,LAL,,,2016,"yes, replaced Wythall library",,31.5,31.5
York,Acomb,yes,yes,CL,,,,,,58.5,58.5
York,Bishopthorpe,yes,yes,CL,,,,,,21.5,21.5
In the middle section you can see that one line starts with "Opening hours reduced from..." another starts with "CRL from Nov...". These are the artefacts that I assume are causing massive issues
CodePudding user response:
You could use a wrapper to filter the lines before feeding to read_csv
:
from io import StringIO
def filter_valid(fname, valid):
def wrapper(fname, valid):
first = True
with open(fname) as f:
for line in f:
if first or line.split(',', 1)[0] in valid:
yield line
first = False
return StringIO(''.join(wrapper(fname, valid)))
df = pd.read_csv(filter_valid('libraries.csv', set(counties)), on_bad_lines='skip')
print(df)
Output:
Library service Library name In use 2010 In use 2016 Type of library Type of closed library Closed New building Replace existing Notes Weekly hours open Weekly hours staffed
0 Surrey Stanwell yes yes CRL NaN NaN 2012.0 Yes Rebuilt on site of old library. Dec 2011 - Jan 2012 Closed 4 weeks to move to new library. NaN NaN
1 Surrey Stoneleigh yes yes CRL NaN NaN NaN NaN CRL from Feb 2013 0.0 0.0
2 Surrey Sunbury yes yes LAL NaN NaN NaN NaN Opening hours reduced from 42 to 39.5 from 5 Sep 2016 39.5 39.5
3 Surrey Tattenhams yes yes CRL NaN NaN NaN NaN April 2010 closed for 5 weks for building work. NaN NaN
4 Worcestershire Wythall Library yes no NaN XLR 2016.0 NaN NaN NaN NaN NaN
5 Worcestershire Wythall Library yes yes LAL NaN NaN 2016.0 yes, replaced Wythall library NaN 31.5 31.5
CodePudding user response:
Thanks to @Barmar for the suggestion I can do:
# Create an empty list to store the valid rows
valid_rows = []
# Open the CSV file and read the rows
with open("data/libraries.csv", "r") as file:
reader = csv.reader(file)
header = True
for row in reader:
if header:
columns = row
header = False
elif row[0] in counties:
valid_rows.append(row)
df = pd.DataFrame(valid_rows, columns=columns)
Another thing I realised I could do was the following
# Define the custom function to apply to the first column
def validate_county(value):
if value in counties:
return value
else:
return pd.NaT
df = pd.read_csv('data/libraries.csv', converters={0: validate_county})
However, this only adds a NaT entry to the first column & doesn't prevent the entire entry from being entered into the dataframe therefore bypassing the errors caused by the csv. No error_bad_lines=False cannot be used and doesn't prevent lines from being skipped. Maybe someone has a suggestion on the latter I was working on before Barmers suggestion however the former works as intended