Is there a way, without reading the file twice, to check if a column exists otherwise use column names passed? I have files of the same structure but some do not contain a header for some reason.
Example with header:
Field1 Field2 Field3
data1 data2 data3
Example without header:
data1 data2 data3
When trying to use the example below, if the file has a header it will make it the first row instead of replacing the header.
pd.read_csv('filename.csv', names=col_names)
When trying to use the below, it will drop the first row of data of there is no header in the file.
pd.read_csv('filename.csv', header=0, names=col_names)
My current work around is to load the file, check if the columns exist or not, then if it doesn't read the file again.
df = pd.read_csv('filename.csv')
if `Field1` not in df.columns:
del df
df = pd.read_csv('filename.csv', names=col_names)
Is there a better way to handle this data set that doesn't involve potentially reading the file twice?
CodePudding user response:
Just modify your logic so the first time through only reads the first row:
# Load first row and setup keyword args if necessary
kw_args = {}
first = pd.read_csv('filename.csv', nrows=1)
if `Field1` not in first.columns:
kw_args["names"] = col_names
# Load data
df = pd.read_csv('filename.csv', **kw_args)
CodePudding user response:
You can do this with seek
method of file descriptor:
with open('filename.csv') as csvfile:
headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
csvfile.seek(0) # return file pointer to the beginning of the file
# do stuff here
if 'Field1' in headers:
...
else:
...
df = pd.read_csv(csvfile, ...)
The file is read only once.