Pandas: use column names if they do not exist

Is there a way, without reading the file twice, to check whether a header row exists and otherwise use the column names I pass in? I have files with the same structure, but some of them do not contain a header for some reason.

Example with header:

Field1 Field2 Field3
data1  data2  data3

Example without header:

data1  data2  data3

With the example below, if the file has a header, the header becomes the first row of data instead of being replaced.

pd.read_csv('filename.csv', names=col_names)

With the example below, the first row of data is dropped if there is no header in the file.

pd.read_csv('filename.csv', header=0, names=col_names)
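Both behaviours can be reproduced on in-memory data. This is a minimal sketch, assuming comma-separated input and the three column names from the examples above; the sample strings and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Assumed sample data: comma-separated versions of the two example files
col_names = ["Field1", "Field2", "Field3"]
with_header = "Field1,Field2,Field3\ndata1,data2,data3\n"
without_header = "data1,data2,data3\n"

# names= alone implies header=None, so an existing header line is kept as data
as_data = pd.read_csv(io.StringIO(with_header), names=col_names)

# header=0 together with names= always discards the first line
dropped = pd.read_csv(io.StringIO(without_header), header=0, names=col_names)

print(len(as_data))  # the header line is counted as a data row
print(len(dropped))  # the only data line was consumed as a header
```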

My current workaround is to load the file, check whether the expected columns exist, and if they don't, read the file again.

df = pd.read_csv('filename.csv')
if 'Field1' not in df.columns:
    del df
    df = pd.read_csv('filename.csv', names=col_names)

Is there a better way to handle this data set that doesn't involve potentially reading the file twice?

CodePudding user response：

Just modify your logic so that the first pass reads only the first row:

# Load first row and setup keyword args if necessary
kw_args = {}
first = pd.read_csv('filename.csv', nrows=1)
if 'Field1' not in first.columns:
    kw_args["names"] = col_names
# Load data
df = pd.read_csv('filename.csv', **kw_args)

CodePudding user response：

You can do this with the seek method of the file object:

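with open('filename.csv') as csvfile:
    # nrows=0 parses only the first line and returns it as column labels
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # return file pointer to the beginning of the file

    # Pass explicit names only when the first line is not the expected header
    if 'Field1' in headers:
        kw_args = {}
    else:
        kw_args = {'names': col_names}

    df = pd.read_csv(csvfile, **kw_args)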

The file is opened only once, and only the header line is parsed twice.
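The same idea can be wrapped in a small helper. This is only a sketch: `read_with_optional_header` is a hypothetical name, it assumes a seekable text handle with comma-separated data, and it treats the first line as a header only when it matches the expected names exactly.

```python
import io

import pandas as pd


def read_with_optional_header(handle, col_names):
    """Read CSV data from a seekable handle, supplying names if no header."""
    # nrows=0 parses only the first line and returns it as column labels
    first_line = pd.read_csv(handle, nrows=0).columns.tolist()
    handle.seek(0)  # rewind so the full read starts from the first line

    if first_line == list(col_names):
        return pd.read_csv(handle)  # first line is the header
    return pd.read_csv(handle, names=col_names)  # first line is data


# Assumed sample inputs standing in for the two kinds of file
col_names = ["Field1", "Field2", "Field3"]
df_a = read_with_optional_header(
    io.StringIO("Field1,Field2,Field3\ndata1,data2,data3\n"), col_names
)
df_b = read_with_optional_header(io.StringIO("data1,data2,data3\n"), col_names)
```

For a file on disk, the handle from `open('filename.csv')` can be passed in the same way.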