Pandas: use the passed column names if the header does not exist

Time:11-30

Is there a way, without reading the file twice, to check whether a header row exists and, if it does not, use the column names I pass in? I have files with the same structure, but some of them lack a header for some reason.

Example with header:

Field1 Field2 Field3
data1  data2  data3

Example without header:

data1  data2  data3

When I use the call below on a file that has a header, the header ends up as the first data row instead of being replaced.

pd.read_csv('filename.csv', names=col_names)

When I use the call below, the first row of data is dropped if there is no header in the file.

pd.read_csv('filename.csv', header=0, names=col_names)
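
The two behaviours described above can be reproduced without any files on disk. The following is a minimal sketch that assumes comma-separated content and uses in-memory buffers in place of 'filename.csv'; the sample strings and the variable names are illustrative only.

```python
import io

import pandas as pd

col_names = ["Field1", "Field2", "Field3"]

# Illustrative file contents (assumed to be comma-separated)
with_header = "Field1,Field2,Field3\ndata1,data2,data3\n"
without_header = "data1,data2,data3\n"

# names= alone: the header line is kept and becomes the first data row
a = pd.read_csv(io.StringIO(with_header), names=col_names)

# header=0 plus names=: the first line is always discarded,
# so a headerless file loses its first row of data
b = pd.read_csv(io.StringIO(without_header), header=0, names=col_names)

print(a)
print(b)
```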

My current workaround is to load the file, check whether the expected columns exist, and if they don't, read the file again.

df = pd.read_csv('filename.csv')
if 'Field1' not in df.columns:
    del df
    df = pd.read_csv('filename.csv', names=col_names)

Is there a better way to handle this data set that doesn't involve potentially reading the file twice?

CodePudding user response:

Just modify your logic so that the first pass reads only the first row:

# Load first row and setup keyword args if necessary
kw_args = {}
first = pd.read_csv('filename.csv', nrows=1)
if 'Field1' not in first.columns:
    kw_args["names"] = col_names
# Load data
df = pd.read_csv('filename.csv', **kw_args)
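
For reuse across many files, the same idea could be wrapped in a small function. This is only a sketch: the helper name read_csv_optional_header is hypothetical, the temporary files stand in for the real data, and it assumes that the first expected column name never appears as a value in the first data row.

```python
import os
import tempfile

import pandas as pd


def read_csv_optional_header(path, col_names):
    """Hypothetical helper: fall back to col_names when the header is missing."""
    kw_args = {}
    # Peek at the first line only; pandas treats it as the header
    first = pd.read_csv(path, nrows=1)
    if col_names[0] not in first.columns:
        # The first line was data, so supply the names explicitly
        kw_args["names"] = col_names
    return pd.read_csv(path, **kw_args)


col_names = ["Field1", "Field2", "Field3"]

with tempfile.TemporaryDirectory() as tmp:
    with_header = os.path.join(tmp, "with_header.csv")
    without_header = os.path.join(tmp, "without_header.csv")
    with open(with_header, "w") as f:
        f.write("Field1,Field2,Field3\ndata1,data2,data3\n")
    with open(without_header, "w") as f:
        f.write("data1,data2,data3\n")

    df_a = read_csv_optional_header(with_header, col_names)
    df_b = read_csv_optional_header(without_header, col_names)

print(df_a)
print(df_b)
```

Both calls should yield the same frame, since only the first line of each file is inspected before the full load.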

CodePudding user response:

You can do this with the seek method of the file object:

with open('filename.csv') as csvfile:
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # return file pointer to the beginning of the file

    # do stuff here
    if 'Field1' in headers:
        ...
    else:
        ...

    df = pd.read_csv(csvfile, ...)

This way the file is opened only once, and only the header line is parsed a second time.
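
One possible way to fill in the branches is sketched below. The function name read_from_handle is hypothetical, and what each branch does (a plain read versus a read with names=) is an assumption, not something stated in the answer. An in-memory io.StringIO stands in for the opened file, since it supports seek in the same way.

```python
import io

import pandas as pd


def read_from_handle(handle, col_names):
    """Hypothetical helper: inspect the header, rewind, then load once."""
    # nrows=0 parses only the first line and returns it as column labels
    headers = pd.read_csv(handle, nrows=0).columns.tolist()
    handle.seek(0)  # rewind so the full read starts at the beginning

    if col_names[0] in headers:
        # Header present: let pandas use it
        return pd.read_csv(handle)
    # Header missing: the first line is data, so pass the names explicitly
    return pd.read_csv(handle, names=col_names)


col_names = ["Field1", "Field2", "Field3"]
df_with = read_from_handle(
    io.StringIO("Field1,Field2,Field3\ndata1,data2,data3\n"), col_names
)
df_without = read_from_handle(io.StringIO("data1,data2,data3\n"), col_names)

print(df_with)
print(df_without)
```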
