This is in the context of reading .csv files. I want a set of "required" columns that pandas will always read and parse, and a set of "optional" columns that pandas will extract only if they are present in the .csv file; otherwise it should skip them (I am looping through many different .csv files and only want to keep the ones that meet these criteria).
Let's say I have data as below:
| required_col1 | required_col2 | optional_col1 | optional_col2
---|---|---|---|---
0 | 1 | 4 | a | d
1 | 2 | 5 | b | e
2 | 3 | 6 | c | f
I would like pandas to require a dataset to have required_col1 and required_col2. If optional_col1 is in the dataset, read it; if it is not, do not raise an error and just move on. The same goes for optional_col2. However, if the required columns are not in the dataset, I want pandas to raise an error.
I have attempted to use a lambda function, as such:
columnsList = ['required_col1', 'required_col2', 'optional_col1', 'optional_col2']
data = pd.read_csv('file.csv', usecols=lambda x: x in columnsList)
But this doesn't work, since it treats every listed column as optional: if a dataset has 'optional_col1' and 'required_col2', it will still be read without raising an error (even though 'required_col1' is not in the data).
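To make the failure concrete, here is a minimal reproduction (using `io.StringIO` in place of a real file, with hypothetical column names matching the example): a CSV missing `required_col1` is still read without complaint, because the callable form of `usecols` simply filters whatever columns happen to exist:

```python
import io
import pandas as pd

columnsList = ['required_col1', 'required_col2', 'optional_col1', 'optional_col2']

# a CSV that is missing required_col1 entirely
csv = io.StringIO("required_col2,optional_col1\n1,4\n2,5\n")

data = pd.read_csv(csv, usecols=lambda x: x in columnsList)
print(list(data.columns))  # ['required_col2', 'optional_col1'] -- no error raised
```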
I've even tried this:
optionalList = ['optional_col1', 'optional_col2']
columnsList = ['required_col1', 'required_col2', lambda x: x in optionalList]
data = pd.read_csv('file.csv', usecols=columnsList)
But I don't even know if this makes sense at the Python level, because I'm unsure whether the lambda function is ever invoked when running this (it wasn't when I tried); pandas expects usecols to be either a list of labels or a single callable, not a mix of both.
Is there any way that I could get pandas to extract optional columns if there, but also have a set of columns that it always should extract from a dataset? Thank you – I appreciate all the help!
CodePudding user response:
I'm still unsure whether this specific question has an answer within the pandas API itself, but I've figured out a way to solve the problem. As I mentioned in the comments, I am running a for loop that reads in different datasets, applying this condition via usecols.
As @Michael S. suggested, try and except is the way to go here.
First, run through the dataset with the required columns, and then append the dataset name to a list.
# list of required columns
requiredList = ['required_col1', 'required_col2']

finalDatasets = []
for dataName in database:
    try:
        # a list-valued usecols raises ValueError if any column is missing
        data = pd.read_csv(dataName, usecols=requiredList)
    except ValueError:
        continue
    else:
        finalDatasets.append(dataName)
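For reference, pandas signals a missing usecols entry with a ValueError, which is why catching that specific exception (rather than a bare except that would also swallow unrelated failures) is the safer choice. A small check, again with an in-memory CSV standing in for a real file:

```python
import io
import pandas as pd

# hypothetical CSV that lacks required_col1
buf = io.StringIO("required_col2,other\n1,2\n")

try:
    pd.read_csv(buf, usecols=["required_col1", "required_col2"])
    caught = None
except ValueError as err:
    caught = type(err).__name__

print(caught)  # ValueError
```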
Since this new list has dropped the datasets that do not have the required columns, you can now run the lambda function to parse the optional columns:
columnsList = ['required_col1', 'required_col2', 'optional_col1', 'optional_col2']

for dataName in finalDatasets:
    # read in data; the callable usecols silently skips absent optional columns
    try:
        data = pd.read_csv(dataName, usecols=lambda x: x in columnsList)
    except ValueError:
        continue
    else:
        pass  # {insert analysis}
This way, you are able to parse through the datasets that have the required columns, even when you are unsure whether they have the optional columns (this is where the lambda comes in handy).
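The two passes above can also be collapsed into a single loop: read each file with the list form of usecols first (which enforces the required columns via ValueError), and only then re-read with the lambda to pick up whatever optional columns exist. The sketch below is self-contained, building two throwaway CSV files with `tempfile` in place of the real `database` of paths:

```python
import os
import tempfile
import pandas as pd

requiredList = ['required_col1', 'required_col2']
columnsList = requiredList + ['optional_col1', 'optional_col2']

# stand-in for the real database of file paths
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, 'good.csv')
bad = os.path.join(tmpdir, 'bad.csv')
with open(good, 'w') as f:
    f.write('required_col1,required_col2,optional_col1\n0,1,4\n')
with open(bad, 'w') as f:
    f.write('required_col2,optional_col2\n1,a\n')  # missing required_col1
database = [good, bad]

kept = {}
for dataName in database:
    try:
        # list form of usecols: raises ValueError if a required column is absent
        pd.read_csv(dataName, usecols=requiredList)
    except ValueError:
        continue
    # lambda form: silently skips optional columns that are not present
    kept[dataName] = pd.read_csv(dataName, usecols=lambda x: x in columnsList)

print([os.path.basename(p) for p in kept])  # ['good.csv']
```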