This is in the context of reading .csv files. I want a set of "required" columns that pandas will always read and parse, and a set of "optional" columns that pandas will extract only if they are present in the .csv file; otherwise it should skip them (I am looping through many different .csv files and only want to keep the ones that meet these criteria).
Let's say I have data as below:
| required_col1 | required_col2 | optional_col1 | optional_col2
---|---|---|---|---
0 | 1 | 4 | a | d
1 | 2 | 5 | b | e
2 | 3 | 6 | c | f
I would like pandas to require a dataset to have required_col1 and required_col2. If optional_col1 is in the dataset, read it; if it is not, do not raise an error and just move on. The same goes for optional_col2. However, if the required columns are not in the dataset, I want pandas to raise an error.
I have attempted to use a lambda function, as such:
columnsList = ['required_col1', 'required_col2', 'optional_col1', 'optional_col2']
data = pd.read_csv('file.csv', usecols=lambda x: x in columnsList)
But this doesn't work, since it treats every listed column as optional: if a dataset has 'optional_col1' and 'required_col2', it will still be read without raising an error (even though 'required_col1' is not in the data).
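To make the failure concrete, here is a minimal reproduction (using `io.StringIO` in place of a real file, with hypothetical column names matching the example): a CSV missing `required_col1` is still read without complaint, because the callable form of `usecols` simply filters whatever columns happen to exist:

```python
import io
import pandas as pd

columnsList = ['required_col1', 'required_col2', 'optional_col1', 'optional_col2']

# a CSV that is missing required_col1 entirely
csv = io.StringIO("required_col2,optional_col1\n1,4\n2,5\n")

data = pd.read_csv(csv, usecols=lambda x: x in columnsList)
print(list(data.columns))  # ['required_col2', 'optional_col1'] -- no error raised
```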
I've even tried this:
optionalList = ['optional_col1', 'optional_col2']
columnsList = ['required_col1', 'required_col2', lambda x: x in optionalList]
data = pd.read_csv('file.csv', usecols=columnsList)
But I don't even know if this makes sense at the Python level, because I'm unsure whether the lambda function is ever invoked when running this (it wasn't when I tried); pandas expects usecols to be either a list of labels or a single callable, not a mix of both.
Is there any way that I could get pandas to extract optional columns if there, but also have a set of columns that it always should extract from a dataset? Thank you – I appreciate all the help!
CodePudding user response:
I'm still unsure whether this specific question has an answer within the pandas API itself, but I've figured out a way to solve the problem. As I mentioned in the comments, I am running a for loop that reads in different datasets, applying this condition via usecols.
As @Michael S. suggested, try and except is the way to go here.
First, run through the dataset with the required columns, and then append the dataset name to a list.
# list of required columns
requiredList = ['required_col1', 'required_col2']

finalDatasets = []
for dataName in database:
    try:
        # a list-valued usecols raises ValueError if any column is missing
        data = pd.read_csv(dataName, usecols=requiredList)
    except ValueError:
        continue
    else:
        finalDatasets.append(dataName)
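For reference, pandas signals a missing usecols entry with a ValueError, which is why catching that specific exception (rather than a bare except that would also swallow unrelated failures) is the safer choice. A small check, again with an in-memory CSV standing in for a real file:

```python
import io
import pandas as pd

# hypothetical CSV that lacks required_col1
buf = io.StringIO("required_col2,other\n1,2\n")

try:
    pd.read_csv(buf, usecols=["required_col1", "required_col2"])
    caught = None
except ValueError as err:
    caught = type(err).__name__

print(caught)  # ValueError
```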
Since this new list has dropped the datasets that do not have the required columns, you can now run the lambda function to parse the optional columns:
columnsList = ['required_col1', 'required_col2', 'optional_col1', 'optional_col2']

for dataName in finalDatasets:
    # read in data; the callable usecols silently skips absent optional columns
    try:
        data = pd.read_csv(dataName, usecols=lambda x: x in columnsList)
    except ValueError:
        continue
    else:
        pass  # {insert analysis}
This way, you are able to parse through the datasets that have the required columns, even when you are unsure whether they have the optional columns (this is where the lambda comes in handy).
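The two passes above can also be collapsed into a single loop: read each file with the list form of usecols first (which enforces the required columns via ValueError), and only then re-read with the lambda to pick up whatever optional columns exist. The sketch below is self-contained, building two throwaway CSV files with `tempfile` in place of the real `database` of paths:

```python
import os
import tempfile
import pandas as pd

requiredList = ['required_col1', 'required_col2']
columnsList = requiredList + ['optional_col1', 'optional_col2']

# stand-in for the real database of file paths
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, 'good.csv')
bad = os.path.join(tmpdir, 'bad.csv')
with open(good, 'w') as f:
    f.write('required_col1,required_col2,optional_col1\n0,1,4\n')
with open(bad, 'w') as f:
    f.write('required_col2,optional_col2\n1,a\n')  # missing required_col1
database = [good, bad]

kept = {}
for dataName in database:
    try:
        # list form of usecols: raises ValueError if a required column is absent
        pd.read_csv(dataName, usecols=requiredList)
    except ValueError:
        continue
    # lambda form: silently skips optional columns that are not present
    kept[dataName] = pd.read_csv(dataName, usecols=lambda x: x in columnsList)

print([os.path.basename(p) for p in kept])  # ['good.csv']
```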