I use the same piece of code to import multiple dataframes. Usually they have the same column names with different data. However, sometimes the column names have a varying number of spaces before or after them.
df = pd.read_csv(
    file_path,
    delimiter="|",
    low_memory=True,
    dtype=schema,
    usecols=schema.keys(),
)
The schema of the file is in a different file:
file_schema = {
    " Age ": str,
    " Name ": str,
    " Country ": str,
}
In some other cases, there are no spaces before or after the names:
file_schema = {
    "Age": str,
    "Name": str,
    "Country": str,
}
Currently, with a single schema, if the spaces around the column names in the file do not match those in the schema, I get errors related to usecols.
I'm wondering if there's a way to write the column names once, in a single schema file, so that it works no matter how many spaces appear before or after the names in the data files?
CodePudding user response:
I think it should be possible to match the column names with
pd.read_csv(..., usecols=lambda x: x.strip() in schema.keys())
and then strip them afterwards with
df.columns = df.columns.str.strip()
or, even better, pass the names explicitly with
pd.read_csv(..., header=0, names=schema.keys())
if you know that the file contains exactly the columns declared in schema, in that order.
I'm not sure whether dtype=schema
will then immediately cause the next problem, though: with the first approach, the dtype keys are matched against the column names as they appear in the file, spaces included.
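To make the first suggestion concrete, here is a minimal sketch. It assumes the schema is written with stripped names, and it sidesteps the dtype question by reading everything as str and casting with astype after the headers are cleaned. The helper name read_with_loose_headers and the sample data are made up for illustration; they are not from the question.

```python
import io

import pandas as pd

# Schema written once, with clean (stripped) column names.
schema = {"Age": str, "Name": str, "Country": str}


def read_with_loose_headers(source, schema):
    """Read a pipe-delimited file whose headers may be padded with spaces."""
    df = pd.read_csv(
        source,
        delimiter="|",
        # Read as text first: dtype keys would have to match the raw
        # (possibly padded) header names, so the cast is deferred.
        dtype=str,
        # Keep a column if its stripped name is declared in the schema.
        usecols=lambda name: name.strip() in schema,
    )
    # Normalise the headers so they match the schema keys.
    df.columns = df.columns.str.strip()
    # Select the schema columns in schema order, then apply the dtypes.
    return df[list(schema)].astype(schema)


# Made-up sample inputs: one with padded headers and an extra column,
# one with clean headers.
padded = io.StringIO(" Age | Name | Country | Extra \n31|Ann|FR|x\n")
clean = io.StringIO("Age|Name|Country\n42|Bob|DE\n")

df_padded = read_with_loose_headers(padded, schema)
df_clean = read_with_loose_headers(clean, schema)
print(df_padded)
print(df_clean)
```

Both calls should give the same three column names, so one schema file covers both header styles. If a declared column is missing from a file entirely, the final selection raises a KeyError, which is probably what you want to see anyway.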