Home > OS >  Remove special characters from dataframe names
Remove special characters from dataframe names

Time:01-21

In Python, I'm reading an excel file with multiple sheets, with the intention of each sheet being its own dataframe:

df = pd.read_excel('Book1.xlsx', sheet_name=None)

So to get the dictionary keys to each dataframe (or sheet) I can use: df.keys() which gives me each sheet name from the original Excel file: dict_keys(['GF-1', 'H_2 S-Z', 'GB-SF NZ'])

I can then assign each dictionary into its own dataframe using:

for key in df.keys():
    globals()[key] = df[key]

But, because the sheet names from the original Excel file contain special characters ( -, spaces, etc), I can't call up any of the dataframes individually:

H_2 S-Z.head()
        ^
SyntaxError: invalid syntax

I know that dataframe 'names' cannot contain special characters or start with numbers etc, so how do I remove those special characters? I don't think the dict_keys can be edited (e.g. using regex). Also thought about creating a list of the dataframes, then perhaps doing a regex for loop to iterate over each dataframe name, but not sure that it would assign the 'new' dataframe name back to each dataframe.

Can anyone help me?

CodePudding user response:

You can use re.sub with a dictcomp to get rid of the characters (-, , whitespace, ..) :

import re

dict_dfs = pd.read_excel("Book1.xlsx", sheet_name=None)

dict_dfs = {re.sub(r"[- \s]", "_", k): v for k,v in dict_dfs.items()} 

for key in dict_dfs.keys():
    globals()[key] = dict_dfs[key]

As suggested by @cottontail, you can also use re.sub(r"\W", "_", k).

NB: As a result (in the global scope), you'll have as much variables (pandas.core.frame.DataFrame objects) as there is worksheets in your Excel file.

print([(var, type(val)) for var, val in globals().items()
       if type(val) == pd.core.frame.DataFrame])

#[('GF-1', pandas.core.frame.DataFrame),
# ('H_2_S_Z', pandas.core.frame.DataFrame),
# ('GB_SF_NZ', pandas.core.frame.DataFrame)]

CodePudding user response:

globals() is already a dictionary (you can confirm by isinstance(globals(), dict)), so the individual sheets can be accessed as any dict value:

globals()['H_2 S-Z'].head()

etc.

That being said, instead of creating individually named dataframes, I would think that storing the sheets as dataframes in a single dictionary may be more readable and accessible for you down the road. It's already creating problems given you cannot name your dataframes with the same name as the sheet names. If you change the dataframe names, then you'll need another mapping that tells you which sheet name corresponds to which dataframe name, so it's a lot of work tbh. As you already have a dictionary of dataframes in df, why not access the individual sheets by df['H_2 S-Z'] etc.?

  • Related