What's the best way to work with datasets that contain special characters in their column names?

Time:12-09

I am working with some large datasets that contain special characters in their column names. The column names look something like: "@c_age1619_da * ((df.age >= 16) & (df.age <= 19))" or "sovtoll_available == False". What would be the best way to work with these names? Should I keep the names as they are or rename them to more R-friendly names? When I call them in cases like df$value, R mistakenly interprets the column name as a function!

CodePudding user response:

The only advantage of keeping the non-standard names is that you can use them directly as labels in a plot or table. But they make the data much harder to work with, and the original names can always be reintroduced as labels later. If you do keep them, you can refer to a non-standard name by wrapping it in backticks, e.g.,

df$`@c_age1619_da`

Some editors (like RStudio) will correctly auto-complete these non-standard names, making them somewhat easier to work with, but still not as nice as standard names.
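As a minimal sketch of backtick access, assuming a small made-up data frame df that carries one of the column names from the question (the data values are hypothetical):

```r
# Hypothetical toy data; check.names = FALSE keeps the non-standard name as-is
df <- data.frame(
  age = c(15, 17, 22),
  `sovtoll_available == False` = c(TRUE, FALSE, TRUE),
  check.names = FALSE
)

# With $, the name must be wrapped in backticks
via_dollar <- df$`sovtoll_available == False`

# With [[ ]] or [ , ], an ordinary quoted string works
via_brackets <- df[["sovtoll_available == False"]]
adults <- df[df$age >= 16, "sovtoll_available == False"]
```

Without the backticks, R parses the name as an expression (here a comparison with ==), which is why df$value-style access appears to misbehave.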

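Renaming them to standard names is generally better. Many functions that read in data will do this automatically (read.csv, for example, does so by default through check.names = TRUE). You can use the make.names function to convert non-standard names to standard ones: it replaces each invalid character with a "." and prepends "X" if the name would otherwise start with an invalid character. Like this: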

names(my_data) <- make.names(names(my_data))
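A short sketch of what this conversion does to the two names quoted in the question. The unique = TRUE argument and the lookup vector are additions for illustration, not part of the original answer:

```r
# The two column names quoted in the question
old_names <- c(
  "@c_age1619_da * ((df.age >= 16) & (df.age <= 19))",
  "sovtoll_available == False"
)

# make.names() prepends "X" if necessary and turns every invalid
# character into "."; unique = TRUE also de-duplicates the results
new_names <- make.names(old_names, unique = TRUE)
new_names[2]  # "sovtoll_available....False"

# Keep a new-name -> old-name lookup so the originals can be
# reintroduced as plot or table labels later
lookup <- setNames(old_names, new_names)
```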

But generally the best approach is to write meaningful names manually. sovtoll_available....False isn't a very friendly name either, compared to something like sovtoll_unavailable.
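One way manual renaming might look in base R, assuming a toy data frame with the two names from the question. The data values and the name age_16_19 are hypothetical choices for illustration:

```r
# Hypothetical toy data carrying the two column names from the question
df <- data.frame(a = c(0, 1), b = c(TRUE, FALSE))
names(df) <- c(
  "@c_age1619_da * ((df.age >= 16) & (df.age <= 19))",
  "sovtoll_available == False"
)

# Map old name -> new name; the new names are illustrative only
renames <- c(
  "@c_age1619_da * ((df.age >= 16) & (df.age <= 19))" = "age_16_19",
  "sovtoll_available == False" = "sovtoll_unavailable"
)

# Rename only the columns listed in the map, leaving any others untouched
hit <- names(df) %in% names(renames)
names(df)[hit] <- unname(renames[names(df)[hit]])
```

Keeping the map in one place also documents where each friendly name came from.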
