Home > Blockchain >  Why does the variable name in my imported dataset have random characters added to it?
Why does the variable name in my imported dataset have random characters added to it?

Time:09-29

It's a bit tricky to provide a reproducible example, but my issue is that I am bringing in a dataset of British Geological Society mineral production data (after some slight tidying in Excel).

Specifically, from this website I'm downloading a .xlsx of production data for the diamond commodity from 2010 - 2020 by all countries.

When I bring it into R via readxl, the first column name has random characters appended to it -- so instead of Country, it reads \r\n\tCountry.

From poking around at other answers, I feel like this could be some strange SQL artifact in the original dataset (?), and I read that you could set readxl with specific encoding to prevent this, but it seems like readxl doesn't take that argument anymore.

My code:

library(tidyverse)
library(readxl)

# Set working directory.--------------------------------------------------------

setwd("Filepath/Project")

# Load data.--------------------------------------------------------------------

df <- read_xlsx("Filepath/Diamond Datasheet.xlsx")

And the head(df) output:

# A tibble: 6 x 12
  `\r\n\tCountry`   `2010`   `2011`   `2012`   `2013`   `2014`   `2015`   `2016`   `2017`  `2018` `2019` `2020`
  <chr>              <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>  <dbl>  <dbl>
1 Angola           8362139  8328519  8330996  8601696  8791340  9018942  9021767  9438802  8.41e6 9.15e6 7.73e6
2 Australia        9997752  7561487  8625996 11481749  9288118 13560795 13958000 17135000  1.40e7 1.22e7 9.98e6
3 Botswana        22019000 22903000 20478000 22597000 24658000 20824000 20954000 22900000  2.74e7 2.37e7 1.69e7
4 Brazil             25394    45536    46292    49166    56923    31826   183500   254896  2.51e5 1.66e5 1.25e5
5 Cameroon            6000     6000     5000     5000     6000     4500     3000     3500  3.6 e3 3.5 e3 4.2 e3
6 Canada          11773000 10795000 10529215 10561600 12082000 11677472 11103500 23198761  2.28e7 1.85e7 1.50e7

The \r\n\tCountry bit is the weird part to me. Many thanks for any advice.

CodePudding user response:

Not sure what the issue is. The name of the variable is probably not proper for R to read in. Judging by the variable names this is the case, as you can see the years are quoted as 'Year' instead of Year. Variables that start with numbers or have odd spacing cause R to poop itself. Thankfully you can easily change the name with the rename function in the tidyverse package. As an example:

#### Library ####
library(tidyverse)

#### Make Column of Country Data ####
df <- data.frame(Weird.Name = c("America",
                                "Japan",
                                "Russia"))
df

This data frame also has a poor name:

  Weird.Name
1    America
2      Japan
3     Russia

#### Rename ####
df %>% 
  rename(Country = Weird.Name)

Now it doesnt :)

  Country
1 America
2   Japan
3  Russia

Just as a sidenote...variable names are best if they are in this format before importing: Variable.Name or Variable_Name.

  • Related