It's a bit tricky to provide a reproducible example, but my issue is that I am bringing in a dataset of British Geological Society mineral production data (after some slight tidying in Excel).
Specifically, from this website I'm downloading a .xlsx of production data for the diamond commodity from 2010 - 2020 by all countries.
When I bring it into R via readxl, the first column name has random characters appended to it -- so instead of Country
, it reads \r\n\tCountry
.
From poking around at other answers, I feel like this could be some strange SQL artifact in the original dataset (?), and I read that you could set readxl with specific encoding to prevent this, but it seems like readxl doesn't take that argument anymore.
My code:
library(tidyverse)
library(readxl)
# Set working directory.--------------------------------------------------------
setwd("Filepath/Project")
# Load data.--------------------------------------------------------------------
df <- read_xlsx("Filepath/Diamond Datasheet.xlsx")
And the head(df) output:
# A tibble: 6 x 12
`\r\n\tCountry` `2010` `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Angola 8362139 8328519 8330996 8601696 8791340 9018942 9021767 9438802 8.41e6 9.15e6 7.73e6
2 Australia 9997752 7561487 8625996 11481749 9288118 13560795 13958000 17135000 1.40e7 1.22e7 9.98e6
3 Botswana 22019000 22903000 20478000 22597000 24658000 20824000 20954000 22900000 2.74e7 2.37e7 1.69e7
4 Brazil 25394 45536 46292 49166 56923 31826 183500 254896 2.51e5 1.66e5 1.25e5
5 Cameroon 6000 6000 5000 5000 6000 4500 3000 3500 3.6 e3 3.5 e3 4.2 e3
6 Canada 11773000 10795000 10529215 10561600 12082000 11677472 11103500 23198761 2.28e7 1.85e7 1.50e7
The \r\n\tCountry
bit is the weird part to me. Many thanks for any advice.
CodePudding user response:
Not sure what the issue is. The name of the variable is probably not proper for R to read in. Judging by the variable names this is the case, as you can see the years are quoted as 'Year' instead of Year. Variables that start with numbers or have odd spacing cause R to poop itself. Thankfully you can easily change the name with the rename
function in the tidyverse
package. As an example:
#### Library ####
library(tidyverse)
#### Make Column of Country Data ####
df <- data.frame(Weird.Name = c("America",
"Japan",
"Russia"))
df
This data frame also has a poor name:
Weird.Name
1 America
2 Japan
3 Russia
#### Rename ####
df %>%
rename(Country = Weird.Name)
Now it doesnt :)
Country
1 America
2 Japan
3 Russia
Just as a sidenote...variable names are best if they are in this format before importing: Variable.Name
or Variable_Name
.