How to solve problem with dataset's variables [closed]

Time:10-09

I'm facing the following problem when I try to run my regression model:

The two datasets are too big to post, so I will give a view of "merged_data":

merged_data

GemData <- read_dta(("C:/Users/I/Documents//GEM Dataset.dta"))

GlobeData <- read_excel("GLOBE-Phase-2-Aggregated-Societal-Culture-Data.xls")

> dput(head(reference_iso))
structure(list(name = c("Afghanistan", "Aland Islands", "Albania", 
"Algeria", "American Samoa", "Andorra"), alpha.3 = c("AFG", "ALA", 
"ALB", "DZA", "ASM", "AND")), row.names = c(NA, 6L), class = "data.frame")
> merged_data <- GlobeData %>% 
    left_join(reference_iso, by = c('Country Name' = 'name')) %>% 
    rename(iso3 = 'alpha.3') %>% 
    left_join(GemData, by = c('iso3' = 'cntry') )
> model1 <- lm(all_high_stat_entre ~ Uncertainty Avoidance Societal Practices ,data=merged_data)
Error: unexpected symbol in "model1 <- lm(all_high_stat_entre ~ Uncertainty Avoidance"

Any advice on this error?

CodePudding user response:

As was mentioned, your "big variable names" cannot be referenced ad-hoc in a formula. While I don't know if this is right (pic of data does not include enough context), I suspect all you need to do is enclose all space-including variables in backticks, as in

model1 <- lm(all_high_stat_entre ~ `Uncertainty Avoidance Societal Practices`,
             data=merged_data)

Demonstration:

mt <- mtcars
names(mt)[2] <- "c yl"
head(mt, 3)
#      mpg  c yl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
# 2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
# 3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1

lm(mpg ~ c yl + disp, data = mt)
# Error: unexpected symbol in "lm(mpg ~ c yl"
# x

lm(mpg ~ `c yl` + disp, data = mt)
# Call:
# lm(formula = mpg ~ `c yl` + disp, data = mt)
# Coefficients:
# (Intercept)       `c yl`         disp  
#    34.66099     -1.58728     -0.02058  
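
If the column names arrive as strings (for instance from names(merged_data)), the same backtick quoting can be applied programmatically. A minimal sketch, assuming the renamed mt data from the demonstration above; the predictors vector and the sprintf() wrapping are illustrative choices, not part of the original answer:

```r
# Recreate the demonstration data: mtcars with a space in one column name
mt <- mtcars
names(mt)[2] <- "c yl"

# Column names held as plain strings (illustrative selection)
predictors <- c("c yl", "disp")

# Wrap every name in backticks so the formula parser treats it as one symbol
quoted <- sprintf("`%s`", predictors)

# reformulate() builds the formula: mpg ~ `c yl` + disp
f <- reformulate(quoted, response = "mpg")
fit <- lm(f, data = mt)
coef(fit)
```

The coefficients should match those printed above, since this is the same model written a different way.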

Why?

Think of this from a language-parsing viewpoint: "tokens" that are literal numbers, variables, or functions must be delimited by something. In most cases, this needs to be an infix operator, a paren, or a comma.

Examples:

  • c(1 2) does not work since we want 1 and 2 to be distinct, so we use a comma.

  • mean 2 should be mean(2), where the paren separates them. We can optionally include spaces here, mean (2) and mean( 2) work just fine, so the spaces here are ignored.

  • if we have two variables x and y, then we can do x + y or x+y, where the infix operator clearly separates them.

In general, though, not many things (any?) in R are solely space-separated. 1 2, var1 var2, and similar are parsing errors. If we have a variable that has a space (or is otherwise not compliant with https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-are-valid-names_003f), then we must inform R how to include the spaces, and that is typically done with backticks.

`a b` <- 1
a b
# Error: unexpected symbol in "a b"
# x
`a b`
# [1] 1
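
One way to see which names need backticks is to compare each name with what base R's make.names() would turn it into; a name that changes is not syntactically valid. A small sketch (the nms vector is made up for illustration, reusing the two column names from the question):

```r
# Candidate column names; only some are syntactically valid in R
nms <- c("all_high_stat_entre",
         "Uncertainty Avoidance Societal Practices",
         "a b", "ef", "2x")

# make.names() leaves valid names untouched and alters invalid ones,
# so a difference flags a name that must be backtick-quoted
needs_backticks <- make.names(nms) != nms

nms[needs_backticks]
```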

In some places, we can use quotes, but backticks also work.

zz <- setNames(list(11, 12), c("a b", "c d"))
zz$`a b`
# [1] 11
zz$"c d"
# [1] 12
zz[["c d"]]
# [1] 12
zz[[`c d`]]
# Error: object 'c d' not found

Note that backticks are not always appropriate: in some locations, they make R look for an object with that name. Had we done zz[[`a b`]] here, it would not have raised an error, but only because the previous code block created a variable named `a b`; R would have found it and resolved the call to zz[[1]] (and therefore 11).
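
The difference between the two kinds of quoting can be sketched as follows; this only restates the example above in one self-contained block, and the variable col is introduced purely for illustration:

```r
zz <- setNames(list(11, 12), c("a b", "c d"))
`a b` <- 1   # a variable whose own name contains a space

# Backticks refer to the variable `a b`; its value is 1, so this is zz[[1]]
by_value <- zz[[`a b`]]

# Quotes are a string, so this looks up the list element named "a b"
by_name <- zz[["a b"]]

# A string stored in an ordinary variable works the same way as the quoted form
col <- "c d"
by_var <- zz[[col]]
```

Here by_value and by_name happen to agree only because the variable `a b` holds 1 and the element named "a b" is the first one.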

Getting back to your case, your variable names have spaces in them. Many data-reading functions in base R (and in some packages) have a check.names= or similarly purposed argument that converts a name such as a b into a.b, but readxl::read_excel does not do that by default, so it keeps the spaces. While I'm of mixed opinion on which is the better option, I think having spaces in variable names is a risk for new users. I do like that read_excel returns a tibble, and that the printed form of a tibble tends to include (for visual reference if nothing else) backticks around non-syntactic names. For instance,

readxl::read_excel("Book2.xlsx")
# # A tibble: 1 x 3
#   `a b` `c d`    ef
#   <dbl> <dbl> <dbl>
# 1    11    22    33

which is a clear visual cue that the first two variable names need backtick enclosures.
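
If you would rather avoid backticks altogether, the names can be converted after reading. A minimal sketch with base R's make.names(), using a made-up data frame that stands in for the read_excel() result (readxl also documents a .name_repair argument for read_excel, which is not used here):

```r
# Stand-in for the imported data; check.names = FALSE keeps the spaces
df <- data.frame(`a b` = 11, `c d` = 22, ef = 33, check.names = FALSE)

# Convert every column name into a syntactically valid, unique one
names(df) <- make.names(names(df), unique = TRUE)

names(df)
# [1] "a.b" "c.d" "ef"
```

Applied to the question, names(merged_data) <- make.names(names(merged_data)) should let the formula refer to Uncertainty.Avoidance.Societal.Practices without backticks.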
