How to solve problem with dataset's variables [closed]

I'm facing the following problem when I try to run my regression model:

The two datasets are too big to post, so I will show how "merged_data" is built:

GemData <- read_dta(("C:/Users/I/Documents//GEM Dataset.dta"))

GlobeData <- read_excel("GLOBE-Phase-2-Aggregated-Societal-Culture-Data.xls")

> dput(head(reference_iso))
structure(list(name = c("Afghanistan", "Aland Islands", "Albania", 
"Algeria", "American Samoa", "Andorra"), alpha.3 = c("AFG", "ALA", 
"ALB", "DZA", "ASM", "AND")), row.names = c(NA, 6L), class = "data.frame")
> merged_data <- GlobeData %>% 
    left_join(reference_iso, by = c('Country Name' = 'name')) %>% 
    rename(iso3 = 'alpha.3') %>% 
    left_join(GemData, by = c('iso3' = 'cntry') )
> model1 <- lm(all_high_stat_entre ~ Uncertainty Avoidance Societal Practices ,data=merged_data)
Error: unexpected symbol in "model1 <- lm(all_high_stat_entre ~ Uncertainty Avoidance"

Any advice on why this error appears?

CodePudding user response：

As was mentioned, your "big variable names" cannot be used bare in a formula. I can't be certain this is the whole problem (the picture of the data does not give enough context), but I suspect all you need to do is enclose every variable name that contains a space in backticks, as in

model1 <- lm(all_high_stat_entre ~ `Uncertainty Avoidance Societal Practices`,
             data=merged_data)

Demonstration:

mt <- mtcars
names(mt)[2] <- "c yl"
head(mt, 3)
#      mpg  c yl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
# 2:  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
# 3:  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1

lm(mpg ~ c yl + disp, data = mt)
# Error: unexpected symbol in "lm(mpg ~ c yl"
# x

lm(mpg ~ `c yl` + disp, data = mt)
# Call:
# lm(formula = mpg ~ `c yl` + disp, data = mt)
# Coefficients:
# (Intercept)       `c yl`         disp  
#    34.66099     -1.58728     -0.02058
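If the variable names are held in a character vector rather than typed by hand, the same backtick rule applies when building the formula as text. Here is a minimal sketch, assuming the renamed mtcars copy from the demonstration; the names predictors, rhs, f and fit are only illustrative.

```r
# Rebuild the demonstration data: a copy of mtcars with a space in one name.
mt <- mtcars
names(mt)[2] <- "c yl"

# Predictor names as plain strings (illustrative choice of columns).
predictors <- c("c yl", "disp")

# Wrap every name in backticks so names with spaces parse as single symbols.
rhs <- paste(sprintf("`%s`", predictors), collapse = " + ")
f <- as.formula(paste("mpg ~", rhs))

fit <- lm(f, data = mt)
coef(fit)
```

Wrapping every name in backticks, even the ones that do not need it, is harmless and avoids having to test each name for validity first.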

Why?

Think of this from a language-parsing viewpoint: "tokens" that are literal numbers, variables, or functions must be delimited by something. In most cases, this needs to be an infix operator, a paren, or a comma.

Examples:

c(1 2) does not work since we want 1 and 2 to be distinct, so we use a comma.
mean 2 should be mean(2), where the paren separates them. We can optionally include spaces here: mean (2) and mean( 2) work just fine, so the spaces here are ignored.
If we have two variables x and y, then we can write x+y or x + y, where the infix operator clearly separates them.

In general, though, not many things (any?) in R are solely space-separated. 1 2, var1 var2, and similar are parsing errors. If we have a variable that has a space (or is otherwise not compliant with https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-are-valid-names_003f), then we must inform R how to include the spaces, and that is typically done with backticks.

`a b` <- 1
a b
# Error: unexpected symbol in "a b"
# x
`a b`
# [1] 1
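To check these parsing rules without triggering errors at the console, one option is to ask R's own parser whether a piece of text is valid. This is only a sketch: parses() is a hypothetical helper written for this answer, not an existing function, and the snippets are the examples discussed above.

```r
# Hypothetical helper: TRUE if `code` is syntactically valid R, FALSE otherwise.
# parse() only checks syntax; it does not evaluate anything.
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

snippets <- c("c(1 2)", "c(1, 2)", "a b", "`a b`", "mean (2)")

# Named logical vector: one TRUE/FALSE per snippet.
results <- vapply(snippets, parses, logical(1))
results
```

The snippets separated only by a space fail to parse, while the comma-separated, backticked, and parenthesized ones pass.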

In some places, we can use quotes, but backticks also work.

zz <- setNames(list(11, 12), c("a b", "c d"))
zz$`a b`
# [1] 11
zz$"c d"
# [1] 12
zz[["c d"]]
# [1] 12
zz[[`c d`]]
# Error: object 'c d' not found

Note that backticks are not always appropriate: in some positions they make R look for an object with that name. Had we run zz[[`a b`]] here, it would not have raised an error, but only because the previous code block created a variable named `a b` (with value 1); R would have found that variable and evaluated the call as zz[[1]], returning 11.
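The difference between a quoted name and a backticked name inside [[ ]] is easier to see when the variable holds a different index. In this sketch I deliberately assign 2 (not 1 as earlier) to `a b`, so the two lookups return different elements; by_name, by_variable and missing_var are illustrative names.

```r
zz <- setNames(list(11, 12), c("a b", "c d"))

# A quoted string is used directly as the element name.
by_name <- zz[["a b"]]        # element named "a b"

# A backticked name is evaluated as a variable before indexing.
`a b` <- 2
by_variable <- zz[[`a b`]]    # same as zz[[2]], the element named "c d"

# Without such a variable, the backticked form fails with an "object not found" error.
rm(list = "a b")
missing_var <- tryCatch(zz[[`a b`]], error = function(e) conditionMessage(e))
```

So inside [[ ]] a string is the safe choice when you mean the element's name.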

Getting back to your case: your variable names have spaces in them. Many data-reading functions in base R (and in some packages) have a check.names= or similarly purposed argument that converts a name such as a b into a.b, but readxl::read_excel does not do that by default, so it keeps the spaces. While I have mixed opinions on which is the better approach, I think spaces in variable names are a risk for new users. I do like that read_excel returns a tibble, and that the printed form of a tibble puts backticks around non-syntactic names (for visual reference if nothing else). For instance,

readxl::read_excel("Book2.xlsx")
# # A tibble: 1 x 3
#   `a b` `c d`    ef
#   <dbl> <dbl> <dbl>
# 1    11    22    33

which is a clear visual cue that the first two variable names need backtick enclosures.
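If you would rather not type backticks at all, another option is to convert the names to syntactic ones right after reading, which is roughly what check.names= does in the base readers. A minimal sketch with base R's make.names(); the data frame here is a made-up stand-in for the Excel import, not your actual data. (I believe read_excel also has a .name_repair argument that can do similar cleaning at read time, but check the readxl documentation before relying on that.)

```r
# Made-up stand-in for an imported sheet with spaces in two column names.
df <- data.frame(`a b` = 11, `c d` = 22, ef = 33, check.names = FALSE)
original_names <- names(df)

# Replace invalid characters with "." and keep the names unique.
names(df) <- make.names(names(df), unique = TRUE)
names(df)

# The cleaned names can now be used bare, for example in a formula or with $.
df$a.b
```

The trade-off is that the column names no longer match the headers in the spreadsheet, so keep the originals (as in original_names above) if you need them for labels.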