I am trying to create a price range and fit an lm model against the price range dummy variable. So I did:
> #price range
> airbnblisting$PriceRange[price <= 500] <- 0
> airbnblisting$PriceRange[price > 500 & price <= 1000] <- 1
> airbnblisting$PriceRange[price > 1000] <- 2
Then run:
> r1 <- lm(review_scores_rating ~ PriceRange, data=airbnblisting,)
> summary(r1)
But the result shows NA for PriceRange. Any idea how I can get PriceRange working properly?
Min 1Q Median 3Q Max
-4.7619 -0.0319 0.1281 0.2381 0.2381
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.761914 0.003115 1529 <2e-16 ***
PriceRange NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
price example:
$102.00
$179.00
$1140.00
$104.00
$539.00
$1090.00
$149.00
$44.00
$1500.00
$200.00
$153.00
$58.00
$350.00
CodePudding user response:
The dollar sign $ indicates you have character strings, not numbers. You need to clean your data first.
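A quick way to confirm this diagnosis is to inspect the column type before doing any comparisons. The sketch below uses a few made-up values in the same format as the question's price column; the vector name price and the helper variable converted are only for illustration.

```r
# Hypothetical sample values in the same format as the question's price column
price <- c("$102.00", "$539.00", "$1,258.00")

class(price)        # "character"
is.numeric(price)   # FALSE

# Direct conversion fails because of the "$" and "," characters:
# every element becomes NA (with a coercion warning, suppressed here)
converted <- suppressWarnings(as.numeric(price))
converted           # NA NA NA
```

If class() reports "character" (or "factor"), threshold comparisons such as price <= 500 will not behave like numeric comparisons.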
Currently you're doing
dat$PriceRange[dat$price <= 500] <- 0
dat$PriceRange[dat$price > 500 & dat$price <= 1000] <- 1
dat$PriceRange[dat$price > 1000] <- 2
which yields all zeros, because the character prices are compared as strings rather than as numbers:
dat$PriceRange
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0
therefore:
lm(review ~ PriceRange, dat)$coe
# (Intercept) PriceRange
# 2.538462 NA
Now we clean price with gsub, removing $ (needs to be escaped) and , (the thousands separator); the | in the pattern means "or".
dat <- transform(dat, price=as.numeric(gsub('\\$|,', '', price)))
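To see what the pattern does in isolation, here is a small sketch on made-up strings. The names price_raw, price_num and price_alt are hypothetical; the character class "[$,]" is shown only as an alternative way of writing the same pattern, since $ is literal inside brackets and needs no escaping there.

```r
# Hypothetical raw values, including one with a thousands separator
price_raw <- c("$102.00", "$1,258.00", "$44.10")

# Pattern from the answer: escaped dollar sign OR comma
price_num <- as.numeric(gsub("\\$|,", "", price_raw))

# Alternative spelling of the same pattern as a character class
price_alt <- as.numeric(gsub("[$,]", "", price_raw))

price_num   # 102.0 1258.0 44.1

# Sanity checks: both patterns agree and nothing was turned into NA
stopifnot(identical(price_num, price_alt))
stopifnot(!anyNA(price_num))
```

Checking anyNA() after the conversion is a cheap way to catch values in other formats (for example a different currency symbol) that the pattern did not anticipate.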
Now price is numeric, so the comparisons work as intended:
dat$PriceRange[dat$price <= 500] <- 0
dat$PriceRange[dat$price > 500 & dat$price <= 1000] <- 1
dat$PriceRange[dat$price > 1000] <- 2
dat$PriceRange
# [1] 0 0 2 0 1 2 0 0 2 0 0 0 2 0
And lm should work:
lm(review ~ PriceRange, dat)$coe
# (Intercept) PriceRange
# 2.5350318 -0.1656051
More easily, you could use cut to create the dummy variable (assuming the data is already clean).
dat <- transform(dat,
                 PriceRange=as.numeric(cut(price, c(0, 500, 1000, Inf),
                                           labels=0:2)))
lm(review ~ PriceRange, dat)$coe
# (Intercept) PriceRange
# 2.7006369 -0.1656051
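The intercept here differs from the earlier fit (2.7006369 versus 2.5350318) while the slope is identical, and the difference equals exactly one slope. This is consistent with as.numeric() on a factor returning the internal level codes 1, 2, 3 rather than the labels 0, 1, 2, so the predictor is shifted by one. A minimal sketch with made-up prices (the names f, codes and labels_num are only for illustration):

```r
# Three hypothetical prices, one per bin
f <- cut(c(102, 539, 1140), c(0, 500, 1000, Inf), labels = 0:2)

# as.numeric() on a factor gives the internal integer codes
codes <- as.numeric(f)
codes        # 1 2 3

# Going through as.character() recovers the labels themselves
labels_num <- as.numeric(as.character(f))
labels_num   # 0 1 2
```

If the 0/1/2 coding matters, converting via as.character() first should reproduce the coefficients of the manual approach.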
Note that you are coding a categorical variable as continuous, which might be statistically problematic!
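One possible way to address this, sketched here as an assumption rather than as the required approach, is to keep the price range as a factor so that lm estimates a separate effect for each range relative to a baseline. The data frame below uses the review values and already-cleaned prices from the Data section; the level labels "low", "mid" and "high" are made up for illustration.

```r
# Review scores and cleaned numeric prices from the Data section
dat <- data.frame(
  review = c(4L, 4L, 1L, 3L, 2L, 2L, 3L, 0L, 2L, 3L, 2L, 3L, 4L, 1L),
  price  = c(102, 179, 1140, 104, 539, 1090, 149, 44.1, 1500,
             200, 153, 58, 1258, 350)
)

# Keep the bins as a factor; intervals are (0,500], (500,1000], (1000,Inf]
dat$PriceRange <- cut(dat$price, c(0, 500, 1000, Inf),
                      labels = c("low", "mid", "high"))

# lm expands the factor into dummy variables, with "low" as the baseline
fit <- lm(review ~ PriceRange, data = dat)
names(coef(fit))
# "(Intercept)" "PriceRangemid" "PriceRangehigh"
```

With this coding the intercept is the mean review in the baseline range, and each remaining coefficient is the difference of that range's mean from the baseline, so no linear spacing between ranges is assumed.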
Data:
dat <- structure(list(review = c(4L, 4L, 1L, 3L, 2L, 2L, 3L, 0L, 2L,
3L, 2L, 3L, 4L, 1L), price = c("$102.00", "$179.00", "$1140.00",
"$104.00", "$539.00", "$1090.00", "$149.00", "$44.10", "$1500.00",
"$200.00", "$153.00", "$58.00", "$1,258.00", "$350.00")), class = "data.frame", row.names = c(NA,
-14L))