How to create a dummy variable from a price range

I am trying to create a price range variable and fit an lm model with the price range as a dummy variable. So I did:

> #price range 
> airbnblisting$PriceRange[price <= 500] <- 0 
> airbnblisting$PriceRange[price > 500 & price <= 1000] <- 1
> airbnblisting$PriceRange[price > 1000] <- 2

Then run:

> r1 <- lm(review_scores_rating ~ PriceRange, data=airbnblisting)
> summary(r1)

But the coefficient for PriceRange shows as NA. Any idea how I can get PriceRange working properly?

    Min      1Q  Median      3Q     Max 
-4.7619 -0.0319  0.1281  0.2381  0.2381 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.761914   0.003115    1529   <2e-16 ***
PriceRange        NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

price example:

$102.00 
$179.00 
$1140.00 
$104.00 
$539.00 
$1090.00 
$149.00 
$44.00 
$1500.00 
$200.00 
$153.00 
$58.00 
$350.00

Answer:

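The dollar sign ($) indicates that price is stored as character strings, not numbers. When a character vector is compared with a number, R converts the number to a string and compares the two as text, so the thresholds do not act numerically. You need to clean your data first.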
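A quick way to confirm this diagnosis is to inspect the column type before doing any comparisons. The sketch below uses a small made-up vector, price_raw, standing in for the real column; it is an assumption for illustration, not the asker's data.

```r
# Hypothetical stand-in for a price column read in with currency symbols.
price_raw <- c("$102.00", "$539.00", "$1140.00")

class(price_raw)       # "character"
is.numeric(price_raw)  # FALSE

# as.numeric() cannot parse the "$", so every value becomes NA (with a warning,
# suppressed here to keep the output quiet).
price_num <- suppressWarnings(as.numeric(price_raw))
all(is.na(price_num))  # TRUE
```

If class() reports "character" (or "factor"), the values need to be converted before any numeric thresholds are applied.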

Currently you're doing

dat$PriceRange[dat$price <= 500] <- 0 
dat$PriceRange[dat$price > 500 & dat$price <= 1000] <- 1
dat$PriceRange[dat$price > 1000] <- 2

which yields all zeros:

dat$PriceRange
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0

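A predictor that never varies is perfectly collinear with the intercept, so lm cannot estimate its coefficient and reports NA: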

lm(review ~ PriceRange, dat)$coe
# (Intercept)  PriceRange 
#   2.538462          NA
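The NA can be reproduced in isolation with any predictor that takes a single value. The vectors y and x below are made up for illustration and are not part of the original data.

```r
# Made-up response and a predictor that never varies.
y <- c(4, 4, 1, 3, 2)
x <- rep(0, length(y))

fit <- lm(y ~ x)
coef(fit)
# The x coefficient is NA; the intercept is simply mean(y).

is.na(coef(fit)[["x"]])                                  # TRUE
isTRUE(all.equal(coef(fit)[["(Intercept)"]], mean(y)))   # TRUE
```

This is the same "not defined because of singularities" situation that summary() reports in the question.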

Now we clean price with gsub, removing the $ sign (which needs to be escaped in a regular expression) and the comma used as a thousands separator. The | in the pattern means "or".

dat <- transform(dat, price=as.numeric(gsub('\\$|,', '', price)))
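Because as.numeric turns anything it cannot parse into NA, it may be worth checking the cleaned values before relying on them. The following is a minimal sketch on made-up strings; the character class [$,] is an alternative way to write the same pattern, since $ is literal inside brackets.

```r
# Made-up price strings, including one with a thousands separator.
raw <- c("$102.00", "$44.10", "$1,258.00")

# Strip "$" and "," and convert to numeric.
clean <- as.numeric(gsub("[$,]", "", raw))
clean  # values: 102.0, 44.1, 1258.0

# Same result as the escaped-alternation pattern used in the answer.
identical(clean, as.numeric(gsub("\\$|,", "", raw)))  # TRUE

# Stop early if any value failed to convert.
stopifnot(!anyNA(clean))
```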

Now price is numeric, so the comparisons work as intended:

dat$PriceRange[dat$price <= 500] <- 0 
dat$PriceRange[dat$price > 500 & dat$price <= 1000] <- 1
dat$PriceRange[dat$price > 1000] <- 2

dat$PriceRange
# [1] 0 0 2 0 1 2 0 0 2 0 0 0 2 0

And lm should work.

lm(review ~ PriceRange, dat)$coe
# (Intercept)  PriceRange 
#   2.5350318  -0.1656051

More simply, you could use cut to create the variable (assuming the data is already clean). cut returns a factor; as.integer gives its level codes 1, 2, 3, so subtract 1 to get the same 0, 1, 2 coding as above.

dat <- transform(dat,
                 PriceRange=as.integer(cut(price, c(0, 500, 1000, Inf))) - 1L)
lm(review ~ PriceRange, dat)$coe
# (Intercept)  PriceRange 
#   2.5350318  -0.1656051

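Note that this codes a categorical variable as a single numeric predictor, which assumes the expected review changes by the same amount from one price range to the next. That may be statistically problematic; treating PriceRange as a factor avoids the assumption.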
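One way to address this caveat is to keep the ranges as a factor, so that lm builds one dummy column per non-reference level. The sketch below re-creates the answer's data with price already numeric; the labels low, mid, and high are names chosen here for illustration.

```r
# The answer's sample data, with price already cleaned to numeric.
dat <- data.frame(
  review = c(4, 4, 1, 3, 2, 2, 3, 0, 2, 3, 2, 3, 4, 1),
  price  = c(102, 179, 1140, 104, 539, 1090, 149, 44.1, 1500,
             200, 153, 58, 1258, 350)
)

# cut() returns a factor; the intervals are (0,500], (500,1000], (1000,Inf].
dat$PriceRange <- cut(dat$price, c(0, 500, 1000, Inf),
                      labels = c("low", "mid", "high"))

# lm() expands the factor into dummy variables, with "low" as the reference.
fit <- lm(review ~ PriceRange, data = dat)
names(coef(fit))
# "(Intercept)" "PriceRangemid" "PriceRangehigh"
```

Each coefficient is then the difference in mean review between that range and the reference level low, with no assumption that the steps between ranges are equal.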

Data:

dat <- structure(list(review = c(4L, 4L, 1L, 3L, 2L, 2L, 3L, 0L, 2L, 
3L, 2L, 3L, 4L, 1L), price = c("$102.00", "$179.00", "$1140.00", 
"$104.00", "$539.00", "$1090.00", "$149.00", "$44.10", "$1500.00", 
"$200.00", "$153.00", "$58.00", "$1,258.00", "$350.00")), class = "data.frame", row.names = c(NA, 
-14L))