Creating a variable of randomly sampled values from an existing variable (with a larger population)

I am not very good at R, and I have a problem. I want to run a linear regression between two variables from different datasets, but I run into the problem that one dataset is much bigger than the other. To get around that, I want to create a smaller variable with the same number of observations as the smaller dataset, randomly selected from the larger dataset's variable. What is the command for that? If any further details are needed, please let me know! Thank you so much for your help!

I tried to fit a linear regression on the two datasets, but because one is bigger than the other, it failed with this error:

Error in model.frame.default(formula = lobby_expenditure$expend ~ compustat$lct,  :
variable lengths differ (found for 'compustat$lct')

CodePudding user response：

Here is a simple example: y comes from d2, and a random sample of rows from d1 (as many as d2 has) is selected for x.

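set.seed(1)  # make the random draws reproducible
d1 <- data.frame(x = rnorm(100))  # larger dataset
d2 <- data.frame(y = rnorm(10))   # smaller dataset

# pick nrow(d2) random rows of d1 (without replacement) so both variables have equal length
x_sub <- d1[sample(nrow(d1), nrow(d2)), "x"]
lm(d2$y ~ x_sub)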
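
Applied to the names in the question, the same idea might look like the sketch below. The real lobby_expenditure and compustat data are not shown in the question, so the two data frames here are simulated stand-ins (an assumption); only the column names expend and lct come from the error message.

```r
# Stand-in data: simulated frames that only mimic the shape of the real
# lobby_expenditure and compustat datasets (assumption, not the real data).
set.seed(42)
compustat <- data.frame(lct = rnorm(500, mean = 100, sd = 20))
lobby_expenditure <- data.frame(expend = rnorm(50, mean = 10, sd = 2))

# Draw as many row indices from the larger frame as the smaller frame has rows.
# sample() draws without replacement by default.
idx <- sample(nrow(compustat), nrow(lobby_expenditure))
lct_sub <- compustat$lct[idx]

# Both vectors now have the same length, so lm() no longer raises
# "variable lengths differ".
fit <- lm(lobby_expenditure$expend ~ lct_sub)
length(lct_sub) == nrow(lobby_expenditure)
```

Note that this pairs each expend value with an arbitrary lct value. If the two datasets describe the same units and share an identifier column, joining them on that column (for example with merge()) is likely a better basis for a regression than random subsampling.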

CodePudding user response：

To draw a random sample of rows, use dplyr::sample_n. Example dataset:

library(readr)  # provides read_table
library(dplyr)  # provides sample_n

df2 <- read_table('Individual Site
1          A
2          B
3          A
4          C
5          C
6          B
7          A
8          B
9          C')

With sample_n(df2, 2), where 2 is the number of rows you want, you get that many randomly chosen rows. Your output may differ from the one below, since the selection is random.

# A tibble: 2 x 2
  Individual Site 
       <dbl> <chr>
1          4 C    
2          5 C
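
In dplyr 1.0.0 and later, sample_n is superseded by slice_sample, which takes the sample size through its n argument. A minimal sketch of the same draw, rebuilding the example data with data.frame so it does not depend on readr; the seed value is arbitrary and only there to make the draw repeatable:

```r
library(dplyr)

# Same example data as above, built directly instead of parsed from text.
df2 <- data.frame(
  Individual = 1:9,
  Site = c("A", "B", "A", "C", "C", "B", "A", "B", "C")
)

set.seed(1)  # arbitrary seed so repeated runs pick the same rows

# Draw 2 rows at random, without replacement (the default).
picked <- slice_sample(df2, n = 2)
print(picked)
```

To match the size of a smaller dataset, as in the question, the sample size can be taken from that dataset, e.g. slice_sample(big, n = nrow(small)), where big and small are placeholder names for the two data frames.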