Why are the column names of a dataframe getting changed automatically?

Time:12-10

I wanted to estimate the parameters of a linear regression model in R. The model is of the type: y = alpha + beta*x + epsilon. The task required me to store the parameter estimates systematically in a data frame. I therefore created a blank data frame and then appended a row to it for each pair of estimates.

df<-data.frame(alpha=double(),beta=double()) #blank dataframe
for(i in 1:1000)
{
    sample_dat<-sampling_model(100,2,5,16,-2,2) #generating 100 samples
    sample_model<-lm(y~x,data=sample_dat) #estimating the linear model
    df<-rbind(df,sample_model$coefficients) #appending the values of the parameters
}

Basically, I have a function sampling_model that generates random values for the x_i's and epsilon_i's (each following some distribution) and computes the y_i's from them using fixed values of alpha and beta.

In each iteration of the above loop, fitting a linear model to the sample gives a pair of estimates of the parameters (alpha and beta). I want to store them in a data frame, which I've named df.

Initially (before starting the loop), names(df) returned:

#[1] "alpha" "beta"

However, after appending all those estimates of alpha and beta to df (i.e. after the loop), names(df) returned:

#[1] "X2.4932268478702"  "X5.53432974825338"

I am stuck here, asking myself why this is happening. Note that these names are not constant either: if I run the above loop one more time and then check the column names, the numbers are all different. Is something overflowing, or have I made a mistake in appending the values to the data frame?

I can (and did) get around the problem of these 'ambiguous' names simply by:

names(df)<-c('alpha','beta')

But this does not hide the fact that I did something wrong while appending the estimated parameters to df, and I cannot figure out what. Can anyone help me understand how this can be avoided?


I am also attaching my sampling_model function for convenience:

sampling_model<-function(n,alpha,beta,variance,min_range,max_range)
{
    x<-runif(n,min=min_range,max=max_range) #n uniform variates as x_i
    epsilon<-rnorm(n,mean=0,sd=sqrt(variance)) #n normal variates as epsilon_i
    y<-alpha+beta*x+epsilon #the dependent variable y
    return(data.frame(x=x,y=y)) #returns dataframe of x and y
}

User response:

I'm not sure why this happens. It's strange behavior, and it seems to occur only when the first rbind argument has no rows. In any case, rbinding data frames together in a loop is inefficient and should be avoided. It is famously the 2nd Circle of R Hell in The R Inferno.
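One plausible explanation, going by the documentation for rbind's data frame method (?rbind): zero-row arguments are dropped before binding, so the empty df contributes no column names and the result is built from the coefficient vector alone. The sketch below reproduces the situation with made-up coefficient values (est is a hypothetical stand-in for sample_model$coefficients) and shows that binding a one-row data frame with explicit names keeps the intended columns.

```r
# Minimal reproduction; the numbers are illustrative, not taken from the question
df <- data.frame(alpha = double(), beta = double())  # zero-row data frame
est <- c("(Intercept)" = 2.49, x = 5.53)             # stand-in for sample_model$coefficients

# Same call as in the question: the zero-row df is dropped before binding
broken <- rbind(df, est)
names(broken)  # inspect: the question reports names derived from the values

# Binding a one-row data frame with explicit names keeps alpha and beta
fixed <- rbind(df, data.frame(alpha = est[[1]], beta = est[[2]]))
names(fixed)
```

This only addresses the renaming. Growing a data frame one row at a time remains slow for the reason given above.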

The simplest alternative is to initialize your data to the full size, and then fill in each row:

n <- 1000
df <- data.frame(alpha=double(n),beta=double(n)) #preallocated data frame: n rows of zeros
for(i in 1:n)
{
    sample_dat <- sampling_model(100,2,5,16,-2,2) #generating 100 samples
    sample_model <- lm(y~x,data=sample_dat) #estimating the linear model
    df[i, ] <- sample_model$coefficients #filling in the values of the parameters
}
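If you would rather not manage the row index yourself, a variation on the same idea is to let vapply collect the coefficients and build the data frame once at the end. This is only a sketch: sampling_model is repeated here, with the plus signs written out, so that the block runs on its own, and the names n_sim and coefs as well as the seed are arbitrary choices of mine.

```r
# Copy of the asker's sampling_model so that this block is self-contained
sampling_model <- function(n, alpha, beta, variance, min_range, max_range) {
  x <- runif(n, min = min_range, max = max_range)        # n uniform variates as x_i
  epsilon <- rnorm(n, mean = 0, sd = sqrt(variance))     # n normal variates as epsilon_i
  data.frame(x = x, y = alpha + beta * x + epsilon)
}

set.seed(1)    # arbitrary seed, only for reproducibility
n_sim <- 1000  # number of simulated fits

# vapply returns a 2 x n_sim numeric matrix: one column of coefficients per fit
coefs <- vapply(seq_len(n_sim), function(i) {
  sample_dat <- sampling_model(100, 2, 5, 16, -2, 2)
  unname(coef(lm(y ~ x, data = sample_dat)))  # intercept first, then slope
}, numeric(2))

# Build the data frame once, with the column names set explicitly
df <- data.frame(alpha = coefs[1, ], beta = coefs[2, ])
names(df)
colMeans(df)  # should land close to the true values 2 and 5
```

Because the column names are assigned when the data frame is created, they cannot be altered by anything that happens inside the loop.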