Colnames unexpectedly updating variable in R

Time:03-02

I'm trying to get a list of column names that have been added after the initial csv load. If I am not updating the variable after column names are added, then how are they being added to the variable?

I would expect that only Name and Age would get printed from my_cols, but it is printing isJon as well.

library(data.table)

Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)

df <- data.table(Name, Age)

my_cols <- colnames(df)

print(my_cols)

df[,isJon:=ifelse(Name=="Jon", 1, 0)]

print(my_cols)

CodePudding user response:

There are at least two things going on here:

  • R is lazy about copying objects. Assigning my_cols <- colnames(df) changes nothing, so R does not duplicate the vector of names; my_cols simply refers to the same vector that is stored in the frame's attributes. Only when you do something to my_cols that "could" change it does R copy the vector into a new one (copy-on-modify), and from then on it no longer changes when the original frame is updated.

  • data.table tends to do things in-place with its reference semantics, so when it adds a column with :=, the internal vector of column names is extended in-place, contrary to R's normal way of doing things. Normally, adding a column to a data.frame creates a new vector of names.

    Compare base::data.frame, where adding a column creates a new vector of column names, so our my_cols does not magically stay updated:

    Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
    Age <- c(23, 41, 32, 58, 26)
    df <- data.frame(Name, Age)
    my_cols <- colnames(df)
    print(my_cols)
    # [1] "Name" "Age" 
    df <- transform(df, isJon=ifelse(Name=="Jon", 1, 0))
    print(my_cols)
    # [1] "Name" "Age" 
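
One way to check the sharing described above for yourself is data.table's address() helper, which reports where an object lives in memory. This is only a rough sketch with a shortened two-row table (the data here is made up for illustration); no output is shown because addresses differ from session to session, and the in-place behaviour may depend on your data.table version.

```r
library(data.table)

# Small illustrative table, not the asker's full data
df <- data.table(Name = c("Jon", "Bill"), Age = c(23, 41))
my_cols <- colnames(df)

# If these two lines print the same address, my_cols and the
# table's names attribute are one and the same vector
print(address(my_cols))
print(address(names(df)))

df[, isJon := ifelse(Name == "Jon", 1, 0)]

# An unchanged address plus a longer vector means the names
# were extended in place rather than copied
print(address(my_cols))
print(length(my_cols))
```

If the address printed for my_cols is the same before and after the := call, nothing was ever copied, which is why the new name shows up in it.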
    

There are a couple of ways you can get these two things to work in the direction you were heading:

  1. Use copy() on the vector, which forces it to be a new copy (yes, good name) of the vector:

    Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
    Age <- c(23, 41, 32, 58, 26)
    df <- data.table(Name, Age)
    my_cols <- copy(colnames(df))
    print(my_cols)
    # [1] "Name" "Age" 
    df[,isJon:=ifelse(Name=="Jon", 1, 0)]
    print(my_cols)
    # [1] "Name" "Age" 
    
  2. Do "something" to the vector, making R think it should copy-on-write:

    Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
    Age <- c(23, 41, 32, 58, 26)
    df <- data.table(Name, Age)
    my_cols <- colnames(df)[]
    print(my_cols)
    # [1] "Name" "Age" 
    df[,isJon:=ifelse(Name=="Jon", 1, 0)]
    print(my_cols)
    # [1] "Name" "Age" 
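
Coming back to the original goal (listing the columns added after the initial load), here is a minimal sketch that takes a copy() of the names up front and compares against it later with base R's setdiff(). The variable names initial_cols and added_cols are made up for illustration, and the comparison uses "Jon" so that the new column is not all zeros.

```r
library(data.table)

Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
df <- data.table(Name, Age)

# Snapshot the column names as an independent copy
initial_cols <- copy(colnames(df))

# Add a column by reference; the snapshot is unaffected
df[, isJon := ifelse(Name == "Jon", 1, 0)]

# Columns present now that were not in the snapshot
added_cols <- setdiff(colnames(df), initial_cols)
print(added_cols)
# [1] "isJon"
```

Calling setdiff() at the point of use, rather than holding on to colnames(df) itself, means the result always reflects the table's current columns.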
    