Home > Software engineering >  Debug lapply / loop script over n dfs
Debug lapply / loop script over n dfs

Time:10-13

I have 61 dfs with uniform structure. Eventually I would like to bind them all in one large df but before I’d like to:

select 4 columns from all df = df[ ,c(3,8,13,17)]
change three of the columns from factor to character using df$[columnName] = as.character(df$columnName)

Addition: all the dfs in the global environment have names starting with “9” (number 9). and end with ".csv’

The script that I have prepared is:

process_df = function( df_name ) {
  # import the CSV
  df = read.csv( df_name )
  
  # modify the df and return the result
  filtered = df[ ,c(3,8,13,17) ]
  filtered$Name = as.character( filtered$Name)
  filtered$Zipcode = as.character(filtered$Zipcode)
  filtered$Address = as.character(filtered$Address)
}

# loop over each of the file names
files = list.files( pattern = "*.csv" )
results_list = lapply( files, process_df )

# combine the results
result = do.call( "rbind", results_list )

but I get an error after running results_list = lapply(files, process_df):

Error in $<-.data.frame(*tmp*, "Address", value = character(0)) : replacement has 0 rows, data has 133652

CodePudding user response:

Issue Found

One of the dfs had one column out of place so the subset didn't grab "uniform" columns causing the error (probably). After fixing that, no errors appeared and I compiled 4 mil addresses! Thank you all for your wonderful suggestions, they made the script much cleaner!

In short, the script works as long as the .csv files in the directory are uniform. I guess a better script would grab the columns based on their names and not their position!

CodePudding user response:

As stated by Leon Samson, the problem is that the process_df function as written returns the contents of the filtered$Address column (a character vector), when in fact you should be returning the entire data.frame filtered

process_df = function( df_name ) {
  # import the CSV
  df = read.csv( df_name )
  
  # modify the df and return the result
  filtered = df[ ,c(3,8,13,17) ]
  filtered$Name = as.character( filtered$Name)
  filtered$Zipcode = as.character(filtered$Zipcode)
  filtered$Address = as.character(filtered$Address)

  # return data.frame "filtered"
  return(filtered)
}

# loop over each of the file names
files = list.files( pattern = "*.csv" )
results_list = lapply( files, process_df )

# combine the results
result = do.call( "rbind", results_list )

For a much faster version of your code, try data.table::rbindlist to combine your data.frames...

process_df = function( df_name ) {
  # import the CSV
  df = read.csv( df_name )
  
  # modify the df and return the result
  filtered = df[ ,c(3,8,13,17) ]
  filtered$Name = as.character( filtered$Name)
  filtered$Zipcode = as.character(filtered$Zipcode)
  filtered$Address = as.character(filtered$Address)

  # return data.frame "filtered"
  return(filtered)
}

# loop over each of the file names
files = list.files( pattern = "*.csv" )
results_list = lapply( files, process_df )

# combine the results
result = data.table::rbindlist( results_list )
result = as.data.frame( result )
  • Related