How to read a data file that separated by spaces instead of commas which continues onto many lines?-CodePudding

I'm struggling to try to read a file where the delimiter is a simple space(didn't work with read_tsv) and it continues onto the next row even though that isn't the end of the columns. Below is an example of what is supposed to be 2 rows of data (row 1 is bolded to help discern the two)

1 15943882 63 1 -9 -9 -9 -27 1 145 1 233 -9 50 20 1 0 1 2 2 3 1981 0 0 0 0 0 1 10.5 6 13 150 60 190 90 145 85 0 0 2.3 3 -9 -9 0 -9 -9 -9 -9 -9 -9 6 -9 -9 -9 2 16 1981 0 1 1 1 -9 1 -9 1 -9 1 1 1 1 1 1 1 -9 -9 0 -9 -9 -9 -9 -9 -9 -9 -9 -9 0 0 0 0 name 2 15964847 67 1 -9 -9 -9 -27 4 160 1 286 -9 40 40 0 0 1 2 3 5 1981 0 1 0 0 0 1 9.5 6 13 108 64 160 90 160 90 1 0 1.5 2 -9 -9 3 -9 -9 -9 -9 -9 -9 3 -9 -9 -9 2 5 1981 2 1 2 2 -9 2 -9 1 -9 1 1 1 1 1 1 1 -9 -9 0 -9 -9 -9 -9 -9 -9 -9 -9 -9 0 0 0 0 name

My attempt involved using read_delim with the following code


read_delim(file, delim = " ", col_names = FALSE)

But what it outputs is only 7 columns instead of 76 columns. Short of going into the data and trying to fix it by tediously backspacing, is there a way in R that I can just read it to get the right number of columns for my rows of data.

Thanks

CodePudding user response：

If the data set is really coming in as one uninterrupted string, we'll have to get creative. Per your example, each line ends with "...name". You could try the following:

library(tidyverse)

ends <- gregexpr('(?<=name)', txt, perl = T)[[1]] # each line ends in 'name', find those locations
starts <- c(1, ends[-length(ends)]) # create the starting point of each line
lines <- lapply(1:length(ends), function(s) substr(txt, starts[s], ends[s])) # separate the string into lines

df <- read_table(file = paste(lines, collapse = '\n'), col_names = F) # parse as a table

> glimpse(df)

Rows: 2
Columns: 91
$ X1  <dbl> 1, 2
$ X2  <dbl> 15943882, 15964847
$ X3  <dbl> 63, 67
$ X4  <dbl> 1, 1
$ X5  <dbl> -9, -9
$ X6  <dbl> -9, -9
$ X7  <dbl> -9, -9
$ X8  <dbl> -27, -27
$ X9  <dbl> 1, 4
$ X10 <dbl> 145, 160
...80 more columns...

Saving your example string as the variable txt:

txt <- '1 15943882 63 1 -9 -9 -9 -27 1 145 1 233 -9 50 20 1 0 1 2 2 3 1981 0 0 0 0 0 1 10.5 6 13 150 60 190 90 145 85 0 0 2.3 3 -9 -9 0 -9 -9 -9 -9 -9 -9 6 -9 -9 -9 2 16 1981 0 1 1 1 -9 1 -9 1 -9 1 1 1 1 1 1 1 -9 -9 0 -9 -9 -9 -9 -9 -9 -9 -9 -9 0 0 0 0 name 2 15964847 67 1 -9 -9 -9 -27 4 160 1 286 -9 40 40 0 0 1 2 3 5 1981 0 1 0 0 0 1 9.5 6 13 108 64 160 90 160 90 1 0 1.5 2 -9 -9 3 -9 -9 -9 -9 -9 -9 3 -9 -9 -9 2 5 1981 2 1 2 2 -9 2 -9 1 -9 1 1 1 1 1 1 1 -9 -9 0 -9 -9 -9 -9 -9 -9 -9 -9 -9 0 0 0 0 name'

CodePudding user response：

As was mentioned in a comment, the data file parses out to 90 columns, not 76, so you'll need to check on this point.

Looking at the file, it seems each 'row' of data is actually split across 12 lines in a 7-8-8-8-8-8-8-8-8-8-8-3 pattern. This appears consistent throughout, so the following code can be used to reshape it into a data frame (1,541 rows, 90 columns).

library(stringr)
library(dplyr)
txt <- readLines("new.data")
line_n <- tibble(from=seq(1,18481,12)) %>% mutate(to=from 11) ## Saving line numbers for the apparent start and end of each observation.

txt_lines <- list()
for(i in 1:nrow(line_n)){
  txt_lines[[i]] <- txt[line_n$from[i]:line_n$to[i]] %>% lapply(str_split," ") %>% unlist() %>% matrix(nrow=1,ncol=90) %>% as_tibble()
} ##Loop reshapes each set of 12 lines into 1 row and splits the variables using a space as the delimiter.

txt_lines <- txt_lines %>% bind_rows() ##Bind rows into one dataframe
txt_lines[,1:89] <- txt_lines[,1:89] %>% apply(2,as.numeric) ##Convert all except the last column into numeric columns