Home > Software design >  R strsplit for uneven number of columns in a huge data set
R strsplit for uneven number of columns in a huge data set

Time:06-29

I have a huge data set with about 200 columns and 25k rows, with the separator ';'. The columns are of an uneven number. I read it in as a delimited txt file df <- read.delim(~path/data.txt, sep=";", header = FALSE) which reads nicely as a table. My issue is, many of the rows are so long that in the txt file they often spill onto new lines and it is here that it is not recognising that it should continue on the same row. Therefore the distinguished columns have information that belongs else where.

Each observation of data is a dbl.

I have created a new example below for ease of reading, therefore it is not possible to simply sort conditions into columns.

***EDIT: x, y and z contain spatial coordinates, but I have substituted them for their corresponding letters for ease of reading. The data is X-profile data giving me coordinates of the centre point along a line, followed by offsets of 1m (up to 100m either side of 0, the centre line) in each column with its corresponding height ***

My data ends up looking something like this:

    [c1]  [c2]  [c3]  [c4]  [c5]  [c6]  [c7]  [c8]  [c9]
[1]  x     y     z     1     2     3     N/A   N/A   N/A
[2]  x     y     z     1     2     3     4     5      6
[3]  7     8     9     10    N/A   N/A   N/A   N/A   N/A
[4]  x     y     z     1     2     3     4     5      7
[5]  7     8     9     N/A   N/A   N/A   N/A   N/A   N/A
[6]  x     y     z     1     2     3     N/A   N/A   N/A
[7]  x     y     z     1     2     3     4     5     N/A

And I'd like it to look like this:

    [c1]  [c2]  [c3]  [c4]  [c5]  [c6]  [c7]  [c8]  [c9]  [c10]  [c11]  [c12]  [c13]
[1]  x     y     z     1     2     3     N/A   N/A   N/A    N/A    N/A    N/A   N/A 
[2]  x     y     z     1     2     3     4     5      6     7      8      9      10
[3]  x     y     z     1     2     3     4     5      6     7      8      9     N/A    
[4]  x     y     z     1     2     3     N/A   N/A   N/A    N/A    N/A    N/A   N/A 
[5]  x     y     z     1     2     3     4     5     N/A    N/A    N/A    N/A   N/A

I have tried strsplit(as.character(df), split = "\n", fixed = TRUE) and it returns an error that it is not a character string. I have tried the same function with split = "\t" and split = "\r" and it returns the same error. Each attempt takes around half an hour to process so I was also wondering if there is a more efficient way to do this.

I hope I have explained clearly my aim.

EDIT The text file is similar to the following example:

x;y;z;1;2;3

x;y;z;1;2;3;4;5;6;

7;8;9;10

x;y;z;1;2;3;4;5;6;

7;8;9

x;y;z;1;2;3;4

x;y;z;1;2;3;4;5;6;

7;8;9;10;11;12;13;

14;15

In some cases a number is split between the previous line and that below: E.G.

101;102;103;10

4;105;106

This layout is exactly how it is being read into R.

CodePudding user response:

using data.table:

dt <- data.table(df)

dt[, grp := cumsum(c1 == "x")]
dt <- merge(dt[c1 == "x"], dt[c1 == 7], by = "grp", all = T)[, grp := NULL]

names(dt) <- paste0("c", 1:ncol(dt))

Resulting to:

   c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18
1:  x  y  z  1  2  3 NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
2:  x  y  z  1  2  3  4  5  6   7   8   9  10  NA  NA  NA  NA  NA
3:  x  y  z  1  2  3  4  5  7   7   8   9  NA  NA  NA  NA  NA  NA
4:  x  y  z  1  2  3 NA NA NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
5:  x  y  z  1  2  3  4  5 NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

CodePudding user response:

Use scan which omits empty lines by default. Next, find positions that begin with "x" using findInterval, split there and paste them together. Then basically the ususal strsplit, some length adaptions etc. and you got it.

r <- scan('foo.txt', 'A', qui=T)

r <- split(r, findInterval(seq_len(length(r)), grep('^x', r))) |>
  lapply(paste, collapse='') |>
  lapply(strsplit, ';') |>
  lapply(el) |>
  {\(.) lapply(., `length<-`, max(lengths(.)))}() |>
  do.call(what=rbind) |>
  as.data.frame()

r
#    V1 V2 V3 V4 V5 V6   V7   V8   V9  V10  V11  V12  V13  V14  V15  V16  V17  V18
#  1  x  y  z  1  2  3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#  2  x  y  z  1  2  3    4    5    6    7    8    9   10 <NA> <NA> <NA> <NA> <NA>
#  3  x  y  z  1  2  3    4    5    6    7    8    9 <NA> <NA> <NA> <NA> <NA> <NA>
#  4  x  y  z  1  2  3    4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#  5  x  y  z  1  2  3    4    5    6    7    8    9   10   11   12   13   14   15

Data:

writeLines(text='x;y;z;1;2;3

x;y;z;1;2;3;4;5;6;

7;8;9;10

x;y;z;1;2;3;4;5;6;

7;8;9

x;y;z;1;2;3;4

x;y;z;1;2;3;4;5;6;

7;8;9;10;11;12;13;

14;15', 'foo.txt')
  • Related