Split column of long strings into multiple columns based on string position-CodePudding

I have a txt file I need to turn into a usable dataframe in R. An example row looks like this:

10040 1491 32006 820085011  .007  .009  .043 -.003  .008  .036 -.031 -.036  .076  .056  .124  .093 -.112 -.091  .034  .043  .00600  .01200  .004500000000  .042333333333  .0568  .0058 -.0542 -.0304  .08625  .05425  .088857142857  .116142857143 -.072714285714 -.115571428571  .02125  .04350 8.71250 8.71825 8.729666666667 8.749500000000 8.6866 8.6722

I would like to end up with a df with multiple columns that looks like this:

10040 1 49 1 3 2006 8 2008 50 1 1 .007 .009 .043 -.003 .008 .036 -.031 -.036 .076 .056 .124 .093 -.112 -.091  .034 .043 .00600 .01200 .004500000000 .042333333333 .0568 .0058 -.0542 -.0304 .08625  .05425 .088857142857 .116142857143 -.072714285714 -.115571428571 .02125 .04350 8.71250 8.71825 8.729666666667 8.749500000000 8.6866 8.6722

The data is not perfectly split by spaces, or I would know how to do that. I know the positions of where to split the string into multiple columns: positions 5, 7, 9, 10, 12, 16, 18, 22, etc. but was wondering if there is a way to do this without 50 lines of code? Perhaps using tidyr's separate function? Can't seem to find any documentation or examples explaining how to use the sep parameter with numeric positions.

CodePudding user response：

After doing some more testing, turns out I can just pass a list of numeric values to split at in order to separate at those positions.

Brief example:

testDF <- separate(grossTXT,
               V1,
               c('M2ID', 'SAMPLMAJ', 'B1PAGE_M2', 'B1PRSEX', 'B5PEEGDATE_MO'),
               sep = c(5,7,9,10,12),
               remove = TRUE,
               convert = TRUE
               )

This works perfectly and lets me title each new column as well.

CodePudding user response：

Use a function call read.fwf which reads Fixed Width Format files.

Ie.

read.fwf(your_file, diff(c(0,5, 7, 9, 10, 12, 16, 18, 22)))

Note that I had to use diff since you have positions, and the file reads widths.

eg. in your example, if x was a text in R, you could use:

 read.fwf(textConnection(x), diff(c(0,5, 7, 9, 10, 12, 16, 18, 22)))

     V1 V2 V3 V4 V5   V6 V7   V8
1 10040  1 49  1  3 2006  8 2008