I have a csv file that stores three different record types of different lengths:
- A - The A record type has 7 columns
- B - The B record type has 20 columns
- C - The C record type has 30 columns
The file's records appear in alphabetical order. I have used the following readr script to import my file:
names <- paste0("col", 1:30)   # "col1", "col2", ..., "col30"
df <- read_csv("myfile", col_names = names)
Unfortunately, when readr imports the file it only imports 7 columns. For record A the data is fine; for records B and C, however, the 7th column consists of a mashup of columns 7-20 in the case of record B and 7-30 for record C. My thought was that by specifying col1 through col30, readr would read all 30 columns.
I believe the issue might be that Record A takes the following form:
"A","USA","USA","20060727","9","D","TRUE"
It does not include 23 more blank/null fields, so perhaps readr thinks its work is done.
Does anyone know a workaround for this issue so that I can use readr to import all my data?
CodePudding user response:
If you only have one file, or just a few, you could manually insert a header line that names all 30 columns before importing.
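That header line can also be prepended from within R rather than in an editor. A minimal sketch, assuming a file laid out like the demo below; the file names `test.csv` and `test_with_header.csv` are placeholders:

```r
# Build a 30-column header and write it in front of the existing lines
header <- paste(paste0("col", 1:30), collapse = ",")
writeLines(c(header, readLines("test.csv")), "test_with_header.csv")

# readr now sees 30 fields on the first line and pads the shorter
# A and B records with NA; it will report parsing problems for the
# short rows, which are expected here and can be inspected with
# readr::problems()
df <- readr::read_csv("test_with_header.csv")
```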
A way to work around this issue from within R, without modifying the original .csv files, is to import the data as text lines (one column holding each line's full content) and process it in R.
This is the content of the .csv file used for demo (no colnames):
1,2,3,4,5,6,7
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
I import this with the read_lines() function from the readr package as a column of a new tibble and apply a few processing steps:
library(dplyr)
library(stringr)
library(tidyr)
library(readr)
dplyr::tibble(rawdata = readr::read_lines('your_path/test.csv')) %>%
  # insert a line identifier
  dplyr::mutate(fileline = dplyr::row_number(),
                # remove the " text delimiters, which are handled correctly when
                # importing as .csv but not when importing as raw text lines
                rawdata = stringr::str_remove_all(rawdata, pattern = '\"')) %>%
  # split according to the delimiter (a comma in this case)
  tidyr::separate_rows(rawdata, sep = ',') %>%
  # group by the line identifier
  dplyr::group_by(fileline) %>%
  # number the fields within each group (imported line)
  dplyr::mutate(colnum = dplyr::row_number()) %>%
  # release the grouping to avoid unwanted behaviour downstream
  dplyr::ungroup() %>%
  # make the table wide
  tidyr::pivot_wider(names_from = colnum, values_from = rawdata)
# you might want to rename the columns at this stage and/or drop the "fileline" column
# A tibble: 3 x 31
fileline `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11` `12` `13` `14` `15` `16` `17` `18` `19` `20` `21` `22` `23`
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 2 3 4 5 6 7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 NA NA NA
3 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# ... with 7 more variables: `24` <chr>, `25` <chr>, `26` <chr>, `27` <chr>, `28` <chr>, `29` <chr>, `30` <chr>
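The same result can be had with a shorter pipeline: tidyr::separate() can split the raw lines directly into a fixed set of columns and NA-pad the short rows via its fill = "right" argument, skipping the separate_rows()/pivot_wider() round trip. A sketch against the same demo file (column names are placeholders):

```r
library(dplyr)
library(stringr)
library(tidyr)
library(readr)

tibble(rawdata = read_lines('your_path/test.csv')) %>%
  # remove the " text delimiters, as above
  mutate(rawdata = str_remove_all(rawdata, pattern = '\"')) %>%
  # split each line into exactly 30 columns; rows with fewer fields
  # are NA-filled on the right
  separate(rawdata, into = paste0("col", 1:30), sep = ",", fill = "right")
```

Note that, like the pipeline above, this yields all-character columns, so you may want readr::type_convert() afterwards.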
I also had a look at the fread() function from the data.table package, which can be configured to do what you need: set the argument fill = TRUE since, as the documentation points out, "[...] in case the rows have unequal length, blank fields are implicitly filled [...]":
data.table::fread("your_path/test.csv", fill = TRUE)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30
1: 1 2 3 4 5 6 7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 NA NA NA NA NA NA NA NA NA NA
3: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30