Dealing with character variables containing semicolons in CSV files-CodePudding

I have a file separated by semicolons in which one of the variables of type character contains semicolon inside it. The readr::read_csv2 function splits the contents of those variables that have semicolons into more columns, messing up the formatting of the file.

For example, when using read_csv2 to open the file below, Bill's age column will show jogging, not 41.

File:

name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32

Considering that the original file doesn't contain quotes around the character type variables, how can I import the file so that karate and jogging belong in the hobbies column?

CodePudding user response：

read.csv()

You can use the read.csv() function. But there would be some warning messages (or use suppressWarnings() to wrap around the read.csv() function). If you wish to avoid warning messages, using the scan() method in the next section.

library(dplyr)

read.csv("./path/to/your/file.csv", sep = ";", 
         col.names = c("name", "hobbies", "age", "X4")) %>% 
  mutate(hobbies = ifelse(is.na(X4), hobbies, paste0(hobbies, ";" ,age)), 
         age = ifelse(is.na(X4), age, X4)) %>% 
  select(-X4)

scan() file

You can first scan() the CSV file as a character vector first, then split the string with pattern ; and change it into a dataframe. After that, do some mutate() to identify your target column and remove unnecessary columns. Finally, use the first row as the column name.

library(tidyverse)
library(janitor)

semicolon_file <- scan(file = "./path/to/your/file.csv", character())

semicolon_df <- data.frame(str_split(semicolon_file, ";", simplify = T))

semicolon_df %>% 
  mutate(X4 = na_if(X4, ""), 
         X2 = ifelse(is.na(X4), X2, paste0(X2, ";" ,X3)), 
         X3 = ifelse(is.na(X4), X3, X4)) %>% 
  select(-X4) %>% 
  janitor::row_to_names(row_number = 1)

Output

   name        hobbies age
2   Jon        cooking  38
3  Bill karate;jogging  41
4 Maria        fishing  32

CodePudding user response：

Assuming that you have the columns name and age with a single entry per observation and hobbies with possible multiple entries the following approach works:

read in the file line by line instead of treating it as a table:

tmp <- readLines(con <- file("table.csv"))
close(con)

Find the position of the separator in every row. The entry before the first separator is the name the entry after the last is the age:

separator_pos <- gregexpr(";", tmp)
name <- character(length(tmp) - 1)
age <- integer(length(tmp) - 1)
hobbies <- vector(length=length(tmp) - 1, "list")

fill the three elements using a for loop:

# the first line are the colnames
for(line in 2:length(tmp)){
     # from the beginning of the row to the first";"
     name[line-1] <- strtrim(tmp[line], separator_pos[[line]][1] -1) 
     # between the first ";" and the last ";".
     # Every ";" is a different elemet of the list
     hobbies[line-1] <- strsplit(substr(tmp[line], separator_pos[[line]][1]  1,  
                                        separator_pos[[line]][length(separator_pos[[line]])]-1),";")
     #after the last ";", must be an integer
     age[line-1] <- as.integer(substr(tmp[line],separator_pos[[line]][length(separator_pos[[line]])] 1, 
                               nchar(tmp[line])))
}

Create a separate matrix to hold the hobbies and fill it rowwise:

hobbies_matrix <- matrix(NA_character_, nrow = length(hobbies), ncol = max(lengths(hobbies)))
for(line in 1:length(hobbies)) 
    hobbies_matrix[line,1:length(hobbies[[line]])] <-  hobbies[[line]]

Add all variable to a data.frame:

df <- data.frame(name = name, hobbies = hobbies_matrix, age = age)
> df
   name hobbies.1 hobbies.2 age
1   Jon   cooking      <NA>  38
2  Bill    karate   jogging  41
3 Maria   fishing      <NA>  32

CodePudding user response：

You could also do:

read.csv(text=gsub('(^[^;] );|;([^;] $)', '\\1,\\2', readLines('file.csv')))

   name        hobbies age
1   Jon        cooking  38
2  Bill karate;jogging  41
3 Maria        fishing  32

CodePudding user response：

Ideally you'd ask whoever generated the file to do it properly next time :) but of course this is not always possible.

Easiest way is probably to read the lines from the file into a character vector, then clean up and make a data frame by string matching.

library(readr)
library(dplyr)
library(stringr)

# skip header, add it later
dataset <- read_lines("your_file.csv", skip = 1)

dataset_df <- data.frame(name = str_match(dataset, "^(.*?);")[, 2], 
                         hobbies = str_match(dataset, ";(.*?);\\d")[, 2], 
                         age = as.numeric(str_match(dataset, ";(\\d )$")[, 2]))

Result:

   name        hobbies age
1   Jon        cooking  38
2  Bill karate;jogging  41
3 Maria        fishing  32

CodePudding user response：

Using the file created in the Note at the end

1) read.pattern can read this by specifying the pattern as a regular expression with the portions within parentheses representing the fields.

library(gsubfn)
read.pattern("hobbies.csv", pattern = '^(.*?);(.*);(.*)$', header = TRUE)
##    name        hobbies age
## 1   Jon        cooking  38
## 2  Bill karate;jogging  41
## 3 Maria        fishing  32

2) Base R Using base R we can read in the lines, put quotes around the middle field and then read it in normally.

L <- "hobbies.csv" |>
  readLines() |>
  sub(pattern = ';(.*);', replacement = ';"\\1";')
read.csv2(text = L)
##    name        hobbies age
## 1   Jon        cooking  38
## 2  Bill karate;jogging  41
## 3 Maria        fishing  32

Note

Lines <- "name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32
"
cat(Lines, file = "hobbies.csv")