I have some very dirty data that I need to read into R. The CSV file is comma-separated, but there are also commas in the first column of my data, which puts me in a tight spot. It looks something like this:
library(tibble)

df <- tribble(
~reportname,
"column1, column2, column3, column4",
"some...texts, are here with commas, www.facebook.com, 2000, 150",
"some-texts_again_here, www.twitter.com, 1000, 50"
)
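For testing, that sample can be written out as a plain text file first (a sketch; writeLines produces the raw, unquoted lines):
writeLines(df$reportname, "df.csv")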
I now want to read it with read_csv():
df <- read_csv('df.csv', skip = 1)
However, it does not put the data into the correct columns, since there are commas inside some rows of the first column. Is there a way to read the CSV file accurately?
CodePudding user response:
I think your efforts should go toward fixing the process or person that created that CSV, as it is fundamentally broken. The code below works, but it is inefficient, especially in R; the cost worsens with large data and R's global string hash (see https://www.brodieg.com/2019/02/18/an-unofficial-reference-for-internal-inspect/ for some discussion on this; more exists).
txt <- readLines("df.csv")
# txt <- c("column1, column2, column3, column4",
#          "some...texts, are here with commas, www.facebook.com, 2000, 150",
#          "some-texts_again_here, www.twitter.com, 1000, 50")
txtspl <- strsplit(txt, ",")
ncols <- length(txtspl[[1]]) - 1L # assuming the list of column names is safe
# rejoin all but the last 'ncols' fields into the first column; keep the rest as-is
out <- do.call(rbind.data.frame, lapply(txtspl[-1], function(z) c(paste(head(z, -ncols), collapse = ","), tail(z, ncols))))
names(out) <- txtspl[[1]]
# infer column types (integer, character, ...) from the strings
out[] <- lapply(out, type.convert, as.is = TRUE)
out
# column1 column2 column3 column4
# 1 some...texts, are here with commas www.facebook.com 2000 150
# 2 some-texts_again_here www.twitter.com 1000 50
str(out)
# 'data.frame': 2 obs. of 4 variables:
# $ column1 : chr "some...texts, are here with commas" "some-texts_again_here"
# $ column2: chr " www.facebook.com" " www.twitter.com"
# $ column3: int 2000 1000
# $ column4: int 150 50
You might prefer to trimws your string columns.
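For example (a sketch, applied to the out data frame built above):
out[] <- lapply(out, function(z) if (is.character(z)) trimws(z) else z)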
CodePudding user response:
Here are some different ways.
The first alternative makes use of the fact that regular expressions are greedy so the first .* will slurp up all text that it can while still allowing the rest of the pattern to match.
The second alternative replaces the last 3 commas with semicolons, also using the fact that .* is greedy.
The third approach uses pat defined in the first alternative and read.pattern from gsubfn, which combines the two operations.
The fourth alternative places double quotes around the first field and then reads it in using read.csv. It uses pat defined in the first alternative. It assumes that double quotes are not already used.
In the first two, if semicolon can appear in your file, replace it in the code with any other character that does not appear.
# 1
pat <- "(.*),(.*),(.*),(.*)"
read.csv(text = sub(pat, "\\1;\\2;\\3;\\4", df[[1]]), sep = ";", strip.white = TRUE)
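To see what the substitution does, here is the second line of the sample run through it (output shown as a comment):
sub(pat, "\\1;\\2;\\3;\\4", "some...texts, are here with commas, www.facebook.com, 2000, 150")
# [1] "some...texts, are here with commas; www.facebook.com; 2000; 150"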
# 2
Lines <- df[[1]]
for(i in 1:3) Lines <- sub("(.*),", "\\1;", Lines) # greedy .* means the last remaining comma is replaced on each pass
read.csv(text = Lines, sep = ";", strip.white = TRUE)
# 3
library(gsubfn)
read.pattern(text = df[[1]], pattern = pat, header = TRUE, strip.white = TRUE)
# 4
read.csv(text = sub(pat, '"\\1",\\2,\\3,\\4', df[[1]]), strip.white = TRUE)
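Since the question uses tribble and read_csv, here is a rough tidyverse-flavored sketch of the same greedy-regex idea (an assumption on my part that the raw lines sit in df$reportname as in the question; tidyr::extract with convert = TRUE plays the role of type.convert):
library(dplyr)
library(tidyr)
# drop the header row with [-1], then split on the same greedy pattern
tibble(line = df$reportname[-1]) |>
  extract(line, into = c("column1", "column2", "column3", "column4"),
          regex = "(.*),(.*),(.*),(.*)", convert = TRUE) |>
  mutate(across(where(is.character), trimws)) # drop the leading spaces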