How to read a dirty CSV file in R?


I have some very dirty data that I need to read into R. The CSV file is comma-separated, but there are also commas inside the first column of my data, which puts me in a tight spot. It looks something like this:

library(tibble)
library(readr)

df <- tribble(
  ~reportname, 
  "column1, column2, column3, column4",
  "some...texts, are here with commas, www.facebook.com, 2000, 150",
  "some-texts_again_here, www.twitter.com, 1000, 50"
)
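
So the snippets below can be run as-is, the example lines can be written out to a file first (just a sketch; my real data is of course already in df.csv):

# sketch: write the example lines to df.csv for reproducibility
writeLines(df$reportname, "df.csv")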

I now want to read it with read_csv:

df <- read_csv('df.csv', skip = 1)

However, it does not place the data into the correct columns, since some rows have commas inside the first column. Is there a way to read the CSV file accurately?
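
A quick field count shows the problem: splitting each raw line on commas gives 4 fields for the header but 5 for the first data row (a sketch, assuming the lines were written to df.csv as above):

# sketch: count comma-separated fields per raw line of df.csv
lengths(strsplit(readLines("df.csv"), ","))
# [1] 4 5 4  -- the first data row has one field too many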

CodePudding user response:

I think your effort is best spent fixing the process (or person) that created that CSV, as it is fundamentally broken. The code below works, but it is inefficient, especially in R and especially with large data, due in part to R's global string hash (see https://www.brodieg.com/2019/02/18/an-unofficial-reference-for-internal-inspect/ for some discussion; more exists).

txt <- readLines("df.csv")
# txt <- c("column1, column2, column3, column4",
#          "some...texts, are here with commas, www.facebook.com, 2000, 150",
#          "some-texts_again_here, www.twitter.com, 1000, 50")

txtspl <- strsplit(txt, ",")
# number of data columns after the first; assumes the header row itself is safe
ncols <- length(txtspl[[1]]) - 1L
# re-join the leading pieces of each data row into column1, keep the last ncols fields as-is
out <- do.call(rbind.data.frame, lapply(txtspl[-1], function(z) c(paste(head(z, -ncols), collapse = ","), tail(z, ncols))))
names(out) <- txtspl[[1]]
# convert numeric-looking columns to numeric
out[] <- lapply(out, type.convert, as.is = TRUE)
out
#                              column1           column2  column3  column4
# 1 some...texts, are here with commas  www.facebook.com     2000      150
# 2              some-texts_again_here   www.twitter.com     1000       50
str(out)
# 'data.frame': 2 obs. of  4 variables:
#  $ column1 : chr  "some...texts, are here with commas" "some-texts_again_here"
#  $  column2: chr  " www.facebook.com" " www.twitter.com"
#  $  column3: int  2000 1000
#  $  column4: int  150 50

You might prefer to trimws your string columns.
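
A minimal way to do that (a sketch), trimming both the values and the slightly padded column names:

# sketch: strip the whitespace left over from splitting on ","
names(out) <- trimws(names(out))
out[] <- lapply(out, function(z) if (is.character(z)) trimws(z) else z)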

CodePudding user response:

Here are some different ways.

The first alternative makes use of the fact that regular expressions are greedy, so the first .* will slurp up all the text it can while still allowing the rest of the pattern to match.

The second alternative replaces the last 3 commas with semicolons, also using the fact that .* is greedy.

The third approach uses pat defined in the first alternative and read.pattern from gsubfn, which combines the operations.

The fourth alternative places double quotes around the first field and then reads it in using read.csv. It uses pat defined in the first alternative, and assumes that double quotes are not already used in the data.

In the first two, if a semicolon can appear in your file, replace it in the code with any other character that does not appear.

# 1
pat <- "(.*),(.*),(.*),(.*)"
read.csv(text = sub(pat, "\\1;\\2;\\3;\\4", df[[1]]), sep = ";", strip.white=TRUE)

# 2
Lines <- df[[1]]
for(i in 1:3) Lines <- sub("(.*),", "\\1;", Lines)
read.csv(text = Lines, sep = ";", strip.white = TRUE)

# 3
library(gsubfn)
read.pattern(text = df[[1]], pattern = pat, header = TRUE, strip.white = TRUE)

# 4
read.csv(text = sub(pat, '"\\1",\\2,\\3,\\4', df[[1]]), strip.white = TRUE)
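
If the lines come straight from the file rather than the df tibble, the same idea applies, since sub is vectorised over its input (a sketch of alternative 1 against the raw file):

# sketch: apply alternative 1 directly to the raw lines of df.csv
Lines <- readLines("df.csv")
read.csv(text = sub(pat, "\\1;\\2;\\3;\\4", Lines), sep = ";", strip.white = TRUE)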