I would like to read the a CSV into R that is quoted, comma-separated (i.e. sep = ","
not sep = ";"
as read.csv2
defaults to) but that
- uses the comma inside fields as the decimal separator
- contains periods to separate each group of three digits from the right
An example of a problematic entry is "3.051,00"
in the final line of the excerpt from the CSV shown.
I tried
dat <- read.csv2("path_to_csv.csv", sep = ",", stringsAsFactors = FALSE)
and a variant using read.csv
(both are identical except for their defaults as noted in Difference between read.csv() and read.csv2() in R. Both return improperly-formatted data.frame
s (e.g. containing 3.051,00
).
Can I read this comma-separated file in directly with read.table
without having to perform text-preprocessing?
Excerpt of CSV
praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"
Note: I am aware of the question European and American decimal format for thousands, which is not sufficient. This user preprocesses the file they want to read in whereas I would like a direct means of reading a CSV of the kind shown into R.
CodePudding user response:
Most of it is resolved with dec=","
,
# saved your data to 'file.csv'
out <- read.csv("file.csv", dec=",")
head(out)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38 pjnk
# 1 26.4 13.2 8.82 18.3 58.80 6.61 17.0 414,00 17.00 44.9 40.00
# 2 35.9 16.5 12.30 16.8 8.13 18.60 32.5 352,00 3.37 16.5 61.50
# 3 59.4 44.1 14.60 10.2 13.00 14.90 32.5 403,00 11.40 31.9 19.50
# 4 62.1 51.9 13.60 30.2 10.60 14.30 37.9 692,00 6.49 25.0 91.40
# 5 75.0 33.4 1.00 31.6 1.00 19.80 27.6 505,00 18.60 31.1 7.64
# 6 20.4 15.1 7.99 101.0 35.90 9.14 22.9 400,00 11.70 22.7 6.85
Only one column is string:
sapply(out, class)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38
# "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "character" "numeric" "numeric"
# pjnk
# "numeric"
This can be resolved post-read with:
ischr <- sapply(out, is.character)
out[ischr] <- lapply(out[ischr], function(z) as.numeric(gsub(" ", "", chartr(",.", ". ", z))))
out$PKA
# [1] 414 352 403 692 505 400 956 1407 207 3051
If you'd rather read it in without post-processing, you can pipe(.)
it, assuming you have sed
available[^1]:
out <- read.csv(pipe("sed -E 's/([0-9])[.]([0-9])/\\1\\2/g;s/([0-9]),([0-9])/\\1.\\2/g' < file.csv"))
Notes:
sed
is generally available on all linux/macos systems, and on windows computers it is included within Rtools.
CodePudding user response:
Like r2evans's comment says, dec = ","
takes care of the cases without thousands separators. Then use lapply/gsub
to process the other cases, which are still of class "character"
.
txt <- '
praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"
'
df1 <- read.csv(textConnection(txt), dec = ",")
i <- sapply(df1, is.character)
df1[i] <- lapply(df1[i], \(x) gsub("\\.", "", x))
df1[i] <- lapply(df1[i], \(x) as.numeric(sub(",", ".", x)))
df1
#> praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38 pjnk
#> 1 26.4 13.2 8.82 18.3 58.80 6.61 17.0 414 17.00 44.9 40.00
#> 2 35.9 16.5 12.30 16.8 8.13 18.60 32.5 352 3.37 16.5 61.50
#> 3 59.4 44.1 14.60 10.2 13.00 14.90 32.5 403 11.40 31.9 19.50
#> 4 62.1 51.9 13.60 30.2 10.60 14.30 37.9 692 6.49 25.0 91.40
#> 5 75.0 33.4 1.00 31.6 1.00 19.80 27.6 505 18.60 31.1 7.64
#> 6 20.4 15.1 7.99 101.0 35.90 9.14 22.9 400 11.70 22.7 6.85
#> 7 47.8 19.6 17.50 33.1 82.00 17.90 35.2 956 22.50 43.3 20.00
#> 8 59.9 53.3 11.80 77.7 12.90 11.10 37.9 1407 18.80 29.4 16.80
#> 9 46.6 27.1 12.40 109.0 21.90 21.50 38.2 207 11.00 31.3 12.00
#> 10 51.9 21.3 49.10 58.8 10.80 58.80 200.0 3051 15.30 39.2 15.70
Created on 2022-02-07 by the reprex package (v2.0.1)