Read in CSV in mixed English and French number format-CodePudding

I would like to read the a CSV into R that is quoted, comma-separated (i.e. sep = "," not sep = ";" as read.csv2 defaults to) but that

uses the comma inside fields as the decimal separator
contains periods to separate each group of three digits from the right

An example of a problematic entry is "3.051,00" in the final line of the excerpt from the CSV shown.

I tried

dat <- read.csv2("path_to_csv.csv", sep = ",", stringsAsFactors = FALSE)

and a variant using read.csv (both are identical except for their defaults as noted in Difference between read.csv() and read.csv2() in R. Both return improperly-formatted data.frames (e.g. containing 3.051,00).

Can I read this comma-separated file in directly with read.table without having to perform text-preprocessing?

Excerpt of CSV

praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"

Note: I am aware of the question European and American decimal format for thousands, which is not sufficient. This user preprocesses the file they want to read in whereas I would like a direct means of reading a CSV of the kind shown into R.

CodePudding user response：

Most of it is resolved with dec=",",

# saved your data to 'file.csv'
out <- read.csv("file.csv", dec=",")
head(out)
#   praf pmek  plcg  PIP2  PIP3 p44.42 pakts473    PKA   PKC  P38  pjnk
# 1 26.4 13.2  8.82  18.3 58.80   6.61     17.0 414,00 17.00 44.9 40.00
# 2 35.9 16.5 12.30  16.8  8.13  18.60     32.5 352,00  3.37 16.5 61.50
# 3 59.4 44.1 14.60  10.2 13.00  14.90     32.5 403,00 11.40 31.9 19.50
# 4 62.1 51.9 13.60  30.2 10.60  14.30     37.9 692,00  6.49 25.0 91.40
# 5 75.0 33.4  1.00  31.6  1.00  19.80     27.6 505,00 18.60 31.1  7.64
# 6 20.4 15.1  7.99 101.0 35.90   9.14     22.9 400,00 11.70 22.7  6.85

Only one column is string:

sapply(out, class)
#        praf        pmek        plcg        PIP2        PIP3      p44.42    pakts473         PKA         PKC         P38 
#   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" "character"   "numeric"   "numeric" 
#        pjnk 
#   "numeric"

This can be resolved post-read with:

ischr <- sapply(out, is.character)
out[ischr] <- lapply(out[ischr], function(z) as.numeric(gsub(" ", "", chartr(",.", ". ", z))))
out$PKA
#  [1]  414  352  403  692  505  400  956 1407  207 3051

If you'd rather read it in without post-processing, you can pipe(.) it, assuming you have sed available[^1]:

out <- read.csv(pipe("sed -E 's/([0-9])[.]([0-9])/\\1\\2/g;s/([0-9]),([0-9])/\\1.\\2/g' < file.csv"))

Notes:

sed is generally available on all linux/macos systems, and on windows computers it is included within Rtools.

CodePudding user response：

Like r2evans's comment says, dec = "," takes care of the cases without thousands separators. Then use lapply/gsub to process the other cases, which are still of class "character".

txt <- '
praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"
'

df1 <- read.csv(textConnection(txt), dec = ",")

i <- sapply(df1, is.character)
df1[i] <- lapply(df1[i], \(x) gsub("\\.", "", x))
df1[i] <- lapply(df1[i], \(x) as.numeric(sub(",", ".", x)))
df1
#>    praf pmek  plcg  PIP2  PIP3 p44.42 pakts473  PKA   PKC  P38  pjnk
#> 1  26.4 13.2  8.82  18.3 58.80   6.61     17.0  414 17.00 44.9 40.00
#> 2  35.9 16.5 12.30  16.8  8.13  18.60     32.5  352  3.37 16.5 61.50
#> 3  59.4 44.1 14.60  10.2 13.00  14.90     32.5  403 11.40 31.9 19.50
#> 4  62.1 51.9 13.60  30.2 10.60  14.30     37.9  692  6.49 25.0 91.40
#> 5  75.0 33.4  1.00  31.6  1.00  19.80     27.6  505 18.60 31.1  7.64
#> 6  20.4 15.1  7.99 101.0 35.90   9.14     22.9  400 11.70 22.7  6.85
#> 7  47.8 19.6 17.50  33.1 82.00  17.90     35.2  956 22.50 43.3 20.00
#> 8  59.9 53.3 11.80  77.7 12.90  11.10     37.9 1407 18.80 29.4 16.80
#> 9  46.6 27.1 12.40 109.0 21.90  21.50     38.2  207 11.00 31.3 12.00
#> 10 51.9 21.3 49.10  58.8 10.80  58.80    200.0 3051 15.30 39.2 15.70

^{Created on 2022-02-07 by the reprex package (v2.0.1)}