I am trying to read in a .csv file that has, for example, a column like this:
These values represent thousands of hours, not two or three hours and so on. When I try to change the reading options through
read.csv(file, sep = ";", dec = ".")
nothing changes. It doesn't matter whether I set dec = "." or dec = ",": it always keeps the numbers above.
CodePudding user response:
You can use the following code:
library(readr)
df <- read_csv('data.csv', locale = locale(grouping_mark = "."))
df
Output:
# A tibble: 4 × 1
  `X-ray`
    <dbl>
1    2771
2    3783
3    1267
4    7798
As you can see, the values are now read as thousands.
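To see what the grouping mark changes at the level of single values, here is a minimal sketch. The input strings are an assumption, reconstructed from the output shown above; parse_number() and locale() are regular readr functions.

```r
library(readr)

# Assumed raw values, inferred from the output above
x <- c("2.771", "3.783", "1.267", "7.798")

# Default locale: "." is the decimal mark, so the values stay fractional
default_parse <- parse_number(x)

# With "." declared as the grouping mark, readr switches the decimal mark to ","
grouped_parse <- parse_number(x, locale = locale(grouping_mark = "."))

default_parse
grouped_parse
```

The second call should give the whole numbers 2771, 3783, 1267 and 7798, while the first keeps 2.771 and so on.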
CodePudding user response:
An elegant way (in my opinion) is to create a new class, which you then use in the reading process.
This way, you stay flexible when your data is (really) messed up and the decimal/thousands separator is not the same across all (numeric) columns.
# Define a new class of numbers
setClass("newNumbers")
# Define substitution of dots to nothing
setAs("character", "newNumbers", function(from) as.numeric(gsub("\\.", "", from)))
# Now read
str(data.table::fread( "test \n 1.235 \n 1.265", colClasses = "newNumbers"))
# Classes ‘data.table’ and 'data.frame': 2 obs. of 1 variable:
# $ test: num 1235 1265
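The same idea should also work with base read.csv, since colClasses accepts any class that has an as() method from "character". Below is a rough sketch of the mixed-separator case mentioned above. The class name dotGroupedNumber, the column names and the sample data are all made up for illustration.

```r
library(methods)

# Hypothetical class name, used only to hook a converter into colClasses
setClass("dotGroupedNumber")
setAs("character", "dotGroupedNumber",
      function(from) as.numeric(gsub(".", "", from, fixed = TRUE)))

# Made-up sample: "hours" uses "." to group thousands, "rate" uses it as a decimal mark
csv_text <- "hours;rate\n2.771;1.5\n3.783;2.25"

df <- read.csv(text = csv_text, sep = ";",
               colClasses = c(hours = "dotGroupedNumber", rate = "numeric"))
str(df)
```

Only the hours column goes through the custom converter; rate is read as an ordinary numeric column.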
CodePudding user response:
The solution proposed by Quinten will work; however, it is worth adding that the function designed to process numbers with a grouping mark is col_number.
with(asNamespace("readr"),
     read_delim(
       I("X-ray hours\n---\n2.771\n3.778\n3,21\n"),
       delim = ";",
       col_names = c("x_ray_hours"),
       col_types = cols(x_ray_hours = col_number()),
       na = c("---"),
       skip = 1
     ))
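The locale does not have to be changed for the whole file just to handle this case: a locale setting applies to all of the data, whereas the intention here is to handle only that specific column. Note, however, that col_number() uses the grouping mark of the locale in effect (by default ",", with "." as the decimal mark). From the docs: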
?readr::parse_number
This drops any non-numeric characters before or after the first number.
Also, if the columns use ; as a separator, read_delim is more appropriate.
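If the grouping mark should apply to one column only, one possible approach (a sketch, not the only way) is to read everything as text with read_delim and then convert just that column with parse_number and its own locale. The column names and sample data below are made up.

```r
library(readr)

# Made-up sample: ";" separates the fields, "." groups thousands in "hours"
raw <- read_delim(
  I("hours;label\n2.771;a\n3.783;b\n"),
  delim = ";",
  col_types = cols(.default = col_character())
)

# Convert only this column, treating "." as the grouping mark
raw$hours <- parse_number(raw$hours, locale = locale(grouping_mark = "."))
raw
```

The other columns stay untouched, so a column that really uses "." as a decimal mark can be parsed separately with the default locale.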