Home > database >  Importing data using R from SQL Server truncate leading zeros
Importing data using R from SQL Server truncate leading zeros

Time:10-09

I'm trying to import data from a table in SQL Server and then write it into a .txt file. I'm doing it in the following way. However when I do that all numbers having leading 0 s seems to get trimmed.

For example if I have 000124 in the database, it's shown as 124 in the .txt as well as if I check x_1 it's 124 in there as well.

How can I avoid this? I want to keep the leading 0 s in x_1 and also need them in the output .txt file.

library(RODBC)
library(lubridate)
library(data.table)

cn_1 <- odbcConnect('channel_name')
qry <- "
select
    *
from table_name
"
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE)
rm(qry)
setDT(x_1)
fwrite(x=x_1, file=paste0(export_location, "file_name", date_today, ".txt"), sep="|", quote=TRUE, row.names=FALSE, na="")

CodePudding user response:

Let x_1 be the result data.table from your SQL query. Then you can convert numeric columns (e.g. value) to formatted strings using sprintf to get leading zeros:

library(data.table)

x_1 <- data.table(value = c(1,12,123,1234))
x_1
#>    value
#> 1:     1
#> 2:    12
#> 3:   123
#> 4:  1234

x_1$value <- x_1$value |> sprintf(fmt = "d")
x_1
#>    value
#> 1:  0001
#> 2:  0012
#> 3:  0123
#> 4:  1234

Created on 2021-10-08 by the reprex package (v2.0.1)

CodePudding user response:

Assuming that the underlying data in the DBMS is indeed "string"-like ...

RODBC::sqlQuery has the as.is= argument that can prevent it from trying to convert values. The default is FALSE, and when false and not a clear type like "date" or "timestamp", RODBC calls type.convert which will see the number-like field and convert it to integers or numbers.

Try:

x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = TRUE)

and that will stop auto-conversion of all columns.

That is a bit nuclear, to be honest, and will stop conversion of dates/times, and perhaps other columns that should be converted. We can narrow this down; ?sqlQuery says that read.table's documentation on as.is is relevant, and it says:

   as.is: controls conversion of character variables (insofar as they
          are not converted to logical, numeric or complex) to factors,
          if not otherwise specified by 'colClasses'.  Its value is
          either a vector of logicals (values are recycled if
          necessary), or a vector of numeric or character indices which
          specify which columns should not be converted to factors.

so if you know which column (by name or column index) is being unnecessarily converted, then you can include it directly. Perhaps

## by column name
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = "somename")

## or by column index
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = 7)

(Side note: while I use select * ... on occasion as well, the presumption of knowing columns by-number is predicated on know all of the columns included in that table/query. If anything changes, perhaps it's actually a SQL view and somebody updates it ... or if somebody changes the order of columns, than your assumptions of column indices is a little fragile. All of my "production" queries in my internal packages have all columns spelled out, no use of select *. I have been bitten once when I used it, which is why I'm a little defensive about it.)

If you don't know, a hastily-dynamic way (that double-taps the query, unfortunately) could be something like

qry10 <- "
select
    *
from table_name
limit 10"
x_1 <- sqlQuery(channel=cn_1, query=qry10, stringsAsFactors=FALSE, as.is = TRUE)
leadzero <- sapply(x_1, function(z) all(grepl("^0 [1-9]", z)))
x_1 <- sqlQuery(channel=cn_1, query=qry, stringsAsFactors=FALSE, as.is = which(leadzero))

Caveat: I don't use RODBC nor have I set up a temporary database with appropriately-fashioned values, so this untested.

  • Related