Reading .txt files with no separators between Columns (column's names) and Rows (column's

Time:01-19

I have several .txt files that I would like to read and then rbind in R. I expect each .txt file to produce 1 row and 115 columns. First problem: I get the following warning message: “incomplete final line found by readTableHeader on…”. I have many files, so I can’t open each one, go to its last line and press Enter. Some solutions I found on the Internet didn’t work because of the second problem below.

Second problem: the column names and the column contents (rows) have no separator between them. The .txt files look like this: "DIARREIA":1,"DISPNEIA":2, where "DIARREIA" and "DISPNEIA" are column names and 1 and 2 are the column contents. A colon (:) sits between each column name and its content.

Here is my code; 2 example files are available at https://drive.google.com/drive/folders/16U8J12Ld7PI5DI-ph_2QCysTxFGKZ-QP?usp=share_link.

```
setwd("C:/User/BOX")
unzip("C:/User/BOX/data.zip")
list.files()
temp = list.files(pattern = "*.txt")
df = do.call("rbind", lapply(temp, function(x) read.table(x, stringsAsFactors = T, header = TRUE)))
```

Any help, please? Thanks in advance!

CodePudding user response:

Hello Baptista: install jsonlite if you haven't already, then try this:

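# install jsonlite only if it is not available yet
if (!requireNamespace("jsonlite", quietly = TRUE)) install.packages("jsonlite")

setwd("C:/User/BOX")
unzip("C:/User/BOX/data.zip")
# "\\.txt$" is a regular expression matching file names that end in .txt
temp <- list.files(pattern = "\\.txt$")
# each file is parsed as JSON into a named list; rbind stacks them one row per file
df <- do.call("rbind", lapply(temp, jsonlite::read_json))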
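
One caveat: read_json returns a list for each file, so rbind gives a matrix of lists rather than a plain data frame. Here is a minimal sketch of one way around that. It assumes each file holds a single flat JSON object (braces included) and that JSON null should become NA. The helper name `read_one` and the demo file contents are made up for illustration.

```r
library(jsonlite)

# Hypothetical helper: parse one file holding a single flat JSON object
# into a one-row data frame.
read_one <- function(path) {
  # warn = FALSE silences the "incomplete final line" warning from readLines
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  rec <- fromJSON(txt)
  # JSON null arrives as NULL; turn it into NA so the column is kept
  rec <- lapply(rec, function(v) if (is.null(v) || length(v) == 0) NA else v)
  # collapse any multi-valued field into one string to keep one row per file
  rec <- lapply(rec, function(v) if (length(v) > 1) paste(v, collapse = ", ") else v)
  as.data.frame(rec, stringsAsFactors = FALSE)
}

# Self-contained demo with two temporary files (made-up content)
demo_dir <- tempfile("demo")
dir.create(demo_dir)
writeLines('{"DIARREIA":1,"DISPNEIA":2,"OUTROS":"Febre"}',
           file.path(demo_dir, "subject1.txt"))
# cat() writes no trailing newline, mimicking the "incomplete final line" files
cat('{"DIARREIA":2,"DISPNEIA":1,"OUTROS":null}',
    file = file.path(demo_dir, "subject2.txt"))

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
df <- do.call(rbind, lapply(files, read_one))
print(df)
```

If the real files contain nested objects or differ in their set of keys, this rbind will not work as written and the helper would need adjusting.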

CodePudding user response:

You've found yourself some Debian Control File medical records. See ?read.dcf for the explanation of a properly formed .dcf file. You can get this result:

subject1_2_4
  subject PERDADEPALADAR1 PERDADEPALADAR ALTOFLUXOCATETERNASAL
1       1    false, false              1                 false
2       2              NA              2                 false
  INSUFICINCIARENAL1           DATADEALTADAUTI          DATADEADMISSOUTI
1              false                                                    
2              false 9\\/17\\/2020 12:00:00 AM 9\\/12\\/2020 12:00:00 AM
  IMUNOMODULADORQUAIS                DATADAALTA SITUAODOCASODESRAG DIARREIA
1                     10\\/6\\/2020 12:00:00 AM                  0        1
2                     9\\/19\\/2020 12:00:00 AM                  1        2
  DESFECHODOPARTO CLOROQUINAHIDROXICLOROQUINA LINFOCITOPENIA1
1              -1                       false           false
2              -1                       false           false
  OUTROSSINTOMASPERSISTENTES   PO2 DISPNEIA OXIGENOTERAPIA
1                  Ansiedade false        2           true
2                            false        1           true
  INSUFICINCIARESPIRATRIA PROFISSIONALDESADE TRIGLICRIDES FERRITINA1
1                       0                  2        false      false
2                       1                  0        false      false
               DATAADMISSAO TOSSE1 DOENAHEMATOLGICACRNICA DDIMERO1 PARTO
1 9\\/24\\/2020 12:00:00 AM  false                  false    false     0
2 9\\/16\\/2020 12:00:00 AM  false                  false     true     0
  COINFECOES SNDROMEDEDOWN PERDADEOLFATO DIABETESMELLITUS RENDAFAMILIAR
1          1         false             1             true              
2          1         false             2            false              
  SATURAOO2 VENTILAOMECNICAINVASIVA DDIMERO
1        96                   false   false
2        96                   false    true
                                 ANTIBITICOSQUAISETEMPODEUSO
1        Ceftriaxona 2g 24\\/24h 3d\nTazocin 4.5mg 6\\/6h 7d
2 Azitromicina 500mg 24\\/24h 5d\nCeftriaxona 1g 24\\/24h 7d
  TRABALHODEPARTOPREMATURO VENTILAOMECNICAEMPOSIOPRONA OUTRASCAUSASDEADMISSOUTI
1                        0                       false                         
2                        0                       false                         
  OUTRASSEQUELAS DATARESULTADOCONFIRMATRIOPARACOVID TOSSE DOENCAHEPTICACRNICA
1                          8\\/1\\/2020 12:00:00 AM     2               false
2                         9\\/17\\/2020 12:00:00 AM     1               false
  PROTENACREATIVA1 ARTRALGIADORNASARTICULAES ENCAMINHAMETODEOUTROSERVIO  ASMA
1            false                     false                          2 false
2            false                     false                          2 false
  TRIMESTREDEGESTACAO  PO21 INSUFICINCIARESPIRATRIA1 TIPODEPARTO OBESIDADE
1                     false                    false          -1     false
2                     false                     true          -1     false
  FRAQUEZA                OUTROS VOMITO DHLLDL1 IVERMECTINA
1    false         Febre\ncoriza      1   false       false
2    false Piora do quadro geral      2   false       false
  DIAGNSTICOCLNICOINICIAL ADMISSOUTI ALTOFLUXOMASCARA VITAMINAC FADIGA
1       Pneumonia e COVID          2            false     false      2
2       Pneumonia e COVID          1             true     false      2
  PROTENACREATIVA VITAMINAD       QUAISCOINFECES IMUNODEFICINCIA COCLHICINA
1           false     false            Pneumonia           false      false
2           false     false Pneumonia bacteriana           false      false
  ONDEFOIREALIZADOOPRIMEIROATENDIMENTODOPACIENTE
1                                              6
2                                              6
  ANTICOAGULANTEQUAISETEMPODEUSO1 CONTATODE FALNCIADERGOS SEPSE PERDADEOLFATO1
1       Clexane 40mg 24\\/24h 12d         0         false     0          false
2        Clexane 40mg 24\\/24h 7d         1         false     0          false
  INSUFICINCIARENAL EXPOSICAO DORABDOMINAL CHOQUE TCNAINTERNAO
1                 0        -1            2  false            0
2                 0        -1            2  false            2
  DESCONFORTORESPIRATRIO DHLLDL ANTIVIRAISQUAISETEMPODEUSO NITAXOZANIDA
1                      2  false                                   false
2                      2  false                                   false
                       DATA SEPSE1 DOENANEUROLGICACRNICA ZINCO PACIENTEGESTANTE
1 8\\/27\\/2022 12:00:00 AM  false                 false false                0
2 8\\/26\\/2022 12:00:00 AM  false                 false false                0
  OUTROSSINAISDEGRAVIDADE TIPODEEXAME DOENCACARDIOVASCULARCRNICA
1                                   0                       true
2                                   0                      false
  PARALISIADEDOENTECRTICO DOENARENALCRNICA1 TEMPERATURA
1                   false             false       36\n9
2                   false             false       36\n5
  FATORESDERISCOPARAGRAVIDADEEMGESTANTE INSUFICINCIACARDACA TRIGLICRIDES1
1                                    -1               false         false
2                                    -1               false         false
  FALTADEAR AMNSIAESQUECIMENTO   CORTICOIDESQUAISETEMPODEUSO LINFOCITOPENIA
1     false              false Dexametasona 6mg 24\\/24h 10d          false
2     false              false  Dexametasona 6mg 24\\/24h 7d          false
  OUTRAPNEUMOPATIACRNICA DORDEGARGANTA DESFECHOCLNICODOPACIENTE FIBROSEPULMONAR
1                  false             2                        1           false
2                  false             2                        1           false
  BAIXOFLUXOCATETERNASAL RACA MIALGIADORNOCORPO DOENARENALCRNICA FERRITINA SEXO
1                   true   -1             false            false     false    0
2                   true   -1             false            false     false    0
  PARADACARDIORRESPIRATRIA MIALGIA PURPERA ESPECTROCLNICOADMISSO TROMBOSE
1                    false       2   false                     1    false
2                    false       2   false                     1    false
  ENDERECOTIPO
1            0
2            0
> 

But there is a certain amount of mucking around to do first. It can be done in R, though it is likely easier in a text editor. With the .dcf rules in mind, we might try the following (having already copied and pasted subject1 and subject2 into one text file):

# each step works on the result of the previous one
subject1_2_step1 <- gsub('\\{', '', subject1_2)
subject1_2_step2 <- gsub('\\}', '', subject1_2_step1)
subject1_2_step3 <- gsub(',', '\n', subject1_2_step2)
subject1_2_step4_dcf <- read.dcf(textConnection(subject1_2_step3), all = TRUE)
Error in read.dcf(textConnection(subject1_2_step3), all = TRUE) : 
  Invalid DCF format.
Regular lines must have a tag.
Offending lines start with:
  list(c("false
  9\"
  "false
  5\"
  ))
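
The rules behind that error are short: a regular line is `tag: value`, a continuation line starts with whitespace, and a blank line separates records. Here is a minimal sketch of a well-formed input, using made-up values rather than the actual files, with the subject tag typed in by hand:

```r
# Made-up, well-formed DCF text: two records separated by a blank line.
# The line " coriza" starts with a space, so it continues the OUTROS value.
dcf_text <- c(
  "subject: 1",
  "TOSSE: 2",
  "OUTROS: Febre",
  " coriza",
  "",
  "subject: 2",
  "TOSSE: 1",
  "OUTROS: Piora do quadro geral"
)

con <- textConnection(dcf_text)
m <- read.dcf(con)  # character matrix: one row per record, one column per tag
close(con)

print(dim(m))
print(colnames(m))
```

Removing the leading space from the " coriza" line should reproduce the "Regular lines must have a tag" error shown above.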

It is easier to see in a text editor that these (9 and 5) are continuations of the prior tag:value pair and should have a space before them. Judging by the TEMPERATURA column in the output above (36\n9 and 36\n5), they look like the decimal parts of values such as 36,9 that the comma replacement split in two. You could find them with a regex and insert the spaces, but in the end you still wouldn't have subject:1 or subject:2 as seen above, because those aren't in the records; they're the file names. The same could likely be said for jsonlite. I also replaced all '"' with '' for easier column name reading.
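
The regex route can be sketched in base R. This is not the method used for the output above. It assumes every key is a double-quoted string directly followed by a colon, and that no value contains that same `,"KEY":` pattern. The record text and the file name are made up.

```r
# Made-up record in the same shape as the files; note the commas inside values
rec <- '{"TOSSE":1,"TEMPERATURA":"36,9","OUTROS":"Febre, coriza","SEXO":0}'

# strip the outer braces
body <- gsub('^\\s*\\{|\\}\\s*$', '', rec)

# split only on commas that are directly followed by a quoted key and a colon,
# so commas inside values such as 36,9 are left alone
fields <- strsplit(body, ',(?="[^"]+":)', perl = TRUE)[[1]]

# key = text before the first colon; value = text after the quoted key
key   <- gsub('"', '', sub(':.*$', '', fields))
value <- gsub('"', '', sub('^"[^"]+":', '', fields))

row <- as.data.frame(as.list(setNames(value, key)), stringsAsFactors = FALSE)

# the subject id is not in the record, so take it from the (hypothetical) file name
row$subject <- tools::file_path_sans_ext(basename("C:/User/BOX/subject1.txt"))

# a decimal-comma value can be converted afterwards
row$TEMPERATURA_num <- as.numeric(sub(",", ".", row$TEMPERATURA, fixed = TRUE))
print(row)
```

All values come back as character, so numeric and logical columns would still need converting once the records are bound together.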
