How to read a poorly formated data text file in R?


I have a tab-delimited text file (17 columns) that I want to read into R. I'm using the read.table() function:

read.table(file="data.txt", header = TRUE, sep = "\t", quote = "",comment.char="")

The problem is that some of the rows span multiple lines (example below):

10  Macron serait-il plus pro-salafiste que Hamon?!
t.co/g29oOgqih1

#Presidentielle2017 FALSE   0   NA  2017-03-02 13:45:08 FALSE   NA  837297724378726400  NA  <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a> Trader496   0   FALSE   FALSE   NA  NA

Is there any way to read this type of data with each record in a single row, or do I have to use fill = TRUE?

Data File: https://pastebin.com/b90VHvSt

CodePudding user response:

The readr::melt_*() or meltr::melt_*() functions are useful for misformatted data. This can be a very tedious task, so I'll demonstrate some of the functionality and workflow without completely cleaning this data.

This looks like it's tab-separated, so we'll start with melt_tsv():

library(readr)
library(dplyr)
library(tidyr)

data_raw <- melt_tsv("https://pastebin.com/raw/b90VHvSt")

data_raw

#> # A tibble: 281 × 4
#>      row   col data_type value        
#>    <dbl> <dbl> <chr>     <chr>        
#>  1     1     1 character no  text     
#>  2     1     2 character favorited    
#>  3     1     3 character favoriteCount
#>  4     1     4 character replyToSN    
#>  5     1     5 character created      
#>  6     1     6 character truncated    
#>  7     1     7 character replyToSID   
#>  8     1     8 character id           
#>  9     1     9 character replyToUID   
#> 10     1    10 character statusSource 
#> # … with 271 more rows

This reads in the data one token at a time, with information on location and data type. For starters, it looks like the first two column names are separated by spaces instead of a tab, so they were read in as one token. We can fix this, then merge the corrected headers into the rest of the data.

headers_fixed <- data_raw %>% 
  filter(row == 1, col != 1) %>% 
  mutate(col = col + 1) %>% 
  select(col, col_name = value) %>% 
  add_row(col = c(1, 2), col_name = c("no", "text"), .before = 1)

data_raw <- data_raw %>% 
  filter(row != 1) %>% 
  left_join(headers_fixed, by = "col") %>% 
  add_count(row, name = "row_cols")

I also added a count variable showing the number of columns per row. Each row should have 17 columns, so we can use this to filter and pivot the "good" rows.

data_ok <- data_raw %>% 
  filter(row_cols == 17) %>% 
  select(row, col_name, value) %>% 
  pivot_wider(names_from = col_name) %>% 
  type_convert()

data_ok

#> # A tibble: 12 × 18
#>      row    no text  favor…¹ favor…² reply…³ created             trunc…⁴ reply…⁵
#>    <dbl> <dbl> <chr> <lgl>     <dbl> <lgl>   <dttm>              <lgl>   <lgl>  
#>  1     2     1 "RT … FALSE         0 NA      2017-03-02 13:45:34 FALSE   NA     
#>  2     3     2 "Ne … FALSE         0 NA      2017-03-02 13:45:32 FALSE   NA     
#>  3     4     3 "Il … FALSE         0 NA      2017-03-02 13:45:29 FALSE   NA     
#>  4     5     4 "RT … FALSE         0 NA      2017-03-02 13:45:26 FALSE   NA     
#>  5     6     5 "RT … FALSE         0 NA      2017-03-02 13:45:26 FALSE   NA     
#>  6     7     6 "RT … FALSE         0 NA      2017-03-02 13:45:25 FALSE   NA     
#>  7     8     7 "RT … FALSE         0 NA      2017-03-02 13:45:13 FALSE   NA     
#>  8     9     8 "#Pr… FALSE         0 NA      2017-03-02 13:45:10 FALSE   NA     
#>  9    10     9 "#Pr… FALSE         0 NA      2017-03-02 13:45:10 FALSE   NA     
#> 10    16    11 "RT … FALSE         0 NA      2017-03-02 13:44:58 FALSE   NA     
#> 11    21    13 "RT … FALSE         0 NA      2017-03-02 13:44:46 FALSE   NA     
#> 12    26    15 "Dim… FALSE         0 NA      2017-03-02 13:44:41 FALSE   NA     
#> # … with 9 more variables: id <dbl>, replyToUID <lgl>, statusSource <chr>,
#> #   screenName <chr>, retweetCount <dbl>, isRetweet <lgl>, retweeted <lgl>,
#> #   longitude <lgl>, latitude <lgl>, and abbreviated variable names ¹​favorited,
#> #   ²​favoriteCount, ³​replyToSN, ⁴​truncated, ⁵​replyToSID

This leaves us with 61 values in 13 "bad" rows. Diagnosing and fixing these will take more work, which is left as an exercise for the reader.

data_bad <- data_raw %>% 
  filter(row_cols != 17) 

data_bad

#> # A tibble: 61 × 6
#>      row   col data_type value                                   col_n…¹ row_c…²
#>    <dbl> <dbl> <chr>     <chr>                                   <chr>     <int>
#>  1    11     1 integer   10                                      no            2
#>  2    11     2 character Macron serait-il plus pro-salafiste qu… text          2
#>  3    12     1 character <url shortener rmvd>                    no            1
#>  4    13     1 missing   <NA>                                    no            1
#>  5    14     1 missing   <NA>                                    no            1
#>  6    15     1 character #Presidentielle2017                     no           16
#>  7    15     2 logical   FALSE                                   text         16
#>  8    15     3 integer   0                                       favori…      16
#>  9    15     4 missing   <NA>                                    favori…      16
#> 10    15     5 datetime  2017-03-02 13:45:08                     replyT…      16
#> # … with 51 more rows, and abbreviated variable names ¹​col_name, ²​row_cols
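As a rough sketch of one repair strategy (not a complete fix): treat a physical row as the start of a new logical row only when its first token looks like a numeric id (the "no" column), then group continuation lines with their parent row before re-assembling the text field. The `tok` data frame below just imitates a slice of the melted output above, with abbreviated values.

```r
# Imitation of melt_tsv() output for the spilled-over tweet (physical
# rows 11-15); values are abbreviated, structure as in data_raw above
tok <- data.frame(
  row   = c(11, 11, 12, 15, 15, 15),
  col   = c( 1,  2,  1,  1,  2,  3),
  value = c("10", "Macron serait-il ...", "t.co/g29oOgqih1",
            "#Presidentielle2017", "FALSE", "0"),
  stringsAsFactors = FALSE
)

rows <- unique(tok$row)
# heuristic: a physical row starts a new logical row iff its first token
# is a bare integer; this breaks if a tweet line itself starts with a
# number, so treat it as a starting point only
is_start <- vapply(rows, function(r) {
  grepl("^[0-9]+$", tok$value[tok$row == r & tok$col == 1][1])
}, logical(1))
tok$logical_row <- cumsum(is_start)[match(tok$row, rows)]
tok$logical_row
#> [1] 1 1 1 1 1 1
```

All three physical rows collapse into one logical row; from there the stray text fragments can be pasted onto the text field and the remaining tokens re-numbered before pivoting as above.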

Created on 2022-11-09 with reprex v2.0.2

CodePudding user response:

This works on your data sample:

data <- readLines("b90VHvSt.txt")
# collapse the whole file into one string; row boundaries become spaces
data <- paste(data, collapse = " ")
# after 15 "field<tab>" units plus the space-free tail of the row, the
# next space is a row boundary: turn it back into a tab
data <- gsub("(([^\\t]*\\t){15}[^ ]+) ", "\\1\t", data, perl = TRUE)
data <- unlist(strsplit(data, "\t"))
# patch one field missing from the raw file (position found by inspection)
data <- append(data, "?", 9)
# reshape: each column of this matrix is one row of the file
data <- matrix(data, nrow = 17)
data <- setNames(as.data.frame(t(data[, -1])), data[, 1])
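To see what the gsub() step is doing, here is the same trick scaled down to 3 columns (the data are invented for illustration). Joining the lines with spaces puts a space wherever a row boundary used to be; the regex then turns the space that follows a complete set of fields back into a tab.

```r
# toy file: a header plus two data rows, the first with its text field
# spilled onto a second line
lines <- c("a\tb\tc",
           "1\tfoo",
           "bar\t10",
           "2\tbaz\t20")
joined <- paste(lines, collapse = " ")
# after 2 "field<tab>" units and one more space-free field, the next
# space is a row boundary; {2} here because the toy has 3 columns
joined <- gsub("(([^\t]*\t){2}[^ ]+) ", "\\1\t", joined, perl = TRUE)
fields <- unlist(strsplit(joined, "\t"))
m  <- matrix(fields, nrow = 3)   # one column of m per row of the file
df <- setNames(as.data.frame(t(m[, -1]), stringsAsFactors = FALSE), m[, 1])
df
#>   a       b  c
#> 1 1 foo bar 10
#> 2 2     baz 20
```

Note that the intra-field space in "foo bar" survives, because the regex only rewrites a space that comes after a full row's worth of tab-separated fields.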
