I have a data text file(17 columns) that i want to read in R. I'm using the read.table() function.
read.table(file="data.txt", header = TRUE, sep = "\t", quote = "",comment.char="")
The problem is that some of the rows take multiple lines(example below)
10 Macron serait-il plus pro-salafiste que Hamon?!
t.co/g29oOgqih1
#Presidentielle2017 FALSE 0 NA 2017-03-02 13:45:08 FALSE NA 837297724378726400 NA <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a> Trader496 0 FALSE FALSE NA NA
Is there any way to read this type of data in a single row or do i have to use fill=TRUE
Data File: https://pastebin.com/b90VHvSt
CodePudding user response:
The readr::melt_*()
or meltr::melt_*()
functions are useful for misformatted data. This can be a very tedious task, so I'll demonstrate some of the functionality and workflow without completely cleaning this data.
This looks like it's tab-separated, so we'll start with melt_tsv()
:
library(readr)
library(dplyr)
library(tidyr)
data_raw <- melt_tsv("https://pastebin.com/raw/b90VHvSt")
data_raw
#> # A tibble: 281 × 4
#> row col data_type value
#> <dbl> <dbl> <chr> <chr>
#> 1 1 1 character no text
#> 2 1 2 character favorited
#> 3 1 3 character favoriteCount
#> 4 1 4 character replyToSN
#> 5 1 5 character created
#> 6 1 6 character truncated
#> 7 1 7 character replyToSID
#> 8 1 8 character id
#> 9 1 9 character replyToUID
#> 10 1 10 character statusSource
#> # … with 271 more rows
This reads in the data one token at a time, with information on location and data type. For starters, it looks like the first two column names are separated by spaces instead of a tab, so were read in as one token. We can fix this, then merge in the corrected headers to the rest of the data.
headers_fixed <- data_raw %>%
filter(row == 1, col != 1) %>%
mutate(col = col 1) %>%
select(col, col_name = value) %>%
add_row(col = c(1, 2), col_name = c("no", "text"), .before = 1)
data_raw <- data_raw %>%
filter(row != 1) %>%
left_join(headers_fixed) %>%
add_count(row, name = "row_cols")
I also added a count variable showing the number of columns per row. Each row should have 17 columns, so we can use this to filter and pivot the "good" rows.
data_ok <- data_raw %>%
filter(row_cols == 17) %>%
select(row, col_name, value) %>%
pivot_wider(names_from = col_name) %>%
type_convert()
data_ok
#> # A tibble: 12 × 18
#> row no text favor…¹ favor…² reply…³ created trunc…⁴ reply…⁵
#> <dbl> <dbl> <chr> <lgl> <dbl> <lgl> <dttm> <lgl> <lgl>
#> 1 2 1 "RT … FALSE 0 NA 2017-03-02 13:45:34 FALSE NA
#> 2 3 2 "Ne … FALSE 0 NA 2017-03-02 13:45:32 FALSE NA
#> 3 4 3 "Il … FALSE 0 NA 2017-03-02 13:45:29 FALSE NA
#> 4 5 4 "RT … FALSE 0 NA 2017-03-02 13:45:26 FALSE NA
#> 5 6 5 "RT … FALSE 0 NA 2017-03-02 13:45:26 FALSE NA
#> 6 7 6 "RT … FALSE 0 NA 2017-03-02 13:45:25 FALSE NA
#> 7 8 7 "RT … FALSE 0 NA 2017-03-02 13:45:13 FALSE NA
#> 8 9 8 "#Pr… FALSE 0 NA 2017-03-02 13:45:10 FALSE NA
#> 9 10 9 "#Pr… FALSE 0 NA 2017-03-02 13:45:10 FALSE NA
#> 10 16 11 "RT … FALSE 0 NA 2017-03-02 13:44:58 FALSE NA
#> 11 21 13 "RT … FALSE 0 NA 2017-03-02 13:44:46 FALSE NA
#> 12 26 15 "Dim… FALSE 0 NA 2017-03-02 13:44:41 FALSE NA
#> # … with 9 more variables: id <dbl>, replyToUID <lgl>, statusSource <chr>,
#> # screenName <chr>, retweetCount <dbl>, isRetweet <lgl>, retweeted <lgl>,
#> # longitude <lgl>, latitude <lgl>, and abbreviated variable names ¹favorited,
#> # ²favoriteCount, ³replyToSN, ⁴truncated, ⁵replyToSID
This leaves us with 61 values in 13 "bad" rows. Diagnosing and fixing these will take more work, which is left as an exercise for the reader.
data_bad <- data_raw %>%
filter(row_cols != 17)
data_bad
#> # A tibble: 61 × 6
#> row col data_type value col_n…¹ row_c…²
#> <dbl> <dbl> <chr> <chr> <chr> <int>
#> 1 11 1 integer 10 no 2
#> 2 11 2 character Macron serait-il plus pro-salafiste qu… text 2
#> 3 12 1 character <url shortener rmvd> no 1
#> 4 13 1 missing <NA> no 1
#> 5 14 1 missing <NA> no 1
#> 6 15 1 character #Presidentielle2017 no 16
#> 7 15 2 logical FALSE text 16
#> 8 15 3 integer 0 favori… 16
#> 9 15 4 missing <NA> favori… 16
#> 10 15 5 datetime 2017-03-02 13:45:08 replyT… 16
#> # … with 51 more rows, and abbreviated variable names ¹col_name, ²row_cols
Created on 2022-11-09 with reprex v2.0.2
CodePudding user response:
This work in your data sample:
data <- readLines("b90VHvSt.txt")
data <- paste(data, collapse = " ")
data <- gsub("(([^\\t]*\\t){15}[^ ] ) ", "\\1\t", data, perl = T)
data <- unlist(strsplit(data, "\t"))
data <- append(data, "?", 9)
data <- matrix(data, nrow = 17)
data <- as.data.frame(t(data[,-1]), row.names = data[,1] ))