How to import a text file into R as a data frame?

Time:03-05

I have an enormous text file that contains one long string that I'm trying to import into R as a data frame.

The text file containing the data is from

https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/new.data

Essentially, the text file is one long string of values separated by spaces. Could I convert all the spaces into commas in R so that I could then use read_csv?

I tried importing it as a TSV, but that didn't work. read_delim didn't work either because of how the text file is formatted.

Does anyone have any leads on how I could import this text file into a simple data frame? The first two records from the text file are shown below, one per line; each record ends with the token "name".

1 15943882 63 1 -9 -9 -9 -27 1 145 1 233 -9 50 20 1 0 1 2 2 3 1981 0 0 0 0 0 1 10.5 6 13 150 60 190 90 145 85 0 0 2.3 3 -9 -9 0 -9 -9 -9 -9 -9 -9 6 -9 -9 -9 2 16 1981 0 1 1 1 -9 1 -9 1 -9 1 1 1 1 1 1 1 -9 -9 0 -9 -9 -9 -9 -9 -9 -9 -9 -9 0 0 0 0 name
2 15964847 67 1 -9 -9 -9 -27 4 160 1 286 -9 40 40 0 0 1 2 3 5 1981 0 1 0 0 0 1 9.5 6 13 108 64 160 90 160 90 1 0 1.5 2 -9 -9 3 -9 -9 -9 -9 -9 -9 3 -9 -9 -9 2 5 1981 2 1 2 2 -9 2 -9 1 -9 1 1 1 1 1 1 1 -9 -9 0 -9 -9 -9 -9 -9 -9 -9 -9 -9 0 0 0 0 name

Thank you!

CodePudding user response:

There's probably a much more efficient way, but this seems to do the trick. You'll have to name the columns yourself (the data doesn't seem to have column names).

library(dplyr)
library(tibble)
library(readr)

# Number of lines per record, determined by looking at the file.
# Not sure if there's a way to determine this automatically.
line_per_chunk <- 12L

# Read the whole file into a character vector, one element per line
data <- read_lines('new.data')

# Combine every group of 12 lines into a single string
# (using a space as a delimiter to match the rest of the file)
joined_data <- data %>%
  # Make the character vector a data frame, with a row number column
  enframe(name = 'row', value = 'raw_data') %>%
  # Based on https://stackoverflow.com/a/66732944/1714
  group_by(chunk = (row - 1) %/% line_per_chunk) %>%
  summarise(joined = paste(raw_data, collapse = ' '))

# Based on https://stackoverflow.com/a/8464885/1714
results <- read.table(textConnection(joined_data[["joined"]]), sep = ' ')

results
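
As a possible alternative that avoids hard-coding the number of lines per record, here is a minimal base R sketch. It relies on an assumption drawn only from the two sample records in the question: that every record in the file ends with the literal token "name" and that all records have the same number of fields. The sample_text value is made-up, shortened data used for illustration; for the real file, replace the text = sample_text argument with file = 'new.data'.

```r
# Two shortened, made-up records standing in for the real file.
# The second one is deliberately wrapped across two lines.
sample_text <- "1 63 1 -9 145 2.3 name\n2 67 1 -9\n160 1.5 name"

# scan() splits on any whitespace, including line breaks,
# so the way records wrap across lines does not matter
tokens <- scan(text = sample_text, what = character(), quiet = TRUE)

# Assumption: every record ends with the literal token "name"
ends <- which(tokens == "name")
starts <- c(1L, head(ends, -1L) + 1L)

# Collect each record's tokens, leaving out the trailing "name"
records <- Map(function(s, e) tokens[s:(e - 1L)], starts, ends)

# Stop early if the records do not all have the same number of fields
stopifnot(length(unique(lengths(records))) == 1L)

# One row per record; columns get the default names V1, V2, ...
results <- as.data.frame(do.call(rbind, records), stringsAsFactors = FALSE)

# Everything was read as text, so convert the columns to numeric
results[] <- lapply(results, as.numeric)

results
```

If the stopifnot() check fails on the real data, the "name" assumption does not hold for the whole file, and grouping by a fixed line count as above is the safer route.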