How to read data in "Debian control file" (DCF) format?

Time:10-24

Dataset

Please advise on the best way to read this type of data into a data frame in R.

Using read.table("Software.txt") only gives the error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
line 1 did not have 6 elements.

Furthermore, this data (an Amazon dataset) is not in the traditional rows-and-columns format, so I would appreciate any help with that as well.

CodePudding user response:

Your data appears to be in the same "Debian control file" (DCF) format that's used to store package metadata. The correct import function for such data is

read.dcf("Software.txt")

Check out the ?read.dcf help page for more info.
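
As a minimal sketch of how this might look end to end: read.dcf returns a character matrix (one row per record), so a few extra steps are needed to get a typed data frame. The sample records, the temporary file, and the variable names below are made up for illustration and only mimic the "key: value" layout; treating "unknown" as missing is an assumption about the data.

```r
# Hypothetical two-record sample in "key: value" layout, records separated by a blank line
sample_lines <- c(
  "product/productId: B000068VBQ",
  "product/price: 8.88",
  "review/score: 2.0",
  "",
  "product/productId: B000068VBQ",
  "product/price: unknown",
  "review/score: 4.0"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

m <- read.dcf(tmp)                                # character matrix, one row per record
df <- as.data.frame(m, stringsAsFactors = FALSE)  # matrix -> data frame of character columns
names(df) <- make.names(names(df))                # "product/price" -> "product.price"
df[df == "unknown"] <- NA                         # assumption: "unknown" marks a missing value
df <- type.convert(df, as.is = TRUE)              # numeric-looking columns become numeric
str(df)
```

With the real file, the tempfile part would be replaced by the path to Software.txt.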

CodePudding user response:

Here is a solution based on readLines.

r1 <- readLines('~/Downloads/Software.txt')  ## read raw text
r2 <- r1[r1 != '']  ## drop blank lines; each record then spans 10 consecutive lines
r3 <- strsplit(r2, ': ')  ## split each line at `: `
## keep the part after the first `: ` and fill a matrix with 10 rows (one column per record);
## note: `[`, 2 keeps only the text up to a second `: ` if a value contains one
r4 <- matrix(sapply(r3, `[`, 2), 10, dimnames=list(sapply(r3[1:10], `[`, 1), NULL))
r5 <- as.data.frame(t(r4))  ## transpose and coerce to data frame
r6 <- setNames(r5, make.names(names(r5)))  ## make syntactically valid column names
r6[r6 == 'unknown'] <- NA  ## recode 'unknown' as NA
r7 <- type.convert(r6, as.is=TRUE)  ## convert columns to their proper classes

You can, of course, streamline this a little. I just wanted to show you the individual steps.

Result

str(r7)  
# 'data.frame': 95084 obs. of  10 variables:
# $ product.productId : chr  "B000068VBQ" "B000068VBQ" "B000068VBQ" "B000068VBQ" ...
# $ product.title     : chr  "Fisher-Price Rescue Heroes" "Fisher-Price Rescue Heroes"  ...
# $ product.price     : num  8.88 8.88 8.88 8.88 8.88 8.88 8.88 NA NA NA ...
# $ review.userId     : chr  NA NA "A10P44U29RNOT6" NA ...
# $ review.profileName: chr  NA NA "D. Jones" NA ...
# $ review.helpfulness: chr  "11/11" "9/10" "6/6" "4/4" ...
# $ review.score      : num  2 2 1 1 4 5 1 4 5 4 ...
# $ review.time       : int  1042070400 1041552000 1126742400 1042416000 1045008000  ...
# $ review.summary    : chr  "Requires too much coordination" "You can't pick which  ...
# $ review.text       : chr  "I bought this software for my 5 year old. He has a couple ... 
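
The individual readLines steps above could be folded into one function along these lines. This is only a sketch: parse_records and its n_fields argument are hypothetical names, the sample lines are invented, and it assumes every record has the same fields in the same order and that every non-blank line contains `: `. Unlike the step-by-step version, it splits each line at the first `: ` only, so values that themselves contain `: ` are kept whole.

```r
# Hypothetical helper: turn "key: value" lines into a typed data frame.
# Assumes every record has the same n_fields fields in the same order.
parse_records <- function(lines, n_fields = 10) {
  lines <- lines[lines != ""]                  # drop blank separator lines
  pos   <- regexpr(": ", lines, fixed = TRUE)  # position of the first ": " in each line
  keys  <- substr(lines, 1, pos - 1)           # text before the first ": "
  vals  <- substr(lines, pos + 2, nchar(lines))  # everything after the first ": "
  m <- matrix(vals, ncol = n_fields, byrow = TRUE,
              dimnames = list(NULL, make.names(keys[seq_len(n_fields)])))
  df <- as.data.frame(m, stringsAsFactors = FALSE)
  df[df == "unknown"] <- NA                    # assumption: "unknown" marks a missing value
  type.convert(df, as.is = TRUE)               # convert columns to their proper classes
}

# Invented sample with three fields per record
sample_lines <- c(
  "product/title: Rescue Heroes: Deluxe",
  "product/price: unknown",
  "review/score: 4.0",
  "",
  "product/title: Other",
  "product/price: 8.88",
  "review/score: 1.0"
)
out <- parse_records(sample_lines, n_fields = 3)
str(out)
```

For the real file this would be called as parse_records(readLines("Software.txt")), keeping the default of 10 fields per record.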