Data loss with read.table() in RStudio

Struggling with data loss when using read.table in R

I downloaded the entire World Checklist of Vascular Plants (WCVP) database, version 9:

http://sftp.kew.org/pub/data-repositories/WCVP/

Unzip the file to get wcvp_v9_jun_2022.txt and use Ctrl+F to search for "Corymbia"; you will find many rows of data where genus == "Corymbia". The same is true for genus == "Eucalyptus" and genus == "Angophora".

I imported it into RStudio with the following line:

WCVP <- read.table("wcvp_v9_jun_2022.txt", sep = "|", fill = TRUE, header = TRUE)

and checked for the data:

WCVP[WCVP$genus=="Corymbia",]

WCVP[WCVP$genus=="Eucalyptus",]

WCVP[WCVP$genus=="Angophora",]

For Corymbia I got this response:

 WCVP[WCVP$genus=="Corymbia",]
 [1] kew_id           family           genus            species         
 [5] infraspecies     taxon_name       authors          rank            
 [9] taxonomic_status accepted_kew_id  accepted_name    accepted_authors
[13] parent_kew_id    parent_name      parent_authors   reviewed        
[17] publication      original_name_id
<0 rows> (or 0-length row.names)

Meanwhile, the data for the other two genera are intact, and R returns rows for them.

Why is the data for genus Corymbia missing after the .txt file is imported into RStudio? Is this a bug, and how do I troubleshoot it?

Many thanks

CodePudding user response：

How to troubleshoot:

1. Count the number of lines in the database file and compare it to the number of rows in WCVP. If they are the same (or off by one, because of the header row), then you have the data, but it is messed up somehow. If the data frame has far fewer rows than the file has lines, see step 3.
2. What line number is "Corymbia" on in the text file? What is on that line in WCVP?
3. If lines are missing, find the first missing line by comparing line n of the text file to row n-1 of the data frame. Start with small n and increase until you find something that's wrong, then zero in to find the first one. What is special about that line? A likely cause is that the formatting isn't what you expect, e.g. missing or extra delimiters.
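
The steps above could be sketched roughly as follows. This is only an illustration: `find_first_mismatch` is a hypothetical helper name (not part of base R or of the answer), and it assumes the first column of the file is an identifier that contains neither quotes nor the separator, so the first field of raw line n can be compared directly with the first column of row n-1.

```r
# Hypothetical helper: compare the raw file with what read.table() produced.
find_first_mismatch <- function(path, sep = "|") {
  # Raw text, one element per physical line, without the header line.
  raw <- readLines(path, warn = FALSE)[-1]

  # Same call as in the question, so any parsing problem is reproduced.
  dat <- read.table(path, sep = sep, fill = TRUE, header = TRUE)

  # First field of every raw line, split on the literal separator.
  raw_first <- vapply(
    strsplit(raw, sep, fixed = TRUE),
    function(x) if (length(x)) x[1] else "",
    character(1)
  )
  parsed_first <- as.character(dat[[1]])

  # Compare only the part both sides have in common.
  n <- min(length(raw_first), length(parsed_first))
  bad <- which(raw_first[seq_len(n)] != parsed_first[seq_len(n)])

  list(
    file_lines     = length(raw) + 1L,   # header included
    parsed_rows    = nrow(dat),
    first_mismatch = if (length(bad)) bad[1] else NA_integer_
  )
}
```

Calling `find_first_mismatch("wcvp_v9_jun_2022.txt")` would then give the two counts from step 1 and the first data row that disagrees with the file. The row just before that one is a good place to start looking for a stray quote or delimiter.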

CodePudding user response：

The data contain embedded single quotes (apostrophes, not always paired), and read.table() treats them as quote characters by default (quote = "\"'"), which throws off the import. Set quote = "" and you should see all the data.

WCVP <- read.table("wcvp_v9_jun_2022.txt",
                   sep = "|", fill = TRUE, header = TRUE)
nrow(WCVP)
# [1] 605649
WCVP[WCVP$genus=="Corymbia",]
#  [1] kew_id           family           genus            species          infraspecies     taxon_name       authors          rank             taxonomic_status accepted_kew_id  accepted_name    accepted_authors parent_kew_id   
# [14] parent_name      parent_authors   reviewed         publication      original_name_id
# <0 rows> (or 0-length row.names)

WCVP <- read.table("wcvp_v9_jun_2022.txt",
                   sep = "|", fill = TRUE, header = TRUE, quote = "")
nrow(WCVP)
# [1] 1232931                                    ## DIFFERENT!

head(WCVP[WCVP$genus=="Corymbia",], 3)
#          kew_id    family    genus    species infraspecies          taxon_name                                     authors    rank taxonomic_status accepted_kew_id accepted_name accepted_authors parent_kew_id parent_name
# 758307 986238-1 Myrtaceae Corymbia                                    Corymbia                    K.D.Hill & L.A.S.Johnson   GENUS         Accepted                                                                         
# 758308 986307-1 Myrtaceae Corymbia abbreviata              Corymbia abbreviata (Blakely & Jacobs) K.D.Hill & L.A.S.Johnson SPECIES         Accepted                                                     986238-1    Corymbia
# 758309 986248-1 Myrtaceae Corymbia  abergiana               Corymbia abergiana         (F.Muell.) K.D.Hill & L.A.S.Johnson SPECIES         Accepted                                                     986238-1    Corymbia
#                  parent_authors reviewed           publication original_name_id
# 758307                          Reviewed Telopea 6: 214 (1995)                 
# 758308 K.D.Hill & L.A.S.Johnson Reviewed Telopea 6: 344 (1995)         592646-1
# 758309 K.D.Hill & L.A.S.Johnson Reviewed Telopea 6: 244 (1995)         592647-1
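
To see the same effect without downloading the full database, here is a minimal sketch on a toy file. The rows are made up for illustration (only the column layout loosely mimics the WCVP file), and the two apostrophes are deliberately paired so that the first import runs without an end-of-file warning.

```r
# Toy pipe-delimited file; contents are invented for this example.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "id|genus|authors",
  "1|Eucalyptus|L'Her.",
  "2|Corymbia|K.D.Hill",
  "3|Angophora|Cav.",
  "4|Acacia|d'Orb.",
  "5|Banksia|L.f."
), tmp)

# Default quoting: the apostrophe in the first data line opens a quoted
# string that runs until the next apostrophe, so the lines in between are
# absorbed into a single field instead of becoming rows of their own.
bad <- read.table(tmp, sep = "|", fill = TRUE, header = TRUE)

# quote = "" switches quote handling off, so each line becomes one row.
good <- read.table(tmp, sep = "|", fill = TRUE, header = TRUE, quote = "")

nrow(bad) < nrow(good)       # the first import has fewer rows
"Corymbia" %in% bad$genus    # the genus is hidden inside the absorbed text
"Corymbia" %in% good$genus   # and is visible once quoting is disabled

unlink(tmp)
```

Inspecting `bad$authors` shows where the missing lines went: they end up inside one long string in the authors column, which is why filtering on genus finds nothing.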