Struggling with data loss when using read.table in R
I downloaded the entire World Checklist of Vascular Plants (WCVP) database, version 9:
http://sftp.kew.org/pub/data-repositories/WCVP/
I unzipped the file to get wcvp_v9_jun_2022.txt. Searching it with Ctrl+F for "Corymbia" finds many rows of data where genus=="Corymbia"; the same is true for genus=="Eucalyptus" and genus=="Angophora".
I then imported it into RStudio with the following line:
WCVP <- read.table("wcvp_v9_jun_2022.txt",sep = "|", fill = T, header = T)
and checked for the data:
WCVP[WCVP$genus=="Corymbia",]
WCVP[WCVP$genus=="Eucalyptus",]
WCVP[WCVP$genus=="Angophora",]
For Corymbia I got this response:
WCVP[WCVP$genus=="Corymbia",]
[1] kew_id family genus species
[5] infraspecies taxon_name authors rank
[9] taxonomic_status accepted_kew_id accepted_name accepted_authors
[13] parent_kew_id parent_name parent_authors reviewed
[17] publication original_name_id
<0 rows> (or 0-length row.names)
The data for the other two genera are intact, and R prints rows of data for them.
Why is the data for genus Corymbia missing after the .txt file is imported into RStudio? Is this a bug, and if not, how do I troubleshoot it?
Many thanks
CodePudding user response:
How to troubleshoot:
1. Count the number of lines in the database file, and compare to the number of rows in WCVP. If they are the same (or off by one, because of the title row), then you have the data, but it is messed up somehow. If you have a lot fewer lines, then see 3 below.
2. What line number is "Corymbia" on in the text file? What is on that line in WCVP?
3. If lines are missing, figure out what is the first missing line, by comparing line n of the text file to line n-1 of the dataframe. Start with small n, and increase until you find something that's wrong, then zero in to find the first one. What is special about that line? A likely cause is that the formatting isn't what you expect, e.g. missing or extra delimiters.
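The comparison described in these steps can be sketched in base R. This is only an illustration under assumptions: the small file below is synthetic (made-up rows, not real WCVP records), and the names path, raw, first_field and first_bad are hypothetical helpers, not part of the original question.

```r
# Synthetic pipe-delimited file (made-up rows, not real WCVP records).
# Two author fields contain a single quote, as an assumed stand-in for
# whatever is unusual in the real file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "kew_id|genus|authors",
  "1|Angophora|Cav.",
  "2|Eucalyptus|L'Her.",
  "3|Corymbia|K.D.Hill",
  "4|Corymbia|O'Brien",
  "5|Eucalyptus|Labill."
), path)

raw <- readLines(path)
df  <- read.table(path, sep = "|", fill = TRUE, header = TRUE)

# Step 1: data lines in the file (minus the header) vs. rows in the data frame.
n_data_lines <- length(raw) - 1
c(lines = n_data_lines, rows = nrow(df))

# Step 3: compare the first field of each data line with column 1 of the
# data frame, and find the first row where they disagree.
first_field <- sub("\\|.*$", "", raw[-1])
n <- min(nrow(df), length(first_field))
mismatch <- which(as.character(df[[1]])[seq_len(n)] != first_field[seq_len(n)])
first_bad <- if (length(mismatch) > 0) mismatch[1] else NA_integer_

# The damage usually starts on or just before this line of the file,
# so print the data line before the mismatch and the mismatching one.
if (!is.na(first_bad)) raw[first_bad + c(0, 1)]
```

On the real file, path would be "wcvp_v9_jun_2022.txt". Reading 1.2 million lines with readLines is feasible but takes some memory, so comparing only the first few thousand lines first may be enough to find the problem.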
CodePudding user response:
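There are embedded single quotes (not always paired) in the data that are throwing off the import. By default read.table treats both " and ' as quote characters, so an unpaired ' makes it read everything up to the next ' as a single field, swallowing the lines in between. Set quote="" and you should see all the data.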
WCVP <- read.table("wcvp_v9_jun_2022.txt",
                   sep = "|", fill = TRUE, header = TRUE)
nrow(WCVP)
# [1] 605649
WCVP[WCVP$genus=="Corymbia",]
# [1] kew_id family genus species infraspecies taxon_name authors rank taxonomic_status accepted_kew_id accepted_name accepted_authors parent_kew_id
# [14] parent_name parent_authors reviewed publication original_name_id
# <0 rows> (or 0-length row.names)
WCVP <- read.table("wcvp_v9_jun_2022.txt",
                   sep = "|", fill = TRUE, header = TRUE, quote = "")
nrow(WCVP)
# [1] 1232931 ## DIFFERENT!
head(WCVP[WCVP$genus=="Corymbia",], 3)
# kew_id family genus species infraspecies taxon_name authors rank taxonomic_status accepted_kew_id accepted_name accepted_authors parent_kew_id parent_name
# 758307 986238-1 Myrtaceae Corymbia Corymbia K.D.Hill & L.A.S.Johnson GENUS Accepted
# 758308 986307-1 Myrtaceae Corymbia abbreviata Corymbia abbreviata (Blakely & Jacobs) K.D.Hill & L.A.S.Johnson SPECIES Accepted 986238-1 Corymbia
# 758309 986248-1 Myrtaceae Corymbia abergiana Corymbia abergiana (F.Muell.) K.D.Hill & L.A.S.Johnson SPECIES Accepted 986238-1 Corymbia
# parent_authors reviewed publication original_name_id
# 758307 Reviewed Telopea 6: 214 (1995)
# 758308 K.D.Hill & L.A.S.Johnson Reviewed Telopea 6: 344 (1995) 592646-1
# 758309 K.D.Hill & L.A.S.Johnson Reviewed Telopea 6: 244 (1995) 592647-1
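The same effect can be reproduced without downloading the full database. The following is a minimal sketch on a synthetic file (made-up rows, not real WCVP records); the variable names are hypothetical. It reads the same file with and without quote handling, and uses grep to list which lines contain a single quote.

```r
# Synthetic pipe-delimited file (made-up rows, not real WCVP records).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "kew_id|genus|authors",
  "1|Angophora|Cav.",
  "2|Eucalyptus|L'Her.",
  "3|Corymbia|K.D.Hill",
  "4|Corymbia|O'Brien",
  "5|Eucalyptus|Labill."
), tmp)

# Default quote = "\"'": the ' on line 3 opens a quoted string that runs
# until the next ' on line 5, so the lines in between are not separate rows.
with_quotes <- read.table(tmp, sep = "|", fill = TRUE, header = TRUE)

# quote = "" disables quote handling, so every line becomes its own row.
no_quotes <- read.table(tmp, sep = "|", fill = TRUE, header = TRUE, quote = "")

nrow(with_quotes)
nrow(no_quotes)
sum(no_quotes$genus == "Corymbia")

# File line numbers (header is line 1) that contain a single quote.
apostrophe_lines <- grep("'", readLines(tmp), fixed = TRUE)
apostrophe_lines
```

Running grep like this on the real file is one way to confirm that single quotes are present before changing the import call.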