Irregular data importing in R-CodePudding

I am trying to import an excel file, which is irregular in the first column data in R. It indicates the name of countries as the below list but some countries have space(s) in the middle of the name (ex. United Kingdom, Hong Kong SAR). How should I code to import all countries to be on the first column regardless of this irregularity?

Country  Rank  Score    
Switzerland  1  5.72    
Singapore  2  5.67    
Finland  3  5.55    
Sweden  4  5.53    
Netherlands  5  5.5    
Germany  6  5.48    
United  States  7  5.47  
United  Kingdom  8  5.45  
Hong  Kong  SAR  9  5.41

library (readxl)
library(stringr)

data <- read_excel("score.xlsx", sheet = "1")
View(data)

CodePudding user response：

readr::melt_csv() or meltr::melt_csv() can be useful for ragged or otherwise messed-up data like this. It reads in every value in the data as its own row, along with its row and column positions and likely type.

library(tidyverse)
library(readxl)

# to use melt_csv, first have to convert from .xlsx
# we'll also save the column names for later
baddata <- read_excel("scores.xlsx")
colnames <- names(baddata)
write_csv(baddata, "scores_bad.csv")

baddata <- melt_csv("scores_bad.csv", skip = 1)

baddata
# # A tibble: 45 x 4
#      row   col data_type value      
#    <dbl> <dbl> <chr>     <chr>      
#  1     1     1 character Switzerland
#  2     1     2 integer   1          
#  3     1     3 double    5.72       
#  4     1     4 missing   NA         
#  5     1     5 missing   NA         
#  6     2     1 character Singapore  
#  7     2     2 integer   2          
#  8     2     3 double    5.67       
#  9     2     4 missing   NA         
# 10     2     5 missing   NA         
# # ... with 35 more rows

Where you go from here will depend on the structure of your data and where the problems are. In this case, we can (1) drop missings, (2) collapse all character values in each row into one value, (3) add in the column names, (4) pivot, and (5) change columns to the appropriate type via readr::type_convert().

baddata %>% 
  drop_na(value) %>% 
  group_by(row, col = if_else(data_type == "character", 1, col)) %>% 
  summarize(value = paste(value, collapse = " ")) %>% 
  mutate(name = c(colnames[1:3])) %>% 
  ungroup() %>% 
  pivot_wider(id_cols = row) %>% 
  select(!row) %>% 
  type_convert()

# # A tibble: 9 x 3
#   Country        Rank  Score
#   <chr>          <dbl> <dbl>
# 1 Switzerland    1     5.72 
# 2 Singapore      2     5.67 
# 3 Finland        3     5.55 
# 4 Sweden         4     5.53 
# 5 Netherlands    5     5.5  
# 6 Germany        6     5.48 
# 7 United States  7     5.47 
# 8 United Kingdom 8     5.45 
# 9 Hong Kong SAR  9     5.41

CodePudding user response：

One way to do it is by making your own CSV. One can split the data in the correct locations, join them with a delimiter, and read the data in as a CSV.

library(readxl)
library(stringr)

# Read in the data
data <- read_excel("score.xlsx", sheet = "1")

# Convert to a space separated table, 'txt' variable is a list of strings
txt = capture.output(
  write.table(data, sep=" ", quote=FALSE, row.names=FALSE, col.names=FALSE))

# Get rows for new data frame, by adding delimiters row by row
data_rows <- lapply(txt, function(cur_row){
  # Split by every space between a digit/letter and digit
  row_split <- unlist(str_split(cur_row, "(?<=[A-Za-z0-9]) (?=[0-9])"))
  # Set the delimiter character to a semicolon
  out_txt <- str_c(row_split, collapse=";")
  # Remove "NA" and extra spaces
  out_txt <- str_replace_all(out_txt, " NA", " ")
  out_txt <- str_trim(out_txt)
})

# Convert to final data frame
final_data <- read.csv(text=unlist(data_rows),
         sep=";", header=FALSE,
         col.names=c("Country","Rank","Score"))