Home > Software design >  Unite the columns of the dataframe based on the type of character they contain
Unite the columns of the dataframe based on the type of character they contain

Time:10-28

So I have the following dataframe of the salaries of former NBA players. The original data was on a .txt file, that's why my dataframe look like this:

  V1         V2         V3          V4         V5         V6
Vinnie     Del        Negro       .....      150000
Moses      Malone     .....                  75000
Kevin      Willis     ......      150000
Joe        Barry      Carrol      .....                 750000
Doc        Rivers     ....                              230000
Antoine    Carr       ......      85000
Hot        Rod        Williams    ...                   45000

This is just a glimpse of the real dataframe, which contains 350 rows and 7 columns (Non column is fully empty). What I want to do is to rearrange this dataframe in order to ultimately have just two columns, one corresponding to the name of the player and the other corresponding to his respective salary:

   name                 salary
Vinnie Del Negro        150000
Moses Malone            75000
Kevin Willis            150000
Joe Barry Carrol        750000
Doc Rivers              230000
Antoine Carr            85000
Hot Rod Williams        45000

Maybe this is very simple, but I've been trying to work around it for a couple of hours now and I can't seem to get any closer. I think the best approach would be something that puts in the first column all the alphabetic cells and in the second column all the numeric cells. The only issue with this is that since the original dataframe was a .txt file, all the columns are expressed as.characters().

Hope you can help me!

Thanks

CodePudding user response:

We can use unite from tidyr along with coalesce (assuming the ..... are NA values). If those are ..., then change it to NA with a mutate step (as below)

library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
df1 %>%
  mutate(across(everything(),
   ~ case_when(str_detect(., "^\\. $")|. == "" ~ NA_character_, 
           TRUE ~ as.character(.)))) %>%
  type.convert(as.is = TRUE) %>%
  unite(name, V1, V2, V3, na.rm = TRUE, sep = " ") %>%
  transmute(name, salary = invoke(coalesce, across(V4:V6)))

-output

              name salary
1 Vinnie Del Negro 150000
2     Moses Malone  75000
3     Kevin Willis 150000
4 Joe Barry Carrol 750000
5       Doc Rivers 230000
6     Antoine Carr  85000
7 Hot Rod Williams  45000

data

df1 <- structure(list(V1 = c("Vinnie", "Moses", "Kevin", "Joe", "Doc", 
"Antoine", "Hot"), V2 = c("Del", "Malone", "Willis", "Barry", 
"Rivers", "Carr", "Rod"), V3 = c("Negro", ".....", "......", 
"Carrol", "....", "......", "Williams"), V4 = c(".....", "", 
"150000", ".....", "", "85000", "..."), V5 = c("150000", "75000", 
"", "", "", "", ""), V6 = c("", "", "", "750000", "230000", "", 
"45000")), row.names = c(NA, -7L), class = "data.frame")

CodePudding user response:

Another solution. As asked for, I am going to explain the part of the used regex expression.

By using unite, I get a dataframe with only a column. The first row, after having executed the unite, is

Vinnie Del Negro ..... 150000 NA

So, the separating string is the dots delimited by spaces. Thus, \\s means space, \\.* means 0 or more dots, and finally the last space \\s.

library(tidyverse)  

df <- data.frame(
  stringsAsFactors = FALSE,
                      V1 = c("Vinnie","Moses",
                             "Kevin","Joe","Doc","Antoine","Hot"),
                      V2 = c("Del","Malone",
                             "Willis","Barry","Rivers","Carr","Rod"),
                      V3 = c("Negro",".....",
                             "......","Carrol","....","......","Williams"),
                V4 = c(".....", NA, "150000", ".....", NA, "85000", "..."),
                V5 = c(150000L, 75000L, NA, NA, NA, NA, NA),
                V6 = c(NA, NA, NA, 750000L, 230000L, NA, 45000L)
      )
  
df  %>% 
  unite("name",sep=" ") %>% 
  separate(name, c("name", "salary"), sep="\\s\\.*\\s") %>% 
  mutate(salary = parse_number(salary))
#>               name salary
#> 1 Vinnie Del Negro 150000
#> 2     Moses Malone  75000
#> 3     Kevin Willis 150000
#> 4 Joe Barry Carrol 750000
#> 5       Doc Rivers 230000
#> 6     Antoine Carr  85000
#> 7 Hot Rod Williams  45000
  •  Tags:  
  • r
  • Related