So I have the following dataframe of the salaries of former NBA players. The original data was on a .txt file, that's why my dataframe look like this:
V1 V2 V3 V4 V5 V6
Vinnie Del Negro ..... 150000
Moses Malone ..... 75000
Kevin Willis ...... 150000
Joe Barry Carrol ..... 750000
Doc Rivers .... 230000
Antoine Carr ...... 85000
Hot Rod Williams ... 45000
This is just a glimpse of the real dataframe, which contains 350 rows and 7 columns (Non column is fully empty). What I want to do is to rearrange this dataframe in order to ultimately have just two columns, one corresponding to the name of the player and the other corresponding to his respective salary:
name salary
Vinnie Del Negro 150000
Moses Malone 75000
Kevin Willis 150000
Joe Barry Carrol 750000
Doc Rivers 230000
Antoine Carr 85000
Hot Rod Williams 45000
Maybe this is very simple, but I've been trying to work around it for a couple of hours now and I can't seem to get any closer. I think the best approach would be something that puts in the first column all the alphabetic cells and in the second column all the numeric cells. The only issue with this is that since the original dataframe was a .txt file, all the columns are expressed as.characters().
Hope you can help me!
Thanks
CodePudding user response:
We can use unite
from tidyr
along with coalesce
(assuming the .....
are NA
values). If those are ...
, then change it to NA
with a mutate
step (as below)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
df1 %>%
mutate(across(everything(),
~ case_when(str_detect(., "^\\. $")|. == "" ~ NA_character_,
TRUE ~ as.character(.)))) %>%
type.convert(as.is = TRUE) %>%
unite(name, V1, V2, V3, na.rm = TRUE, sep = " ") %>%
transmute(name, salary = invoke(coalesce, across(V4:V6)))
-output
name salary
1 Vinnie Del Negro 150000
2 Moses Malone 75000
3 Kevin Willis 150000
4 Joe Barry Carrol 750000
5 Doc Rivers 230000
6 Antoine Carr 85000
7 Hot Rod Williams 45000
data
df1 <- structure(list(V1 = c("Vinnie", "Moses", "Kevin", "Joe", "Doc",
"Antoine", "Hot"), V2 = c("Del", "Malone", "Willis", "Barry",
"Rivers", "Carr", "Rod"), V3 = c("Negro", ".....", "......",
"Carrol", "....", "......", "Williams"), V4 = c(".....", "",
"150000", ".....", "", "85000", "..."), V5 = c("150000", "75000",
"", "", "", "", ""), V6 = c("", "", "", "750000", "230000", "",
"45000")), row.names = c(NA, -7L), class = "data.frame")
CodePudding user response:
Another solution. As asked for, I am going to explain the part of the used regex expression.
By using unite
, I get a dataframe with only a column. The first row, after having executed the unite
, is
Vinnie Del Negro ..... 150000 NA
So, the separating string is the dots delimited by spaces. Thus, \\s
means space, \\.*
means 0 or more dots, and finally the last space \\s
.
library(tidyverse)
df <- data.frame(
stringsAsFactors = FALSE,
V1 = c("Vinnie","Moses",
"Kevin","Joe","Doc","Antoine","Hot"),
V2 = c("Del","Malone",
"Willis","Barry","Rivers","Carr","Rod"),
V3 = c("Negro",".....",
"......","Carrol","....","......","Williams"),
V4 = c(".....", NA, "150000", ".....", NA, "85000", "..."),
V5 = c(150000L, 75000L, NA, NA, NA, NA, NA),
V6 = c(NA, NA, NA, 750000L, 230000L, NA, 45000L)
)
df %>%
unite("name",sep=" ") %>%
separate(name, c("name", "salary"), sep="\\s\\.*\\s") %>%
mutate(salary = parse_number(salary))
#> name salary
#> 1 Vinnie Del Negro 150000
#> 2 Moses Malone 75000
#> 3 Kevin Willis 150000
#> 4 Joe Barry Carrol 750000
#> 5 Doc Rivers 230000
#> 6 Antoine Carr 85000
#> 7 Hot Rod Williams 45000