I have a data frame, for example:
name student_id age gender
Sam 123_abc_ABC 20 F
John 234_bcd_BCD 18 M
Mark 345_cde_CDE 20 M
Ram xyz_111_XYZ 19 M
Hari uvw_444_UVW 23 M
Now I need a new column, student_id_by_govt, in the df. The student_id_by_govt is contained in student_id, but its position differs between names. For Sam, John, and Mark, student_id_by_govt is the first segment of student_id (i.e., 123, 234, 345), but for Ram and Hari it is the second segment (i.e., 111, 444).
I used strsplit and lapply to get a specific segment from student_id, but I was not able to apply them to specific rows only to get the desired output. Please let me know how to get the output below:
name student_id age gender student_id_by_govt
Sam 123_abc_ABC 20 F 123
John 234_bcd_BCD 18 M 234
Mark 345_cde_CDE 20 M 345
Ram xyz_111_XYZ 19 M 111
Hari uvw_444_UVW 23 M 444
CodePudding user response:
You can use regex via the function str_extract from the library stringr:
library(dplyr)
library(stringr)
library(purrr)
df <- tibble(Name = c("Sam", "John", "Mark", "Ram", "Hari"), student_id = c("123_abc_ABC", "234_bcd_BCD", "345_cde_CDE", "xyz_111_XYZ", "uvw_444_UVW")) %>%
mutate(student_id_by_gvt = map_chr(student_id, function(x){str_extract(x, "(\\d+)")}))
Here is the output:
# A tibble: 5 x 3
Name student_id student_id_by_gvt
<chr> <chr> <chr>
1 Sam 123_abc_ABC 123
2 John 234_bcd_BCD 234
3 Mark 345_cde_CDE 345
4 Ram xyz_111_XYZ 111
5 Hari uvw_444_UVW 444
I am more comfortable with the tidyverse packages. I hope this solution helps you.
CodePudding user response:
You only need str_extract:
library(tidyverse)
df %>%
mutate(student_id_by_govt = str_extract(student_id, "\\d+"))
# A tibble: 5 × 3
Name student_id student_id_by_govt
<chr> <chr> <chr>
1 Sam 123_abc_ABC 123
2 John 234_bcd_BCD 234
3 Mark 345_cde_CDE 345
4 Ram xyz_111_XYZ 111
5 Hari uvw_444_UVW 444
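If you would rather not load any packages, roughly the same extraction can be sketched in base R with regexpr and regmatches. This is only a sketch: the standalone student_id vector is set up here for illustration, and the NA pre-fill is my own addition so that IDs without any digits (none in the example data) would not cause a length mismatch.

```r
# Example IDs copied from the question
student_id <- c("123_abc_ABC", "234_bcd_BCD", "345_cde_CDE",
                "xyz_111_XYZ", "uvw_444_UVW")

# regexpr() gives the position of the first run of digits, or -1 if none
m <- regexpr("[0-9]+", student_id)

# Pre-fill with NA, then fill only the elements that actually matched;
# regmatches() returns matches for the matching elements only
student_id_by_govt <- rep(NA_character_, length(student_id))
student_id_by_govt[m > 0] <- regmatches(student_id, m)

student_id_by_govt
```

As with str_extract, the result is a character vector; wrap it in as.integer() if you need numbers.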
CodePudding user response:
Another option is parse_number, which extracts the first number it finds in a string (it returns a numeric value, not a character string):
df <- read.table(text="name student_id age gender
Sam 123_abc_ABC 20 F
John 234_bcd_BCD 18 M
Mark 345_cde_CDE 20 M
Ram xyz_111_XYZ 19 M
Hari uvw_444_UVW 23 M", header = TRUE)
library(dplyr)
library(purrr)
library(stringr)
df %>%
mutate(student_id_by_govt = readr::parse_number(as.character(student_id)))
#> name student_id age gender student_id_by_govt
#> 1 Sam 123_abc_ABC 20 F 123
#> 2 John 234_bcd_BCD 18 M 234
#> 3 Mark 345_cde_CDE 20 M 345
#> 4 Ram xyz_111_XYZ 19 M 111
#> 5 Hari uvw_444_UVW 23 M 444
Created on 2022-07-01 by the reprex package (v2.0.1)
CodePudding user response:
I am not sure I understand what you want.
library(dplyr)
library(stringr)
library(purrr)
df <- tibble(Name = c("Sam", "John", "Mark", "Ram", "Hari"), student_id = c("123_abc_ABC", "234_596_BCD", "345_cde_CDE", "xyz_111_XYZ", "uvw_444_UVW"), Gender = c("F", "M", "M", "M", "M")) %>%
# mutate(student_id_by_gvt = if_else(Gender == "M", str_split(student_id, "_")[[1]][1], str_split(student_id, "_")[[1]][2]))
mutate(student_id_by_gvt = map2_chr(Gender, student_id, function(x,y){if_else(x == "M", str_split(y, "_")[[1]][1], str_split(y, "_")[[1]][2])}))
It gives the output:
Name student_id Gender student_id_by_gvt
<chr> <chr> <chr> <chr>
1 Sam 123_abc_ABC F abc
2 John 234_596_BCD M 234
3 Mark 345_cde_CDE M 345
4 Ram xyz_111_XYZ M xyz
5 Hari uvw_444_UVW M uvw
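The same row-wise choice of segment can also be sketched in base R, which may be closer to the strsplit approach tried in the question. The idea is to split every ID once, build a vector saying which segment each row needs, and then pick that segment per row. The helper names parts and pos are just illustrative, and the rule (first segment for "M", second otherwise) is assumed from the example above.

```r
df <- data.frame(
  Name = c("Sam", "John", "Mark", "Ram", "Hari"),
  student_id = c("123_abc_ABC", "234_596_BCD", "345_cde_CDE",
                 "xyz_111_XYZ", "uvw_444_UVW"),
  Gender = c("F", "M", "M", "M", "M"),
  stringsAsFactors = FALSE
)

# Split each ID on "_" (returns a list with one character vector per row)
parts <- strsplit(df$student_id, "_", fixed = TRUE)

# Which segment each row needs: 1 for "M", 2 otherwise (assumed rule)
pos <- ifelse(df$Gender == "M", 1L, 2L)

# Pick the requested segment row by row
df$student_id_by_gvt <- vapply(
  seq_along(parts),
  function(i) parts[[i]][pos[i]],
  character(1)
)

df
```

Because pos is an ordinary vector, you can build it from any condition (names, another column, a lookup table) instead of Gender.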
CodePudding user response:
Since one of your follow-up comments changed the question quite a bit, i.e. to "Hi there, is there a code for getting student_id_by_govt column based on gender column? I mean for all the 'M' gender, the student_id_by_govt is first segment in the student_id (ie., 234, 345, xyz, uvw) but for 'F' gender I want the second segment of student_id (i.e., abc) as student_id_by_govt? Please note, the segments I want from student_id in my original data is not numeric.",
here is a simple base R solution for the case where the segments are of equal length and their positions are stable. If the strings differ in length but contain certain characters that identify the segments, you could replace substr with a regex function.
df$student_id_by_govt <- ifelse(df$gender == "M",
substr(df$student_id, 1,3), substr(df$student_id, 5,7))