r Remove parts of column name after special characters

Time:12-06

Problem

I have a data frame in which I am trying to rename column entries that have multiple special characters, varying numbers of digits, and both positive and negative numbers, as shown in the example below.

Name  Number
A     -500--550
B     -600--650
C     -700--750
D     -8000--8500
E     -9000--9500
F     -100-200
G     200-400

These entries are date ranges and the middle hyphen is supposed to indicate "to", so "A" would be read as "negative 500 to negative 550", "F" would be read as "negative 100 to (positive) 200", and "G" would be read as "200 to 400".

Having a "-" at the beginning of many entries, a "--" in the middle, and different numbers of digits makes things a bit complicated. For my end result I would like to remove the "to" dash and everything after it. The end result should look like this:

Name  Number
A     -500
B     -600
C     -700
D     -8000
E     -9000
F     -100
G      200

A dplyr approach would be great, but I'm not terribly picky as long as it works.

Similar Questions

I found some similar questions which came close to providing an answer, but the differences in the data sets have caused problems.

In this example (Removing characters in column titles after "."), they have a differing number of digits after the dot "." and use gsub to tackle the issue.

colnames(df) <- gsub("\\..*$", "", colnames(df))

In this other example (Remove (or replace) everything after a specified character in R strings), they had multiple dots "." and wanted to delete everything from the last "." onwards.

One of the methods used stringr as is shown below.

library(stringr)
str_remove(x, "\\.[^.]*$")

The problem here is that for many entries I want to remove everything from the second "-" onwards, but my attempt below doesn't work for rows "F" or "G":

str_remove(testing$Number, "\\--[^-]*$")
[1] "-500"     "-600"     "-700"     "-8000"    "-9000"    "-100-200" "200-400" 
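
To make the failure concrete, here is a minimal sketch that checks which entries the attempted pattern can match at all. It uses base R's grepl, and the pattern is written without the backslash on the assumption that the escape does not change the match; the vector name x is only illustrative.

```r
# Three representative entries from the sample data
x <- c("-500--550", "-100-200", "200-400")

# The pattern requires two consecutive dashes before the final number,
# so entries with a single "to" dash are never matched
has_double_dash <- grepl("--[^-]*$", x)
print(has_double_dash)
# [1]  TRUE FALSE FALSE
```

Only the first entry contains "--", which is why rows "F" and "G" come back unchanged.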

Sample Data

I've provided a sample test set below.

structure(list(Name = c("A", "B", "C", "D", "E", "F", "G"), Number = c("-500--550", 
"-600--650", "-700--750", "-8000--8500", "-9000--9500", "-100-200", 
"200-400")), class = "data.frame", row.names = c(NA, -7L))

CodePudding user response:

I would replace on the pattern -+\d+$:

testing$Number <- sub("-+\\d+$", "", testing$Number)

Here is a working regex demo.

The regex used here says to match:

  • -+ one or more dashes
  • \d+ followed by one or more digits
  • $ the end of the value
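
As a rough end-to-end check, the sketch below applies that idea to the sample data from the question. It assumes the intended pattern is -+\d+$ (one or more dashes, one or more digits, end of value), as the explanation describes; the names pattern, result and cleaned are illustrative, and the dplyr variant runs only if dplyr happens to be installed.

```r
# Sample data from the question
testing <- data.frame(
  Name = c("A", "B", "C", "D", "E", "F", "G"),
  Number = c("-500--550", "-600--650", "-700--750", "-8000--8500",
             "-9000--9500", "-100-200", "200-400"),
  stringsAsFactors = FALSE
)

# One or more dashes, then one or more digits, anchored at the end
pattern <- "-+\\d+$"

# Base R: strip the trailing "to" dash(es) and the end value
result <- sub(pattern, "", testing$Number)
print(result)
# [1] "-500"  "-600"  "-700"  "-8000" "-9000" "-100"  "200"

# dplyr variant of the same replacement (skipped if dplyr is not installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  cleaned <- dplyr::mutate(testing, Number = sub(pattern, "", Number))
  print(cleaned)
}
```

Because the pattern is anchored with $, the leading minus sign is left alone: it is followed by digits but not by the end of the string, so only the final "-number" or "--number" part is removed.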