Find row names containing only dashes or only periods in R

Time:06-28

I have a big dataframe (70k rows by 200k columns) with some of the row names having dashes, some having periods, and some having both, something like this:

df <- data.frame(cell1 = c(0,1,2,3,4,5,6), cell2 = c(0,1,2,3,4,5,6))
rownames(df) <- c("CMP21-97G8.1", "RP11-34P13.7", "HLA.A", "HLA-A", "HLA-E", "HLA.E", "RP11.442N24--B.1")

                   cell1 cell2
CMP21-97G8.1         0     0
RP11-34P13.7         1     1
HLA.A                2     2
HLA-A                3     3
HLA-E                4     4
HLA.E                5     5
RP11.442N24--B.1     6     6

I want to make three df subgroups where one subgroup has the rownames with only periods (HLA.A/HLA.E), one with dash-only rownames (HLA-A/HLA-E), and one with both (CMP21-97G8.1/RP11-34P13.7/RP11.442N24--B.1). Something like this:

df1
                 cell1 cell2
CMP21-97G8.1         0     0
RP11-34P13.7         1     1
RP11.442N24--B.1     6     6

df2
                 cell1 cell2
HLA.A                2     2
HLA.E                5     5

df3
                 cell1 cell2
HLA-A                3     3
HLA-E                4     4

When I try to match periods and dashes, though, my patterns are too permissive: they only check whether a name contains a period or a dash, and they do not separate out the names that contain both.

#looking for either or. Returns all types mentioned
df <- df[grepl("[-]|[.]",rownames(df)),]
#tries to look for only containing both. Returns all types mentioned
df <- df[grepl("[^-]*-([^.]+).*",rownames(df)),]
#returns nothing
df <- df[grepl("[-]&[.]",rownames(df)),]
df <- df[grepl("[-]&&[.]",rownames(df)),]

Hopefully this makes sense and thanks for reading!

CodePudding user response:

You can use the following to get the first dataframe:

df1 <- df[grepl("-[^.]*\\.|\\.[^-]*-",rownames(df)),]

Output:

> df1
                 cell1 cell2
CMP21-97G8.1         0     0
RP11-34P13.7         1     1
RP11.442N24--B.1     6     6

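The regex -[^.]*\.|\.[^-]*- (written with doubled backslashes inside the R string) has two alternatives: a hyphen followed, after any number of non-dot characters, by a dot, or a dot followed, after any number of non-hyphen characters, by a hyphen. Either way, a match means the string contains at least one hyphen and at least one dot.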
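The `[-]&[.]` attempts in the question return nothing because `&` has no special meaning inside a regex; it is matched as a literal character. As an alternative to the single regex above, the "and" can be done in R on two logical vectors. This is only a minimal sketch on the sample row names from the question; the variable names `rn`, `has_dash`, `has_dot` and `both` are made up for illustration.

```r
# Sample row names from the question
rn <- c("CMP21-97G8.1", "RP11-34P13.7", "HLA.A", "HLA-A",
        "HLA-E", "HLA.E", "RP11.442N24--B.1")

# fixed = TRUE treats the pattern as a literal string,
# so "." means a real dot and not "any character"
has_dash <- grepl("-", rn, fixed = TRUE)
has_dot  <- grepl(".", rn, fixed = TRUE)

# Element-wise AND on the logical vectors, done in R rather than in the regex
both <- has_dash & has_dot

rn[both]
# Names with only one of the two characters follow the same idea:
rn[has_dot & !has_dash]
rn[has_dash & !has_dot]
```

On the sample names, `both` selects the same rows as the regex, so `df[both, ]` should be interchangeable with `df1` here.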

The second dataframe can be obtained with:

df2 <- df[grepl("^[^-.]*\\.[^-]*$", rownames(df)),]

Here, ^[^-.]*\.[^-]*$ matches a full string that contains no hyphens and at least one dot.

See the output:

> df2
      cell1 cell2
HLA.A     2     2
HLA.E     5     5

And the following to get the third dataframe:

df3 <- df[grepl("^[^-.]*-[^.]*$", rownames(df)),]

See the output:

> df3
      cell1 cell2
HLA-A     3     3
HLA-E     4     4

Here, ^[^-.]*-[^.]*$ matches a full string that contains no dots and at least one hyphen.
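Putting the three patterns together, the whole split can be sketched as below on the sample data from the question. The index names `is_both`, `is_dot` and `is_dash` are hypothetical, and `drop = FALSE` is an added precaution (not part of the answer's code) so the result stays a data frame even if `df` had a single column.

```r
# Sample data from the question
df <- data.frame(cell1 = c(0, 1, 2, 3, 4, 5, 6), cell2 = c(0, 1, 2, 3, 4, 5, 6))
rownames(df) <- c("CMP21-97G8.1", "RP11-34P13.7", "HLA.A", "HLA-A",
                  "HLA-E", "HLA.E", "RP11.442N24--B.1")
rn <- rownames(df)

# One logical index per group, using the three regexes from this answer
is_both <- grepl("-[^.]*\\.|\\.[^-]*-", rn)   # at least one dash and one dot
is_dot  <- grepl("^[^-.]*\\.[^-]*$", rn)      # dots only, no dashes
is_dash <- grepl("^[^-.]*-[^.]*$", rn)        # dashes only, no dots

df1 <- df[is_both, , drop = FALSE]
df2 <- df[is_dot,  , drop = FALSE]
df3 <- df[is_dash, , drop = FALSE]

# Sanity check: no row name falls into more than one group
stopifnot(all(is_both + is_dot + is_dash <= 1))
```

Computing the row names once and reusing the logical vectors avoids calling `rownames(df)` repeatedly, which may matter on a data frame as large as the one described in the question.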
