I have a big dataframe (70k rows by 200k columns) with some of the row names having dashes, some having periods, and some having both, something like this:
df <- data.frame(cell1 = c(0,1,2,3,4,5,6), cell2 = c(0,1,2,3,4,5,6))
rownames(df) <- c("CMP21-97G8.1", "RP11-34P13.7", "HLA.A", "HLA-A", "HLA-E", "HLA.E", "RP11.442N24--B.1")
                 cell1 cell2
CMP21-97G8.1         0     0
RP11-34P13.7         1     1
HLA.A                2     2
HLA-A                3     3
HLA-E                4     4
HLA.E                5     5
RP11.442N24--B.1     6     6
I want to make three df subgroups where one subgroup has the rownames with only periods (HLA.A/HLA.E), one with dash-only rownames (HLA-A/HLA-E), and one with both (CMP21-97G8.1/RP11-34P13.7/RP11.442N24--B.1). Something like this:
df1
                 cell1 cell2
CMP21-97G8.1         0     0
RP11-34P13.7         1     1
RP11.442N24--B.1     6     6
df2
      cell1 cell2
HLA.A     2     2
HLA.E     5     5
df3
      cell1 cell2
HLA-A     3     3
HLA-E     4     4
When I try to match periods and dashes, though, my patterns always seem to be "lazy": they only check whether a name contains a period or a dash, and they don't distinguish the names that contain both.
# Looks for either a dash or a period. Returns all three types.
df <- df[grepl("[-]|[.]",rownames(df)),]
# Tries to match only names containing both. Still returns all three types.
df <- df[grepl("[^-]*-([^.] ).*",rownames(df)),]
# Both of these return no rows: "&" has no special meaning inside a regex.
df <- df[grepl("[-]&[.]",rownames(df)),]
df <- df[grepl("[-]&&[.]",rownames(df)),]
Hopefully this makes sense and thanks for reading!
Answer:
You can use the following to get the first dataframe:
df1 <- df[grepl("-[^.]*\\.|\\.[^-]*-",rownames(df)),]
Output:
> df1
                 cell1 cell2
CMP21-97G8.1         0     0
RP11-34P13.7         1     1
RP11.442N24--B.1     6     6
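The -[^.]*\\.|\\.[^-]*- regex has two alternatives: a hyphen followed, after any number of non-dot characters, by a dot, or a dot followed, after any number of non-hyphen characters, by a hyphen. Either way, a name matches only if it contains both a hyphen and a dot.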
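Regular expressions have no "and" operator, which is why the "[-]&[.]" attempt in the question returns no rows. An alternative to the single regex is to run two simple tests and combine them with R's vectorised &. This is a minimal sketch, assuming the sample df from the question; the names rn, has_dash, has_dot, and df_both are only illustrative.

```r
# Sample data from the question
df <- data.frame(cell1 = c(0, 1, 2, 3, 4, 5, 6), cell2 = c(0, 1, 2, 3, 4, 5, 6))
rownames(df) <- c("CMP21-97G8.1", "RP11-34P13.7", "HLA.A", "HLA-A",
                  "HLA-E", "HLA.E", "RP11.442N24--B.1")

rn <- rownames(df)

# fixed = TRUE treats the pattern as a literal string, so "." needs no escaping
has_dash <- grepl("-", rn, fixed = TRUE)
has_dot  <- grepl(".", rn, fixed = TRUE)

# Combine the two logical vectors element-wise: keep rows whose name has both
df_both <- df[has_dash & has_dot, ]
```

On the sample data this selects the same rows as the regex above. With many row names, literal matching may also be somewhat cheaper than a regex, though that is worth measuring on the real data.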
The second dataframe can be obtained with:
df2 <- df[grepl("^[^-.]*\\.[^-]*$", rownames(df)),]
Here, ^[^-.]*\.[^-]*$ matches a full string that contains no hyphens and at least one dot.
See the output:
> df2
      cell1 cell2
HLA.A     2     2
HLA.E     5     5
And the following to get the third dataframe:
df3 <- df[grepl("^[^-.]*-[^.]*$", rownames(df)),]
See the output:
> df3
      cell1 cell2
HLA-A     3     3
HLA-E     4     4
Here, ^[^-.]*-[^.]*$ matches a full string that contains no dots and at least one hyphen.
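As a sanity check, it can help to confirm that the three patterns never put the same row in two groups, and to see which names (if any) match none of them. The sketch below assumes the sample df from the question; the variable names both, dot_only, dash_only, hits, and unmatched are only illustrative.

```r
# Sample data from the question
df <- data.frame(cell1 = c(0, 1, 2, 3, 4, 5, 6), cell2 = c(0, 1, 2, 3, 4, 5, 6))
rownames(df) <- c("CMP21-97G8.1", "RP11-34P13.7", "HLA.A", "HLA-A",
                  "HLA-E", "HLA.E", "RP11.442N24--B.1")

rn <- rownames(df)

# The three patterns from this answer, applied to the row names once each
both      <- grepl("-[^.]*\\.|\\.[^-]*-", rn)
dot_only  <- grepl("^[^-.]*\\.[^-]*$", rn)
dash_only <- grepl("^[^-.]*-[^.]*$", rn)

# Number of groups each row name falls into (logicals add as 0/1)
hits <- both + dot_only + dash_only

# Fails loudly if any row name lands in more than one group
stopifnot(all(hits <= 1))

# Row names with neither a dash nor a period end up in no group
unmatched <- rn[hits == 0]
```

For the sample data every name falls into exactly one group, so unmatched is empty. In the full data frame, any names containing neither character would show up in unmatched and would be absent from all three subsets.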