I have an operation I need to translate from dplyr
(and stringr
) in R to pandas
in python. It's quite simple in R but I haven't been able to wrap my head around it in pandas. Basically, I need to group by one (or more) columns, and then concatenate the remaining columns together and collapse them by a delimiter. R has the nicely vectorized str_c
function that does exactly what I want.
Here's the R code:
library(tidyverse)
df <- as_tibble(structure(list(file = c(1, 1, 1, 2, 2, 2), marker = c("coi", "12s", "16s", "coi", "12s", "16s"), start = c(1, 22, 99, 12, 212, 199), end = c(15, 35, 102, 150, 350, 1102)), row.names = c(NA, -6L), class = "data.frame") )
df %>%
group_by(file) %>%
summarise(markers = str_c(marker,"[",start,":",end,"]",collapse="|"))
#> # A tibble: 2 × 2
#> file markers
#> <dbl> <chr>
#> 1 1 coi[1:15]|12s[22:35]|16s[99:102]
#> 2 2 coi[12:150]|12s[212:350]|16s[199:1102]
Here's the beginning of the python code. I assume there's some trickery with agg
or transform
but I'm not sure how to combine and join the multiple columns:
from io import StringIO
import pandas as pd
s = StringIO("""
file,marker,start,end
1.f,coi,1,15
1.f,12s,22,35
1.f,16s,99,102
2.f,coi,12,150
2.f,12s,212,350
2.f,16s,199,1102
""")
df = pd.read_csv(s)
# ... now what? ...
CodePudding user response:
(df.astype(str)
.assign(markers = lambda df: df.marker "[" (df.start ":" df.end) "]")
.groupby('file', as_index=False)
.markers
.agg("|".join)
)
file markers
0 1.f coi[1:15]|12s[22:35]|16s[99:102]
1 2.f coi[12:150]|12s[212:350]|16s[199:1102]
The idea is to combine the columns first before grouping and aggregatiing with python's str.join method
CodePudding user response:
Create new column markers which concatenates marker and the last two columns separated by:
Groupby by file and concatenate the new column markers.
df['markers']=df['marker'] '[' (df.astype(str).iloc[:,2:].agg(list,1).str.join(':')) ']'
df.groupby('file')['markers'].apply(lambda x: x.str.cat(sep='|')).to_frame()
markers
file
1.f coi[1:15]|12s[22:35]|16s[99:102]
2.f coi[12:150]|12s[212:350]|16s[199:1102]