I'm stuck trying to get a substring from a list of strings using pandas. Basically, the application returns data in this way:
['com.server.application.service.sprint.Sprint@20137b52[id=8837,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_1,startDate=2022-02-21T13:07:00.000Z,endDate=2022-03-11T13:07:00.000Z,completeDate=2022-03-14T17:19:29.271Z,activatedDate=2022-02-21T20:57:03.111Z,sequence=8837,goal=,autoStartStop=false]', 'com.server.application.service.sprint.Sprint@5fcc83c9[id=8919,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_2,startDate=2022-03-14T14:52:00.000Z,endDate=2022-04-01T14:52:00.000Z,completeDate=2022-04-04T18:25:08.141Z,activatedDate=2022-03-14T20:52:24.680Z,sequence=8919,goal=,autoStartStop=false]']
This list has two items, and what I'm trying to do is get the sprint names name_of_the_sprint_1 and name_of_the_sprint_2, which come right after name=.
What I have done so far (I don't know if this is the best or only way to do it) is the following:
df['sprints'].iloc[idx][0].split(',')
so it creates a list where I can get the information I want. But I'll need to split it again (I'm going to find 'name=name_of_the_sprint_1' in this sublist) in order to get only the name I need.
Is there a better way to extract this information from my dataframe? I'll need to iterate over a dataframe with 3500 rows and do it for each item.
Thanks for the help, folks.
CodePudding user response:
The first thing that comes to mind is to slice the string, starting after the = and ending at the next comma. If the list of strings were named data, it might look like this:
data = ["whatever items, not important, name=your_thing_name, some more random stuff,", "even more random stuff, name=a_different_name, some more random things"]
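for d in data:
    # Start of the value: position of "name=" plus its length (5 characters)
    sub = d.index("name=") + 5
    # Slice up to the first comma after the value starts
    val = d[sub:sub + d[sub:].index(",")]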
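For what it's worth, the same slicing idea can be wrapped in a small helper that does not raise when the key or the trailing comma is missing. This is only a sketch: extract_name is a made-up name, and the sample strings are shortened versions of the ones in the question.

```python
def extract_name(s, key="name="):
    """Return the text between `key` and the next comma, or None if `key` is absent."""
    start = s.find(key)
    if start == -1:
        return None
    start += len(key)
    end = s.find(",", start)
    # If there is no comma after the value, take the rest of the string.
    return s[start:] if end == -1 else s[start:end]


# Shortened versions of the strings from the question (assumption: same layout).
sample = [
    "Sprint@20137b52[id=8837,state=CLOSED,name=name_of_the_sprint_1,goal=]",
    "Sprint@5fcc83c9[id=8919,state=CLOSED,name=name_of_the_sprint_2,goal=]",
]
names = [extract_name(s) for s in sample]
print(names)
```

Using str.find instead of str.index means a string without name= yields None rather than a ValueError, which may be easier to handle when looping over a few thousand rows.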
As far as performance goes, I ran this on 3500 strings and the total time measured was about 0.2 seconds (including the print calls):
from time import perf_counter as pc

start = pc()
# Build 3500 test strings
data = []
for i in range(3500):
    data.append(f"this, things, name={i}_loop, very cool, ik")
for d in data:
    sub = d.index("name=") + 5
    val = d[sub:sub + d[sub:].index(",")]
    print(val)
# Total elapsed time in seconds
print(pc() - start)
CodePudding user response:
A nested for loop will do the job if you arrange your code neatly. I have tried this with 7000 rows of your data:
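def function(df):
    result = []
    for i in df['sprints']:
        # Split each string into its comma-separated key=value parts
        split_string = i.split(',')
        for row in split_string:
            if 'name=' in row:
                # Drop the leading "name=" (5 characters)
                aa = row[5:]
                result.append(aa)
    return result
%timeit function(df)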
14.4 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
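If you end up needing more than the name, the same split-on-comma idea can turn each string into a dict of all its fields. A minimal sketch, assuming every string looks like Class@hash[key=value,...] and that no value contains a comma (a goal with a comma in it would break this); parse_fields is a hypothetical helper name.

```python
def parse_fields(s):
    """Split 'Class@hash[k1=v1,k2=v2,...]' into a dict of its key=value pairs."""
    # Keep only the text between the first '[' and the last ']'.
    body = s[s.index("[") + 1 : s.rindex("]")]
    fields = {}
    for item in body.split(","):
        # partition splits on the first '=' only, so empty values become "".
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


# Shortened version of one string from the question.
record = parse_fields(
    "Sprint@20137b52[id=8837,state=CLOSED,name=name_of_the_sprint_1,goal=,autoStartStop=false]"
)
print(record["name"])
```

Looking up record["name"] then avoids relying on the position of name= within each part.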
Extra
I've just realized that, since you already know the keyword you are looking for, you can just use re.search to get your output:
import re

def function(df):
    # Match the known prefix followed by one or more digits
    return [re.search(r"name_of_the_sprint_(\d+)", row).group() for row in df['sprints']]
%timeit function(df)
10.9 ms ± 328 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Or, if there are different names after name=, you can try this:
result = [re.search(r"name=\w+", row).group()[5:] for row in df['sprints']]
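Since the question asks about pandas specifically, the same regex can probably be applied without an explicit Python loop via the .str accessor. The sketch below assumes each cell of df['sprints'] holds a list of strings, as the indexing in the question suggests; the frame, the third sprint string, and the sprint_names column are made up for illustration.

```python
import pandas as pd

# Hypothetical frame mirroring the question: each cell holds a list of strings.
df = pd.DataFrame({
    "sprints": [
        ["Sprint@20137b52[id=8837,state=CLOSED,name=name_of_the_sprint_1,goal=]"],
        ["Sprint@5fcc83c9[id=8919,state=CLOSED,name=name_of_the_sprint_2,goal=]",
         "Sprint@1a2b3c4d[id=9001,state=ACTIVE,name=name_of_the_sprint_3,goal=]"],
    ]
})

# explode() gives one row per list element and repeats the original index.
flat = df["sprints"].explode()
# Capture everything after "name=" up to the next comma or closing bracket.
names = flat.str.extract(r"name=([^,\]]+)", expand=False)
# Regroup by the original row index to get one list of names per row.
df["sprint_names"] = names.groupby(level=0).agg(list)
print(df["sprint_names"].tolist())
```

If the cells are plain strings rather than lists, the explode and groupby steps can be dropped and str.extract applied to the column directly.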