Get a substring from a string with Python

I'm stuck trying to get a substring from a list of strings using pandas. Basically, the application returns data in this way:

['com.server.application.service.sprint.Sprint@20137b52[id=8837,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_1,startDate=2022-02-21T13:07:00.000Z,endDate=2022-03-11T13:07:00.000Z,completeDate=2022-03-14T17:19:29.271Z,activatedDate=2022-02-21T20:57:03.111Z,sequence=8837,goal=,autoStartStop=false]', 'com.server.application.service.sprint.Sprint@5fcc83c9[id=8919,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_2,startDate=2022-03-14T14:52:00.000Z,endDate=2022-04-01T14:52:00.000Z,completeDate=2022-04-04T18:25:08.141Z,activatedDate=2022-03-14T20:52:24.680Z,sequence=8919,goal=,autoStartStop=false]']

This list has two items, and I'm trying to get the sprint names, name_of_the_sprint_1 and name_of_the_sprint_2, which come after name=.

What I've done so far (I don't know if this is the best or only way to do it) is the following:

df['sprints'].iloc[idx][0].split(',') creates a list that contains the information I want. But I'll need to split it again (I have to find 'name=name_of_the_sprint_1' in this sublist) to get only the name I need.

Is there a better way to extract this information from my dataframe? I'll need to iterate over a dataframe with 3500 rows and do this for each item.

Thanks, folks, for the help.

CodePudding user response：

The first thing that comes to mind is to slice each string, starting just after name= and ending at the next comma. If the list of strings were named data, it might look like this:

data = ["whatever items, not important, name=your_thing_name, some more random stuff,", "even more random stuff, name=a_different_name, some more random things"]

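for d in data:
  # Index of the first character after "name=" (len("name=") == 5)
  sub = d.index("name") + 5
  # Slice up to the next comma after that position
  val = d[sub:sub + d[sub:].index(",")]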

As far as performance goes, I ran this on 3500 strings and the total time measured was about 0.2 seconds, including building the list and printing each value:

from time import perf_counter as pc

start = pc()

data = []
for i in range(3500):
  data.append(f"this, things, name={i}_loop, very cool, ik")

for d in data:
  sub = d.index("name") + 5
  val = d[sub:sub + d[sub:].index(",")]
  print(val)

print(pc() - start)
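
One caveat with the slicing above: d.index raises ValueError when a string has no "name" or no comma after the value. Here is a minimal sketch of a more forgiving variant, assuming the value ends at the next comma or closing bracket. The helper name extract_name and the shortened sample string are made up for illustration:

```python
def extract_name(text, key="name="):
    """Return the text after `key` up to the next ',' or ']', or None if `key` is absent."""
    # partition splits on the first occurrence of key; found is "" when key is missing
    _, found, rest = text.partition(key)
    if not found:
        return None
    # Cut at whichever delimiter comes first
    for i, ch in enumerate(rest):
        if ch in ",]":
            return rest[:i]
    # No delimiter: the value runs to the end of the string
    return rest


sample = "Sprint@20137b52[id=8837,state=CLOSED,name=name_of_the_sprint_1,goal=]"
print(extract_name(sample))  # name_of_the_sprint_1
```

Returning None instead of raising makes it easier to spot rows where the field is missing when looping over all 3500 rows.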

CodePudding user response：

A nested for loop works well if you arrange your code neatly. I tried this with 7000 rows of your data:

def function(df):
    result = []
    for i in df['sprints']:
        split_string = i.split(',')
        for row in split_string:
            if row.startswith('name='):
                # Drop the 5-character 'name=' prefix
                aa = row[5:]
                result.append(aa)
    return result

%timeit function(df)
14.4 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Extra

I've just realized that since you already know the keyword you're looking for, you can use re.search to get your output:

import re

def function(df):
    return [re.search(r'name_of_the_sprint_\d+', row).group() for row in df['sprints']]

%timeit function(df)
10.9 ms ± 328 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Or, if the names after name= vary, you can try this:

result = [re.search(r'name=\w+', row).group()[5:] for row in df['sprints']]
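
Since the data already lives in a dataframe, the same regex idea can also be applied without an explicit Python loop. This is a sketch using pandas' Series.str.extract, assuming each cell of the sprints column is a plain string. The sample strings are shortened and the output column name sprint_name is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "sprints": [
        "Sprint@20137b52[id=8837,state=CLOSED,name=name_of_the_sprint_1,goal=]",
        "Sprint@5fcc83c9[id=8919,state=CLOSED,name=name_of_the_sprint_2,goal=]",
    ]
})

# Capture everything after "name=" up to the next comma or closing bracket.
# expand=False returns a Series because the pattern has a single group.
df["sprint_name"] = df["sprints"].str.extract(r"name=([^,\]]+)", expand=False)

print(df["sprint_name"].tolist())
# ['name_of_the_sprint_1', 'name_of_the_sprint_2']
```

Rows that don't match come back as NaN instead of raising an error. If each cell holds a list, as in the question, df["sprints"].str[0] should pick the first element before extracting.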