Get a substring from a string with python

Time:04-07

I'm stuck trying to get a substring from a list of strings using pandas. Basically, the application returns data in this way:

['com.server.application.service.sprint.Sprint@20137b52[id=8837,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_1,startDate=2022-02-21T13:07:00.000Z,endDate=2022-03-11T13:07:00.000Z,completeDate=2022-03-14T17:19:29.271Z,activatedDate=2022-02-21T20:57:03.111Z,sequence=8837,goal=,autoStartStop=false]', 'com.server.application.service.sprint.Sprint@5fcc83c9[id=8919,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_2,startDate=2022-03-14T14:52:00.000Z,endDate=2022-04-01T14:52:00.000Z,completeDate=2022-04-04T18:25:08.141Z,activatedDate=2022-03-14T20:52:24.680Z,sequence=8919,goal=,autoStartStop=false]']

This list has two items, and what I'm trying to get is the sprint names, name_of_the_sprint_1 and name_of_the_sprint_2, which come right after name=.

What I did until now (I do not know if this is the best and only way to do it) is the following:

df['sprints'].iloc[idx][0].split(',') so it creates a list where I can get the information I want. But I'll need to split it again (I'm gonna find 'name=name_of_the_sprint_1' in this sublist) in order to get only the name I want and need.

Is there a better way to extract this information from my dataframe? I'll need to iterate over a dataframe with 3500 rows and do it for each item.

Thanks, folks, for the help.

CodePudding user response:

The first thing that comes to mind is to slice the string, starting after the = and ending at the next comma. If the list of strings were named data, it might look like this:

data = ["whatever items, not important, name=your_thing_name, some more random stuff,", "even more random stuff, name=a_different_name, some more random things"]

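for d in data:
  # position of the first character after "name="
  sub = d.index("name=") + 5
  # slice up to the next comma after that position
  val = d[sub:sub + d[sub:].index(",")]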
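The slicing idea above can be wrapped in a small helper that also copes with a missing key and with a value that ends at the closing bracket rather than a comma. This is only a sketch: `extract_field` is a hypothetical name, and the sample string is a shortened version of the data in the question.

```python
def extract_field(text, key="name="):
    """Return the text after `key`, up to the next ',' or ']' (None if key is absent)."""
    start = text.find(key)
    if start == -1:
        return None
    start += len(key)
    end = len(text)
    # Stop at whichever delimiter comes first after the key.
    for sep in (",", "]"):
        pos = text.find(sep, start)
        if pos != -1:
            end = min(end, pos)
    return text[start:end]


# Shortened sample in the same shape as the question's data.
sample = (
    "com.server.application.service.sprint.Sprint@20137b52"
    "[id=8837,rapidViewId=7061,state=CLOSED,name=name_of_the_sprint_1,"
    "startDate=2022-02-21T13:07:00.000Z]"
)
print(extract_field(sample))  # name_of_the_sprint_1
```

Note that `find("name=")` matches the first occurrence of those five characters, so a field such as `surname=` appearing earlier in the string would be picked up instead.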

As far as performance goes, I ran this and the total time measured was about 0.2 seconds (that includes building the 3500 test strings and printing each result):

from time import perf_counter as pc

start = pc()

data = []
for i in range(3500):
  data.append(f"this, things, name={i}_loop, very cool, ik")

for d in data:
  sub = d.index("name=") + 5
  val = d[sub:sub + d[sub:].index(",")]
  print(val)

print(pc() - start)
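To time only the extraction, without the list building and the print calls, something like the following could be used. It is a rough sketch with the standard `timeit` module; `extract_all` is a hypothetical helper name, and the numbers will vary by machine.

```python
import timeit

# Same synthetic test strings as in the timing run above.
data = [f"this, things, name={i}_loop, very cool, ik" for i in range(3500)]


def extract_all(rows):
    """Slice out the value after 'name=' for every string in rows."""
    out = []
    for d in rows:
        sub = d.index("name=") + 5
        # str.index accepts a start offset, so no intermediate slice is needed.
        out.append(d[sub:d.index(",", sub)])
    return out


# Run the extraction 100 times and report the average per pass.
elapsed = timeit.timeit(lambda: extract_all(data), number=100)
print(f"{elapsed / 100 * 1000:.3f} ms per pass over {len(data)} strings")
```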

CodePudding user response:

A nested for loop works well if you arrange your code neatly. I tried this with 7000 rows of your data:

def function(df):
    result = []
    for i in df['sprints']:
        split_string = i.split(',')
        for row in split_string:
            if 'name=' in row: 
                aa = row[5:]
                result.append(aa)
    return result

%timeit function(df)
14.4 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
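In the question, each cell appears to hold a list of sprint strings rather than a single string (the asker indexes it with `[0]`). Under that assumption, the same split-and-check idea could be applied per cell, as in this sketch; `names_from_cell` is a hypothetical helper and the sample strings are shortened from the question.

```python
def names_from_cell(cell):
    """Collect the name= value from every sprint string in one cell (a list of strings)."""
    names = []
    for sprint in cell:
        # Drop the trailing ']' so the last field splits cleanly.
        for part in sprint.rstrip("]").split(","):
            if part.startswith("name="):
                names.append(part[len("name="):])
    return names


# Two cells: one with two sprints, one empty.
cells = [
    [
        "Sprint@20137b52[id=8837,name=name_of_the_sprint_1,goal=,autoStartStop=false]",
        "Sprint@5fcc83c9[id=8919,name=name_of_the_sprint_2,goal=,autoStartStop=false]",
    ],
    [],
]
print([names_from_cell(c) for c in cells])
# [['name_of_the_sprint_1', 'name_of_the_sprint_2'], []]
```

If the column really does contain lists, this could presumably be applied with `df['sprints'].apply(names_from_cell)`.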

Extra

I've just realized that since you already know the keyword you are looking for, you can just use re.search to get your output:

import re

def function(df):
    # match the literal prefix followed by one or more digits
    return [re.search(r'name_of_the_sprint_\d+', row).group() for row in df['sprints']]

%timeit function(df)
10.9 ms ± 328 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Or, if the names after name= vary, you can try this:

result = [re.search(r'name=\w+', row).group()[5:] for row in df[0]]
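A variation on the regex approach is to use a capture group, so the `name=` prefix does not have to be sliced off afterwards, and to match up to the next comma or closing bracket, so names containing spaces or punctuation are not cut short. The sketch below uses only the standard `re` module; `sprint_names` is a hypothetical helper, and the second sample name is invented purely to show the difference from `\w+`.

```python
import re

# Capture everything after "name=" up to the next comma or closing bracket.
# \b keeps keys that merely end in "name" (e.g. "surname=") from matching.
NAME_RE = re.compile(r"\bname=([^,\]]*)")


def sprint_names(rows):
    """Return the captured name for each row, or None when there is no match."""
    names = []
    for row in rows:
        m = NAME_RE.search(row)
        names.append(m.group(1) if m else None)
    return names


rows = [
    "Sprint@20137b52[id=8837,state=CLOSED,name=name_of_the_sprint_1,goal=]",
    # Hypothetical name with spaces and parentheses, which \w+ would truncate.
    "Sprint@5fcc83c9[id=8919,state=CLOSED,name=Sprint 42 (final),goal=]",
]
print(sprint_names(rows))
# ['name_of_the_sprint_1', 'Sprint 42 (final)']
```

If the column holds plain strings, the same pattern can also be passed to pandas' vectorized `Series.str.extract`, which avoids the explicit Python loop.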