Extracting string after particular pattern

"C:\Users\Adam\Desktop\Stock Trackers\Stock Tracker WK39 NYC Beauty.xlsx"

I would like to extract everything AFTER "Stock Tracker WK39 ", as this represents the company name. However, the number after "WK" changes from file to file, so I can't just use e.g.:

str.extract('Stock Tracker WK39 (.*)')

How can I rewrite the above so that "39" is a placeholder that matches any number (including single digits for weeks 1-9), so that the script always ignores everything up to and including "Stock Tracker WKXX " and only grabs what comes after the space?

Bear in mind that "NYC Beauty" contains a space, but other company names won't, e.g. "ProformaUnlimited".

CodePudding user response：

I would instead use the os module to get your file names, and then extract the information you want from each name. Since you said the name always starts with "Stock Tracker WK", I am going to rely on that assumption.

import os

directory = "C:\\Users\\Adam\\Desktop\\Stock Trackers"
files = os.listdir(directory)
prefix = "Stock Tracker WK"

weekNumbers = []
companyNames = []
for file in files:
    # Skip files that don't follow the "Stock Tracker WK<number> <company>" convention.
    if not file.startswith(prefix):
        continue
    # Split what follows the prefix at the first space: "39" and "NYC Beauty.xlsx".
    week, _, company = file[len(prefix):].partition(" ")
    weekNumbers.append(week)
    companyNames.append(company)

print(weekNumbers)
print(companyNames)

NOTE: files in this folder that don't follow the naming convention are skipped. The company names still include the file extension (e.g. "NYC Beauty.xlsx").
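
The same prefix-based idea can be sketched for a single path string, without listing a folder. This is only an illustration: the function name split_tracker_name and the decision to drop the file extension are assumptions, not part of the answer above. It uses pathlib.PureWindowsPath so that backslash-separated paths are parsed the same way on any operating system.

```python
from pathlib import PureWindowsPath

PREFIX = "Stock Tracker WK"


def split_tracker_name(path):
    """Return (week, company) for a tracker path, or None if the name doesn't match.

    Hypothetical helper: assumes names look like "Stock Tracker WK<number> <company>.<ext>".
    """
    # .stem is the file name without its folder and without its extension.
    stem = PureWindowsPath(path).stem
    if not stem.startswith(PREFIX):
        return None
    # Split once at the first space, so company names may contain spaces.
    week, _, company = stem[len(PREFIX):].partition(" ")
    return week, company


print(split_tracker_name(r"C:\Users\Adam\Desktop\Stock Trackers\Stock Tracker WK39 NYC Beauty.xlsx"))
# ('39', 'NYC Beauty')
```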

CodePudding user response：

You can use a regex to extract exactly those values. An example for your case can be seen here https://regex101.com/r/6u2xBz/1

The code sample:

import re

# \d+ matches one or more digits, so single-digit weeks (WK1-WK9) work too.
regex = r"^.*Stock Tracker WK(?P<week_no>\d+) (?P<name>.*)$"

test_str = "C:\\Users\\Adam\\Desktop\\Stock Trackers\\Stock Tracker WK39 NYC Beauty.xlsx"

matches = re.search(regex, test_str)

if matches:
    print("Match was found at {start}-{end}: {match}".format(start=matches.start(), end=matches.end(), match=matches.group()))

    # Groups are numbered from 1: group 1 is week_no, group 2 is name.
    for groupNum in range(1, len(matches.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=matches.start(groupNum), end=matches.end(groupNum), group=matches.group(groupNum)))
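
For everyday use, the same idea can be wrapped in a small function. The sketch below is a minimal variant, not the answerer's code: the name parse_tracker is hypothetical, and the pattern drops the leading "^.*" because re.search already scans the whole string. The key part is \d+, which accepts one or more digits.

```python
import re

# \d+ = one or more digits; (.*) captures everything after the following space.
PATTERN = re.compile(r"Stock Tracker WK(?P<week_no>\d+) (?P<name>.*)$")


def parse_tracker(path):
    """Return (week number, remainder of the file name), or None if there is no match."""
    match = PATTERN.search(path)
    if match is None:
        return None
    return int(match.group("week_no")), match.group("name")


print(parse_tracker(r"C:\Users\Adam\Desktop\Stock Trackers\Stock Tracker WK39 NYC Beauty.xlsx"))
# (39, 'NYC Beauty.xlsx')
print(parse_tracker("Stock Tracker WK5 ProformaUnlimited.xlsx"))
# (5, 'ProformaUnlimited.xlsx')
```

Since the question uses pandas' str.extract, the equivalent pattern there would presumably be 'Stock Tracker WK\d+ (.*)', with the single capture group holding the company name.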

CodePudding user response：

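text = r'C:\Users\Adam\Desktop\Stock Trackers\Stock Tracker WK39 NYC Beauty.xlsx'  # raw string: '\U' is otherwise an invalid escape
text = text.split('WK')        # [..., '39 NYC Beauty.xlsx']
text = text[-1]                # the part after the last 'WK'
text = text.partition(' ')[2]  # drop the week number (one or two digits) and the space after it

Compact:

text = r'C:\Users\Adam\Desktop\Stock Trackers\Stock Tracker WK39 NYC Beauty.xlsx'
text = text.split('WK')[-1].partition(' ')[2]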

CodePudding user response：

I'm sure others will propose a regex solution, but if you don't want to use regex, and assuming the part of the path before "WK39" never contains another "WK", you can achieve this with a couple of splits, e.g.:

x = r"C:\Users\Adam\Desktop\Stock Trackers\Stock Tracker WK39 NYC Beauty.xlsx"
x.split('WK', 1)[-1].split(' ', 1)[1]

The first split on "WK" leaves you with "39 NYC Beauty.xlsx", and the second leaves you with "NYC Beauty.xlsx".

This uses the maxsplit parameter of split: .split(' ', 1) splits only on the first space it encounters, so "NYC Beauty.xlsx" is left as a single string.
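
To make the effect of maxsplit concrete, here is a small sketch comparing a plain split with a split limited to one cut, followed by the two-split approach applied to a single-digit week and a company name without spaces. The helper name name_after_week is an assumption used only for illustration.

```python
remainder = "39 NYC Beauty.xlsx"

# Without maxsplit, every space is a cut point and the company name is broken up.
print(remainder.split(" "))     # ['39', 'NYC', 'Beauty.xlsx']
# With maxsplit=1, only the first space is used.
print(remainder.split(" ", 1))  # ['39', 'NYC Beauty.xlsx']


def name_after_week(path):
    """Hypothetical helper: assumes 'WK' first appears right before the week number."""
    after_wk = path.split("WK", 1)[-1]  # e.g. '5 ProformaUnlimited.xlsx'
    return after_wk.split(" ", 1)[1]    # everything after the first space


print(name_after_week(r"C:\Users\Adam\Desktop\Stock Trackers\Stock Tracker WK5 ProformaUnlimited.xlsx"))
# ProformaUnlimited.xlsx
```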