Extracting specific words from the middle of a string without using the startswith function?

I am trying to extract the names of variables from a list of strings. I have declared a list of terminator characters, list_end. In the for loop, I want to read each string in the list; wherever "int" is found in a string, the text that starts after "int" and ends at one of the characters in list_end should be appended to the initially empty list list_variable.

def custom_endswith(d, list_end):
    for end_pattern in list_end:
        if d.endswith(end_pattern):
            return True
    return False

list_end = ["=",";",")","("]
list_variable = []

data = ["int zt;",
  "public int w = 3;",
  "public final int nu;public a(d dVar, int i, int i2);",
  "public int getScoreOrder() {int getInteger;",
  "for (int i = 0; i < this.nu; i++)",
  "{public a(d dVar, int t) {super(dVar, i);",
  "private int z = true ;if (getType() != 1) {z = false;",
  "protected int g = true;",
  "unprotected int z = true;if (getType() != 1) {z = false;",
  "public int getType() {return getInteger();",
  "int y;",
  "print (int i) {int k = b.k(parcel);"]

for d in data:
    if d.startswith("int") and custom_endswith(d, list_end):
        list_variable.append(d[4:-1])

But my script only extracts variable names from strings that start with "int". When "int" appears in the middle of a string, nothing is extracted. Is this because I used the .startswith() method in the script? If not, what other function can I use to extract variable names from the middle of a string?

Output of my script:
print(list_variable)
Output: ['zt', 'y']

In fact, it should give output like:

print(list_variable)
Output: ['zt','w','nu','i','i2','getScoreOrder','getInteger','i','t','z','g','z','getType','y','i','k']

CodePudding user response：

You can do that with regex

import re

# After "int" and at least one whitespace character, capture one or more
# characters that are not one of = ; : ( )
regex = re.compile(r'int\s+([^=;:\(\)]+)')


result = []
for line in data:
    result.extend(regex.findall(line))
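
A character class that excludes only the terminators still admits spaces and commas, so a capture can carry trailing whitespace or run across a comma-separated parameter list. As a hedged variant (not the answerer's pattern), the sketch below anchors "int" with a word boundary and captures only word characters. The name INT_NAME and the shortened data list are assumptions made for illustration.

```python
import re

# A few sample lines taken from the question's data list.
data = [
    "int zt;",
    "public int w = 3;",
    "public final int nu;public a(d dVar, int i, int i2);",
    "print (int i) {int k = b.k(parcel);",
]

# \b keeps "int" from matching inside "print"; \w+ stops at the first
# character that is not a letter, digit, or underscore.
INT_NAME = re.compile(r"\bint\s+(\w+)")

names = []
for line in data:
    names.extend(INT_NAME.findall(line))

print(names)  # ['zt', 'w', 'nu', 'i', 'i2', 'i', 'k']
```

Because the capture group is \w+, no explicit list of terminator characters is needed here; that is a design choice of this sketch, and it would not fit identifiers containing characters outside \w.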

CodePudding user response：

Yes, it is because you used the .startswith() and .endswith() methods: they only check whether the string begins (or ends) with the given substring, so an "int" in the middle of the string is never seen. You can check whether a string contains another string like this:

for d in data:
    if "int" in d and ... :
        ...

If you are doing custom string extraction anyway, you could use a regular expression, which searches the whole string for you:

int\s(\w+)\s*[=;\)\(]

Match int
Match whitespace between int and the next part (like a space)
Match the variable/function name
Match any one of these characters: =;)(

Then you can use it in the code like this:

import re

re.findall(r"int\s(\w+)\s*[=;\)\(]", "public int getScoreOrder() {int getInteger;")

One final note: it appears you're trying to parse Java code. Consider using a proper Java parser library instead, which can do the heavy lifting for you.

CodePudding user response：

Using re is probably the best way to do this - with one caveat. Your list_end characters will be embedded in the re pattern. If you ever needed to change the "list end" characters then you risk breaking the pattern (unless you're very familiar with regular expressions).
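
One way to soften that caveat is to build the character class from list_end instead of typing it into the pattern by hand. The following is a minimal sketch under some assumptions: the comma in list_end and the names end_class and pattern are illustrative additions, and the \bint\s+(\w+) prefix is one possible choice rather than the pattern from the answers above.

```python
import re

# Terminator characters; the comma is an assumption added so that
# parameter lists such as "int i, int i2" are split correctly.
list_end = ["=", ";", ")", "(", ","]

# re.escape neutralizes characters such as ")" and "(" that would
# otherwise be special in a pattern, so list_end can change freely.
end_class = "[" + "".join(re.escape(ch) for ch in list_end) + "]"
pattern = re.compile(r"\bint\s+(\w+)\s*" + end_class)

line = "public final int nu;public a(d dVar, int i, int i2);"
found = pattern.findall(line)
print(found)  # ['nu', 'i', 'i2']
```

With this shape, editing list_end changes which characters may follow a name without anyone touching the regular expression itself.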

It might be useful to show a step-wise approach without re or, indeed, any other imports. (Note: to get the required output, list_end needs to include a comma.)

list_end = {'=', ';', ')', '(', ','}

data = ["int zt;",
  "public int w = 3;",
  "public final int nu;public a(d dVar, int i, int i2);",
  "public int getScoreOrder() {int getInteger;",
  "for (int i = 0; i < this.nu; i++)",
  "{public a(d dVar, int t) {super(dVar, i);",
  "private int z = true ;if (getType() != 1) {z = false;",
  "protected int g = true;",
  "unprotected int z = true;if (getType() != 1) {z = false;",
  "public int getType() {return getInteger();",
  "int y;",
  "print (int i) {int k = b.k(parcel);"]

INT = 'int'

vnames = []

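for line in data:
    offset = 0
    # find each occurrence of 'int' in the not-yet-scanned part of the line
    while (o := line[offset:].find(INT)) >= 0:
        vname = ''
        offset += o + len(INT)  # move just past this 'int'
        for i in line[offset:]:
            if not i.isspace():
                if i in list_end:
                    break
                vname += i
        # vname is empty when 'int' is followed directly by a terminator, as in "print ("
        if vname:
            vnames.append(vname)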

print(*vnames)

Output:

zt w nu i i2 getScoreOrder getInteger i t z g z getType y i k

Note:

The isspace() test is there to allow for any amount of whitespace between 'int' and the variable name. The code can be simplified if it's reasonable to assume that there's exactly one space between those two tokens.
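
As an illustration of that single-space assumption, here is one possible shape for the loop. It is a sketch, not the code above: the helper name names_after_int and the constant MARKER are hypothetical, and the function stops a name at the first whitespace character as well as at a terminator.

```python
list_end = {"=", ";", ")", "(", ","}
MARKER = "int "  # assumes exactly one space between "int" and the name


def names_after_int(line):
    """Collect the token that follows each "int " in line."""
    names = []
    start = line.find(MARKER)
    while start >= 0:
        start += len(MARKER)
        end = start
        # advance until whitespace, a terminator, or the end of the line
        while (end < len(line)
               and line[end] not in list_end
               and not line[end].isspace()):
            end += 1
        if end > start:  # skip empty names, e.g. the "int " inside "print ("
            names.append(line[start:end])
        start = line.find(MARKER, end)
    return names


print(names_after_int("print (int i) {int k = b.k(parcel);"))  # ['i', 'k']
```

Passing a start index to str.find avoids re-slicing the line on every pass through the loop.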