How do I find the line and column offsets for imported Python modules?

I have a class (based on this answer) that uses ast.NodeVisitor to get a list of modules imported by a Python file. However, I also want to return the line and column offsets for where the module names are located in the file.

Code:

import ast

class ImportFinder(ast.NodeVisitor):
    def __init__(self):
        self.imports = []

    def visit_Import(self, node):
        for i in node.names:
            self.imports.append({'import_type': "import", 'module': i.name,})

    def visit_ImportFrom(self, node):
        self.imports.append({'import_type': "from", 'module': node.module})

def parse_imports(source):
    tree = ast.parse(source)
    finder = ImportFinder()
    finder.visit(tree)
    return finder.imports


# Example usage
sample_file = '''
from foo import bar, baz, frob
import bar.baz
import   bar.foo as baf
'''
parsed_imports = parse_imports(sample_file)
for i in parsed_imports:
    print(i)

Current output:

{'import_type': 'from', 'module': 'foo'}
{'import_type': 'import', 'module': 'bar.baz'}
{'import_type': 'import', 'module': 'bar.foo'}

Desired output:

{'import_type': 'from', 'module': 'foo', 'line': 2, 'column_offset': 5}
{'import_type': 'import', 'module': 'bar.baz', 'line': 3, 'column_offset': 7}
{'import_type': 'import', 'module': 'bar.foo', 'line': 4, 'column_offset': 9}

How do I get the line and column offsets for imported Python module names?

CodePudding user response：

You might consider this as a starting point. It doesn't handle continuation lines, but it would be a Machiavellian coder who wrote:

import \
    os

You could handle that by using a filter function to combine the continuations and yield the longer lines.
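A minimal sketch of such a filter, assuming only backslash continuations need joining (`logical_lines` is a made-up name, not part of the code below):

```python
def logical_lines(source):
    """Yield (line_number, text) pairs, joining backslash continuations.

    line_number is the 1-based number of the first physical line
    of each joined line.
    """
    pending = []   # pieces collected so far for the current logical line
    start = None   # physical line number where the logical line began
    for no, line in enumerate(source.splitlines(), start=1):
        if start is None:
            start = no
        stripped = line.rstrip()
        if stripped.endswith("\\"):
            # Drop the trailing backslash and keep collecting.
            pending.append(stripped[:-1])
            continue
        pending.append(line)
        yield start, "".join(pending)
        pending, start = [], None
    if pending:
        # The source ended in the middle of a continuation.
        yield start, "".join(pending)
```

Column offsets found in a joined line refer to the joined text, not to the physical line the name was on, so they would still need mapping back. A backslash at the end of a comment would also be misread as a continuation.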

import re

def parse_imports(source):
    hits = []
    source = re.sub(r"'''.*?'''", "", source, flags=re.DOTALL)
    source = re.sub(r'""".*?"""', "", source, flags=re.DOTALL)
    for no,line in enumerate(source.splitlines()):
        ls = line.lstrip()
        if ls.startswith( "from " ):
            p1 = ls.split()
            mod = p1[1].rstrip()
            i1 = line.find(mod)
            hits.append({
                "import_type": p1[0],
                "module": mod,
                "line": no + 1,
                "column_offset": i1
            })
        elif ls.startswith( "import " ):
            cl = ls.split(',')
            p1 = cl[0].split()
            for mod in [p1[1]] + [c.strip().split()[0] for c in cl[1:]]:
                i1 = line.find(mod)
                hits.append({
                    "import_type": p1[0],
                    "module": mod,
                    "line": no + 1,
                    "column_offset": i1
                })
    return hits

# Example usage
sample_file = '''
from foo import bar, baz, frob
import bar.baz
import   bar.foo as baf
import  os,re,  sys
'''
parsed_imports = parse_imports(sample_file)
for i in parsed_imports:
    print(i)

Output:

{'import_type': 'from', 'module': 'foo', 'line': 2, 'column_offset': 5}
{'import_type': 'import', 'module': 'bar.baz', 'line': 3, 'column_offset': 7}
{'import_type': 'import', 'module': 'bar.foo', 'line': 4, 'column_offset': 9}
{'import_type': 'import', 'module': 'os', 'line': 5, 'column_offset': 8}
{'import_type': 'import', 'module': 're', 'line': 5, 'column_offset': 11}
{'import_type': 'import', 'module': 'sys', 'line': 5, 'column_offset': 16}

Note -- I've just noticed a bug here. I strip out all triple-quoted strings, but I don't compensate for those missing lines in the line count. That'll be tricky.
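One possible way around the line-count problem, as a sketch only: replace each triple-quoted string with the newlines it contained, so every later line keeps its number. `blank_triple_quoted` is a hypothetical helper, and the regex can still be fooled by triple quotes inside comments or other strings.

```python
import re

# Matches '''...''' or """...""" across lines, non-greedily.
TRIPLE_QUOTED = re.compile("'''.*?'''" + "|" + '""".*?"""', re.DOTALL)

def blank_triple_quoted(source):
    """Replace each triple-quoted string with only its newlines."""
    return TRIPLE_QUOTED.sub(lambda m: "\n" * m.group(0).count("\n"), source)
```

Calling this in place of the two `re.sub` lines keeps `no + 1` in step with the original file. Column offsets on the line where a string ends will still shift.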

CodePudding user response：

As of Python 3.10, ast.alias objects carry location attributes (lineno, col_offset, end_lineno and end_col_offset). That solves your problem for import statements, because each imported name in an import statement is represented as an ast.alias object.

Unfortunately, that doesn't help with from ... import: in an ImportFrom object, the module is an identifier, which is a plain string without attributes. (The names imported from the module are ast.alias objects, so each of those does have location information. But you want the location of the module name.)

Still, the statement node itself has lineno and col_offset attributes, even before 3.10, and since Python 3.8 it also has end_lineno and end_col_offset, so together they tell you where the statement starts and ends. You could use that information to extract a slice consisting only of the from ... import statement, and then use the tokenize module to get the second token in that statement. (The first token is the from keyword; for a relative import such as from .pkg import x, you would also need to skip the leading dots.) That's a bit clunky, but it's got to be easier and more reliable than trying to attack Python source with regular expressions.
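A rough sketch of that idea, assuming Python 3.8+ (for end_lineno) and ASCII source, since ast reports byte offsets while tokenize reports character columns. `locate_from_module` and `parse_from_imports` are made-up names:

```python
import ast
import io
import tokenize

def locate_from_module(source, node):
    """Return (line, column) of the module name in an ast.ImportFrom node,
    or None for a purely relative import such as `from . import x`."""
    if node.module is None:
        return None
    lines = source.splitlines(keepends=True)
    # Physical lines spanned by the statement.
    text = "".join(lines[node.lineno - 1:node.end_lineno])
    seen_from = False
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        row, col = tok.start
        if not seen_from:
            # Look for the `from` keyword that starts this statement.
            seen_from = (tok.type == tokenize.NAME and tok.string == "from"
                         and row == 1 and col == node.col_offset)
        elif tok.type == tokenize.NAME:
            # First name after `from`; leading dots are OP tokens and are skipped.
            return node.lineno + row - 1, col
    return None

def parse_from_imports(source):
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            pos = locate_from_module(source, node)
            if pos is not None:
                hits.append({"import_type": "from", "module": node.module,
                             "line": pos[0], "column_offset": pos[1]})
    return hits
```

The function returns as soon as it finds the name, so anything later on the same physical lines is never tokenized.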