I wrote a lexical analyzer for C++ code in Python, but the problem is that when I use input.split(" ") it won't recognize code like x=2 or function() as separate tokens unless I manually add a space between them, like x = 2 . It also fails to recognize the tokens at the beginning of each line. (If I add spaces between every two tokens and also at the beginning of each line, my code works correctly.)
I tried splitting the code first by lines and then by spaces, but it got complicated and still didn't solve the first problem. I also thought about splitting on the operators, but I couldn't actually implement it, and since I need the operators themselves to be recognized as tokens, that might not be a good idea anyway. I would appreciate any solution or suggestion. Thank you.
f=open("code.txt")
input=f.read()
input=input.split(" ")
f=open("code.txt")
input=f.read()
input1=input.split("\n")
for var in input1:
var=var.split(" ")
CodePudding user response:
If you try to split an expression like x=2 the same way as x = 2 , a plain split on spaces obviously isn't going to work. What you are looking for is a solution that handles both, right?
A basic solution is to combine the conditions you need to parse with the and operator. Note that this approach isn't scalable and doesn't qualify as good practice, but it can help you work towards better (though harder) solutions.
if input.split(' ') and input.split('='):
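One way to read that suggestion (this is my own reading, not the answerer's exact code) is to split on spaces first and then split each piece around '=' , keeping the '=' itself as a token:

tokens = []
for chunk in "x=2".split(" "):
    if "=" in chunk:
        # partition keeps the separator, so '=' survives as its own token
        left, _, right = chunk.partition("=")
        tokens.extend(t for t in (left, "=", right) if t)
    elif chunk:
        tokens.append(chunk)

print(tokens)  # ['x', '=', '2']

This only handles a single '=' per chunk, which is why the answer calls it a non-scalable stopgap.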
An intermediate solution would be to use regex. Regex isn't an easy topic, but you can check out the online documentation, and there are excellent online tools such as Regex 101 for testing your patterns.
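As a minimal sketch of the regex route (my own illustration, with a deliberately incomplete token set), re.findall can pull identifiers, numbers and single-character operators out of a line without requiring spaces between them:

import re

# Very small sketch: identifiers, integer literals, and a handful of
# single-character C++ operators/punctuators. A real lexer needs many more
# patterns (multi-character operators, string literals, comments, ...).
TOKEN_RE = re.compile(r'[A-Za-z_]\w*|\d+|[=+\-*/(){};,<>]')

line = "x=2; foo(x);"
print(TOKEN_RE.findall(line))
# ['x', '=', '2', ';', 'foo', '(', 'x', ')', ';']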
The last option would be to convert your input into an AST, which stands for abstract syntax tree. This is the technique employed by C and C++ compilers such as Clang. It is a genuinely hard topic, and probably very time-consuming if all you need is a basic lexer, but it may fit your needs.
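Just to make the term concrete (a toy illustration of my own, not a real compiler data structure): once a lexer has produced tokens such as ['x', '=', '2'], a parser groups them into a tree of nodes:

from dataclasses import dataclass

# Toy AST node for an assignment like "x = 2"; real compiler ASTs
# have many node kinds (expressions, statements, declarations, ...).
@dataclass
class Assign:
    target: str
    value: str

tokens = ["x", "=", "2"]
node = Assign(target=tokens[0], value=tokens[2])
print(node)  # Assign(target='x', value='2')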
CodePudding user response:
The usual approach is to scan the incoming text from left to right. At each character position, the lexical analyser selects the longest string which fits some pattern for a "lexeme", which is either a token or ignored input (whitespace and comments, for example). The scan then resumes at the first character after that lexeme.
Patterns are often described using regular expressions, but the standard regular expression library is not as much help as it could be for this procedure, because it does not have the facility of checking multiple regular expressions in parallel. Or, more precisely, it can check multiple expressions in parallel (using alternation syntax, (...|...|...)), but it lacks an interface which can report which of the alternatives was matched. [Note 1]. So it would be necessary to try every possible pattern one at a time and select whichever one turns out to have the longest match.
Note that the matches are always anchored at the current input point; the lexical analyser does not search for a matching pattern. Every input character becomes part of some lexeme, even if that lexeme is ignored, and lexemes do not overlap.
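A hand-rolled version of that procedure might look like this (a sketch of my own, with a deliberately tiny pattern set; the names TOKEN_SPECS and tokenize are illustrative, not part of the answer):

import re

# Try each pattern anchored at the current position and keep the longest match.
# Tiny illustrative pattern set; a real C++ lexer needs far more entries.
TOKEN_SPECS = [
    ("ID",     re.compile(r"[A-Za-z_]\w*")),
    ("NUMBER", re.compile(r"\d+")),
    ("OP",     re.compile(r"==|[=+\-*/(){};]")),
    ("SKIP",   re.compile(r"\s+")),          # whitespace: matched but ignored
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        best = None
        for kind, pattern in TOKEN_SPECS:
            m = pattern.match(text, pos)     # anchored at the current position
            if m and (best is None or m.end() > best[1].end()):
                best = (kind, m)
        if best is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind, m = best
        if kind != "SKIP":
            yield kind, m.group()
        pos = m.end()                        # resume right after the lexeme

print(list(tokenize("x=2;\nfoo(x);")))
# [('ID', 'x'), ('OP', '='), ('NUMBER', '2'), ('OP', ';'),
#  ('ID', 'foo'), ('OP', '('), ('ID', 'x'), ('OP', ')'), ('OP', ';')]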
You can write such an analyser by hand for a simple language, but it's usually easier to build one automatically using software designed for that purpose. These have been around for a long time -- Lex was written almost 50 years ago, for example -- and if you are planning on writing more than one lexical analyser, you would be well advised to investigate some of the available tools.
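For Python, one such tool is PLY, a reimplementation of Lex/Yacc. A minimal sketch, assuming PLY is installed (pip install ply) and using an intentionally tiny token set, could look like this:

import ply.lex as lex

# Token names; a real C++ lexer would need many more definitions.
tokens = ("ID", "NUMBER", "ASSIGN", "LPAREN", "RPAREN", "SEMI")

t_ASSIGN = r"="
t_LPAREN = r"\("
t_RPAREN = r"\)"
t_SEMI   = r";"
t_ID     = r"[A-Za-z_]\w*"
t_ignore = " \t\n"          # characters to skip between tokens

def t_NUMBER(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    print(f"Illegal character {t.value[0]!r}")
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("x=2;\nfoo(x);")
for tok in lexer:
    print(tok.type, tok.value)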
Notes
1. The PCRE2 and Oniguruma regex libraries provide a "callout" feature which I believe could be used for this purpose. I haven't actually seen it used in lexical analysis, but it's a fairly recent addition, particularly for Oniguruma.