Note: this document is still a work in progress.

The parser provides the following functionality:

Tokenize

The tokeniser takes a string as input and breaks it up into tokens: identifiers, operators etc.
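As a rough illustration of this step, here is a minimal sketch. It is not TOra's tokeniser: the function name tokenize is hypothetical, and it assumes that identifiers and numbers are runs of letters, digits and underscores, and that every other non-space character is a one-character token. String literals, comments and multi-character operators are deliberately ignored.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper, not TOra's tokeniser: splits SQL text into
// words (identifiers, keywords, numbers) and one-character symbols.
std::vector<std::string> tokenize(const std::string &sql) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < sql.size()) {
        unsigned char c = static_cast<unsigned char>(sql[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace only separates tokens
        } else if (std::isalnum(c) || c == '_') {
            std::size_t start = i;  // start of a word
            while (i < sql.size() &&
                   (std::isalnum(static_cast<unsigned char>(sql[i])) ||
                    sql[i] == '_'))
                ++i;
            tokens.push_back(sql.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, sql[i]));  // operator or punctuation
            ++i;
        }
    }
    return tokens;
}
```

Calling tokenize("select sysdate from dual;") yields the five tokens select, sysdate, from, dual and ;.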

TODO: more details required.

Parse

The parser groups the given tokens into statements and their parts in hierarchical order. All tokens matching reserved words are marked as “Keywords”.

The parser is used for a number of purposes:

  1. SQL Worksheet uses the parser to identify individual SQL statements (used when executing the current, next or all statements).
  2. Code indentation (code formatter).
  3. PLSQLEditor uses the parser result to create a code tree (for example showing the functions/procedures in a package, their parameters etc.)

Note: the purpose of the parser in TOra is limited (it is not required to check whether a given statement is correct), so it only does approximate parsing!

The parser identifies a number of statement types (as described in tosqlparse.h):

  • Block – starts after any of the following:
    • begin token,
    • declare token,
    • as/is,
    • then token if it is in a statement starting with if or case,
    • do or loop token
  • Statements
  • List – starts after opening parenthesis and ends with closing parenthesis.
  • Keyword – all tokens listed in the reserved words array.
  • Token – all tokens not in reserved words list.
  • Raw – unparsed data
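The types above could be modelled roughly as follows. This is only a sketch: the real definitions live in tosqlparse.h and may look quite different. The names NodeType, Node, isReserved and makeLeaf are hypothetical, and the reserved-word set is a tiny illustrative subset, not TOra's actual list.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical node kinds mirroring the list above.
enum class NodeType { Block, Statement, List, Keyword, Token, Raw };

struct Node {
    NodeType type;
    std::string text;            // token text for Keyword/Token/Raw nodes
    std::vector<Node> children;  // sub-nodes of Block/Statement/List nodes
};

// Case-insensitive lookup in an illustrative reserved-word subset.
bool isReserved(std::string word) {
    static const std::set<std::string> reserved = {
        "select", "from", "create", "table", "begin", "case", "when", "then"};
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return reserved.count(word) > 0;
}

// A leaf is a Keyword if its text is reserved, otherwise a plain Token.
Node makeLeaf(const std::string &text) {
    return Node{isReserved(text) ? NodeType::Keyword : NodeType::Token, text, {}};
}
```

With this subset, makeLeaf("SELECT") produces a Keyword node and makeLeaf("dual") a Token node.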

Tip: if you want to check the structure of parsed text, you can use the printstatement function (in tosqlparse.cpp). The output of that function is used in the examples below.
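To give an idea of what such a dump does, here is a minimal sketch of a recursive tree printer. It is not the printstatement function itself. The Node structure and the dumpTree name are assumptions made for illustration, and the four-space indent per level is chosen to resemble the first example below.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tree node; TOra's real statement class is in tosqlparse.h.
struct Node {
    std::string kind;            // "Statement", "Keyword", "Token", ...
    std::string text;            // empty for container nodes
    std::vector<Node> children;
};

// Writes one line per node and indents children by four spaces per level.
void dumpTree(const Node &node, std::ostream &out, int depth = 0) {
    out << std::string(depth * 4, ' ') << node.kind << ":";
    if (!node.text.empty())
        out << " " << node.text;
    out << "\n";
    for (const Node &child : node.children)
        dumpTree(child, out, depth + 1);
}
```

Passing std::cout as the stream prints the tree to the console. A std::ostringstream captures it for comparison in a test.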

Each statement also has its class identified. Class information is used later when passing the statement to Oracle. For example, the trailing semicolon must be removed from DDL/DML statements, while PL/SQL blocks must keep it. Therefore the following classes are identified in TOra:

  • ddldml – all ddl/dml statements (select, insert, create table etc.)
  • plsqlblock – all plsql blocks (named as well as anonymous)
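The semicolon rule can be sketched as below. The enum and function names are hypothetical. The source only names the two classes and states the rule, so this is an illustration of the rule rather than TOra's code.

```cpp
#include <string>

// Hypothetical names for the two classes described above.
enum class StatementClass { DdlDml, PlsqlBlock };

// Returns the text to send to the database: a DDL/DML statement loses its
// trailing semicolon, a PL/SQL block keeps it.
std::string prepareForExecution(std::string sql, StatementClass cls) {
    // Trim trailing whitespace first so the semicolon check is reliable.
    std::size_t end = sql.find_last_not_of(" \t\r\n");
    sql.erase(end == std::string::npos ? 0 : end + 1);
    if (cls == StatementClass::DdlDml && !sql.empty() && sql.back() == ';')
        sql.pop_back();
    return sql;
}
```

For example, "select sysdate from dual;" classified as DdlDml becomes "select sysdate from dual", while "begin null; end;" classified as PlsqlBlock is returned unchanged.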

Simplest example

The simplest statement example would be:

select sysdate from dual;

This would be parsed as one statement with five sub-tokens:

Statement:
    Keyword: select
    Keyword: sysdate
    Keyword: from
    Token: dual
    Token: ;

As you can see, the parser has not only identified all of this as one statement, it has also marked the keywords and left all other tokens as simple “tokens” (this information is used later in code formatting).

Multiple statement example

If there are two simple statements:

select sysdate from dual;
select sysdate from dual;

The parser would identify two statements consisting of the same sub-tokens:

Statement:
    Keyword: select
    Keyword: sysdate
    Keyword: from
    Token: dual
    Token: ;
Statement:
    Keyword: select
    Keyword: sysdate
    Keyword: from
    Token: dual
    Token: ;
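For plain SQL like this, the grouping can be approximated by closing a statement at every semicolon token. The sketch below shows only that simplified idea. The names Tokens and splitStatements are hypothetical, and the approach would not work for PL/SQL blocks, where semicolons appear inside a single top-level statement.

```cpp
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Groups a flat token list into statements; each ";" closes the current one.
// Simplification: PL/SQL blocks, which contain inner semicolons, are ignored.
std::vector<Tokens> splitStatements(const Tokens &tokens) {
    std::vector<Tokens> statements;
    Tokens current;
    for (const std::string &tok : tokens) {
        current.push_back(tok);
        if (tok == ";") {
            statements.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        statements.push_back(current);  // trailing statement without ";"
    return statements;
}
```

Feeding it the ten tokens of the two statements above returns two groups of five tokens each.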

Statement with a list

Let's parse a simple DDL statement containing some lists:

create table test(col varchar(12));

It is parsed like this:

Block:
  Statement:
      Keyword:create
      Keyword:table
      Token:test
      Token:(
  List:
      Token:col
      Keyword:varchar
      Token:(
      List:
          Token:12
      Token:)
  Statement:
      Token:)
      Token:;

Note: lists can contain other, nested lists!
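Nesting falls out naturally from a recursive approach: every opening parenthesis starts a new list, which is parsed by the same routine. The sketch below is an assumption about how this could be done, not TOra's implementation. Item and parseItems are hypothetical names, and unlike the dump above the parentheses themselves are dropped to keep the code short.

```cpp
#include <string>
#include <vector>

// Hypothetical node: either a leaf token or a list of child nodes.
struct Item {
    std::string text;         // leaf text; empty for a list
    std::vector<Item> items;  // children when this node is a list
    bool isList = false;
};

// Parses tokens starting at pos until a matching ")" or the end of input.
// Each "(" starts a nested list, so lists may contain lists.
std::vector<Item> parseItems(const std::vector<std::string> &tokens,
                             std::size_t &pos) {
    std::vector<Item> result;
    while (pos < tokens.size()) {
        const std::string &tok = tokens[pos++];
        if (tok == ")")
            break;  // closes the current list
        if (tok == "(") {
            Item list;
            list.isList = true;
            list.items = parseItems(tokens, pos);  // recurse for the inner list
            result.push_back(list);
        } else {
            Item leaf;
            leaf.text = tok;
            result.push_back(leaf);
        }
    }
    return result;
}
```

For the tokens of test(col varchar(12)); the top level holds test, one list and the semicolon. That list holds col, varchar and an inner list containing 12.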

Case statement

A PL/SQL procedure containing a CASE statement:

CREATE OR REPLACE PROCEDURE A AS
BEGIN
  CASE a
    WHEN 1 THEN NULL;
    WHEN 2 THEN NULL;
    ELSE NULL;
    END CASE;
END;

It is parsed as:

Block:
  Statement:
      Keyword:CREATE
      Keyword:OR
      Keyword:REPLACE
      Keyword:PROCEDURE
      Token:A
      Keyword:AS
  Statement:
      Keyword:BEGIN
  Block:
      Statement:
          Keyword:CASE
          Token:a
          Keyword:WHEN
          Token:1
          Keyword:THEN
      Statement:
          Token:NULL
          Token:;
      Statement:
          Keyword:WHEN
          Token:2
          Keyword:THEN
      Statement:
          Token:NULL
          Token:;
      Statement:
          Keyword:ELSE
      Statement:
          Token:NULL
          Token:;
      Statement:
          Token:END
          Keyword:CASE
          Token:;
  Statement:
      Token:END
      Token:;

Indent

Indentation uses the parser's result. It is important for indentation that the parser correctly recognises statements, blocks, keywords etc.

Block

The comment is added first (if one was attached to the statement). Then the sub-statements are looped through, and the same indent functionality is called recursively for each of them.
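A minimal sketch of this recursion is shown below. The Node fields and the indent function are hypothetical, and the four-space step per level is an assumption. TOra's real indenter handles many more cases.

```cpp
#include <string>
#include <vector>

// Hypothetical node; the field names are assumptions for this sketch.
struct Node {
    std::string comment;         // comment attached to the node, may be empty
    std::string text;            // statement text, may be empty for containers
    std::vector<Node> children;  // sub-statements of a block
};

// Emits the comment first, then the node's own text, then recurses into
// the sub-statements one level deeper.
std::string indent(const Node &node, int level = 0) {
    const std::string pad(level * 4, ' ');
    std::string out;
    if (!node.comment.empty())
        out += pad + node.comment + "\n";
    if (!node.text.empty())
        out += pad + node.text + "\n";
    for (const Node &child : node.children)
        out += indent(child, level + 1);
    return out;
}
```

A block with the comment "-- demo", the text "BEGIN" and one child "NULL;" is rendered as three lines, with the child indented by four spaces.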

List & Statement

The list and statement types are processed similarly.

Unit tests

Parser functionality is used in many parts of TOra. A small change can fix one problem and at the same time break a lot of other things. Unit tests are used to increase reliability and reduce regression testing time.

Currently the unit test code can be found in the main function of tosqlparse.cpp. Therefore, if you want to compile TOra to run the unit tests (instead of launching the application itself), you have to take main.cpp out of the CMake build. The easiest (but not optimal) way to do that is to comment out main.cpp in src/CMakeLists.txt and uncomment tosqlparsertest.cpp.

The unit test performs the following checks:

  1. A big test text containing a lot of different statements is parsed. If this fails, message boxes with error descriptions are displayed.
  2. The big test text is indented (formatted) and the result is formatted again (that is, the result of the first indentation is parsed and formatted). If the two results do not match (which means parsing and/or indentation have bugs), a corresponding message is printed on stdout.
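The second check is an idempotency test: formatting already formatted text must not change it. The sketch below shows the shape of such a check. The normalizeSpaces function is only a stand-in formatter invented for this example (it collapses whitespace), and isIdempotent is a hypothetical helper. Neither is part of TOra.

```cpp
#include <functional>
#include <sstream>
#include <string>

// Stand-in formatter for the sketch: collapses runs of whitespace into a
// single space. TOra's real indenter is far more involved.
std::string normalizeSpaces(const std::string &sql) {
    std::istringstream in(sql);
    std::string word, out;
    while (in >> word) {
        if (!out.empty())
            out += ' ';
        out += word;
    }
    return out;
}

// Formatting twice must give the same text as formatting once.
bool isIdempotent(const std::function<std::string(const std::string &)> &format,
                  const std::string &input) {
    const std::string once = format(input);
    return format(once) == once;
}
```

A formatter that keeps changing its own output, for example one that appends a space on every pass, fails this check. That is the kind of parsing or indentation bug the test is meant to catch.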
 
knowhow/parser.txt · Last modified: 2010/11/13 15:12 by Tomas Straupis