The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Marpa::R2::Semantics::Rank - How ranks are computed

Synopsis

    my $source = <<'END_OF_SOURCE';
      :discard ~ ws; ws ~ [\s]+
      :default ::= action => ::array

      Top ::= List action => main::group
      List ::= Item3 rank => 3
      List ::= Item2 rank => 2
      List ::= Item1 rank => 1
      List ::= List Item3 rank => 3
      List ::= List Item2 rank => 2
      List ::= List Item1 rank => 1
      Item3 ::= VAR '=' VAR action => main::concat
      Item2 ::= VAR '='     action => main::concat
      Item1 ::= VAR         action => main::concat
      VAR ~ [\w]+

    END_OF_SOURCE

    my @tests = (
        [ 'a',                 '(a)', ],
        [ 'a = b',             '(a=b)', ],
        [ 'a = b = c',         '(a=)(b=c)', ],
        [ 'a = b = c = d',     '(a=)(b=)(c=d)', ],
        [ 'a = b c = d',       '(a=b)(c=d)' ],
        [ 'a = b c = d e =',   '(a=b)(c=d)(e=)' ],
        [ 'a = b c = d e',     '(a=b)(c=d)(e)' ],
        [ 'a = b c = d e = f', '(a=b)(c=d)(e=f)' ],
    );

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$source } );
    for my $test (@tests) {
        my ( $input, $output ) = @{$test};
        my $recce = Marpa::R2::Scanless::R->new(
            {
                grammar        => $grammar,
                ranking_method => 'high_rule_only'
            }
        );
        $recce->read( \$input );
        my $value_ref = $recce->value();
        if ( not defined $value_ref ) {
            die 'No parse';
        }
        push @results, ${$value_ref};
    }
    sub flatten {
        my ($array) = @_;

        # say STDERR 'flatten arg: ', Data::Dumper::Dumper($array);
        my $ref = ref $array;
        return [$array] if $ref ne 'ARRAY';
        my @flat = ();
      ELEMENT: for my $element ( @{$array} ) {
            my $ref = ref $element;
            if ( $ref ne 'ARRAY' ) {
                push @flat, $element;
                next ELEMENT;
            }
            my $flat_piece = flatten($element);
            push @flat, @{$flat_piece};
        }
        return \@flat;
    }

    sub concat {
        my ( $pp, @args ) = @_;

        # say STDERR 'concat: ', Data::Dumper::Dumper(\@args);
        my $flat = flatten( \@args );
        return join '', @{$flat};
    }

    sub group {
        my ( $pp, @args ) = @_;

        # say STDERR 'comma_sep args: ', Data::Dumper::Dumper(\@args);
        my $flat = flatten( \@args );
        return join '', map { +'(' . $_ . ')'; } @{$flat};
    }

Description

This document describes rule ranking. Rule ranking plays a role in parse ordering, which is described in a separate document.

Overview

Rule ranking takes place at the nodes of the parse forest. In the context of rule ranking, the parse forest nodes are called choicepoints. Every choicepoint has one or more choices. Choices that are from the same choicepoint are called siblings.

Choices are ordered by their choice key. The choice key consists of two subkeys: the symbolic subkey and the null-variant subkey. The symbolic subkey is the major subkey, and the null-variant subkey is the minor subkey, so that the order is by null-variant within symbolic rank. The symbolic subkey is so called because it is based on the symbols of the choice. The null-variant subkey is based on the location of the choice's nulled symbol instances. Choicepoints, choices and subkeys are described in much more detail below.

Ranking methods

SLIF recognizer objects have a ranking_method named argument, whose value can be the name of a ranking method, or "none", indicating that the default ranking method is to be used.

The rule ranking method

The rule ranking method ranks choices according to their choice keys. All choices are included in the ordering.

The high_rule_only ranking method

The high_rule_only ranking method is similar to the rule ranking method, except that, at every choicepoint, it discards all of the choices which have a rank lower than that of the highest ranked choice. The high_rule_only ranking method can reduce the ambiguity of a parse, but it does not necessarily do so. This is because a choice with the highest choice key can have one or more siblings with the same choice key.

The symbolic subkey

Every rule alternative has a numeric symbolic rank. A rule's rank can be specified using the the rank adverb argument for that RHS alternative. Rule ranks must be integers. They may be negative. If no numeric rank is specified, the numeric rank is 0.

Rule alternatives may be part of a single rule in the DSL -- for example, a prioritized rule. Lexical order within a DSL rule makes no difference when ranking rule alternatives. For example, it makes no difference if two rule alternatives come from the same prioritized rule; or from two different prioritized rules.

The null-variant subkey

Some rules have a RHS which contains proper nullables: symbols which may be nulled, but which are not nulling symbols. (Nulling symbols are symbols which are always nulled.)

When a rule alternative contains proper nullables, each instance of that rule creates a nulling variant. A nulling variant is a specific pattern of null and non-null symbols in a rule instance's RHS. In many cases, this creates an ambiguity -- different nulling variants can match the same substring in the input. In ambiguous parsings of this kind, some applications may want to rank nulling variants that start with non-null symbols higher. Other applications may want to do the opposite -- to rank nulling variants that start with null symbols higher.

The null-ranking adverb for RHS alternatives specifies which nulling variants are ranked high or low. If the null-ranking is "low", then the closer a nulling variant places its visible (non-null) symbols to the start of the rule instance, the higher it ranks. A null ranking of low is the default. If the null-ranking is "high", then the closer a nulling variant places its null symbols to the start of the rule instance, the higher it ranks. In ranking nulling variants with more than one proper nullable, major-to-minor is left-to-right.

Choicepoints

Ranking is done at the or-nodes of our parse forest, We also call our parse forest a bocage. In the context of ranking, we usually refer to the or-nodes as choicepoints.

An or-node is similar to an Earley item: it has a dotted rule, an origin, a current location, and a set of confluences. The confluences, in turn, each can have a mainstem and a tributary. To refresh our knowledge about Earley items and confluences, we can refer to our document on the Marpa algorithm.

A confluence is a reason for the choicepoint to be in the parse forest. Every choicepoint has at least one reason to be in the parse forest, or it would not be there. Therefore every choicepoint has at least one confluence. In an ambiguous parse, one or more choicepoints will have more than one confluence. A parse tree is created by selecting a single confluence (or reason to exist) at each choicepoint.

In the context of ranking and choicepoints, confluences are more often called choices. If there is only one choice at a choicepoint, we say that the choicepoint is a singleton choicepoint. If a choicepoint is not a singleton, we say that it is an ambiguous choicepoint.

Choices with the same choice key are ranked arbitrarily. If all the choices of a choicepoint have the same choice key, we say that that the choicepoint is trivial. A choicepoint is non-trivial if and only if it is not trivial.

Every singleton choicepoint is trivial, but some trivial choicepoints are not singletons. This is because sibling choices can share the same choice key.

Within a non-trivial choicepoint, the mainstems and the LHS of the rules in tributaries must always be the same. (For the explanation of why this is the case, see "Explanations".) Therefore, when the symbolic subkeys of two choices within the same non-trivial choicepoint differ, they will differ only in the RHS of the rules of their tributaries.

The predot symbol of a choicepoint is the predot symbol of its dotted rule. The LHS of the tributary rules of a choicepoint's choices is always the same as the predot symbol of that choicepoint.

Iteration of the parse forest is depth-first, left-to-right. An detailed explanation of the parse forest interation is in "Explanations".

Examples

Our examples in this document will look at the ranked grammar in the synopsis, and at variations of it.

Longest highest, version 1

The DSL in the synopsis ranks its items "longest highest". Here "items" are represented by the symbols, <Item3>, <Item2> and <Item1>. The "longest" choice is considered to be the one with the most lexemes. Working this idea out for this grammar, we see that the items should rank, from highest to lowest: <Item3>, <Item2> and <Item1>.

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top ::= List action => main::group
  List ::= Item3 rank => 3
  List ::= Item2 rank => 2
  List ::= Item1 rank => 1
  List ::= List Item3 rank => 3
  List ::= List Item2 rank => 2
  List ::= List Item1 rank => 1
  Item3 ::= VAR '=' VAR action => main::concat
  Item2 ::= VAR '='     action => main::concat
  Item1 ::= VAR         action => main::concat
  VAR ~ [\w]+

To see how ranking works in this grammar, we note that, for a choicepoint to be non-trivial, it must have a predot symbol that is the LHS of more than one rule. Therefore, in the above grammar, a choicepoint must have one of these dotted rules:

     Top ::= List .
     List ::= List . Item3
     List ::= List . Item2
     List ::= List . Item1

The tributary of these choicepoints may be any one of

     List ::= Item3 .
     List ::= Item2 .
     List ::= Item1 .
     List ::= List Item3 .
     List ::= List Item2 .
     List ::= List Item1 .

A little effort shows that <Item2> never participates in an ambiguity, so that these are the possible tributaries in non-trivial choicepoints:

     List ::= Item1 .
     List ::= List Item1 .
     List ::= Item3 .
     List ::= List Item3 .

Shortest highest, version 1

Here we see the grammar of the synopsis, reworked for a "shortest highest" ranking. "Shortest highest" is the reverse of "longest highest".

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top ::= List action => main::group
  List ::= Item3 rank => 1
  List ::= Item2 rank => 2
  List ::= Item1 rank => 3
  List ::= List Item3 rank => 1
  List ::= List Item2 rank => 2
  List ::= List Item1 rank => 3
  Item3 ::= VAR '=' VAR action => main::concat
  Item2 ::= VAR '='     action => main::concat
  Item1 ::= VAR         action => main::concat
  VAR ~ [\w]+

Here are what the results will look like for "shortest highest".

    my @tests = (
        [ 'a',                 '(a)', ],
        [ 'a = b',             '(a=)(b)', ],
        [ 'a = b = c',         '(a=)(b=)(c)', ],
        [ 'a = b = c = d',     '(a=)(b=)(c=)(d)', ],
        [ 'a = b c = d',       '(a=)(b)(c=)(d)' ],
        [ 'a = b c = d e =',   '(a=)(b)(c=)(d)(e=)' ],
        [ 'a = b c = d e',     '(a=)(b)(c=)(d)(e)' ],
        [ 'a = b c = d e = f', '(a=)(b)(c=)(d)(e=)(f)' ],
    );

Longest highest, version 2

The previous examples have shown the rule involved in parse ranking in "spelled out" form. In fact, a more compact form of the grammar can be used, as shown below for "longest highest" ranking.

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top ::= List action => main::group
  List ::= Item rank => 1
  List ::= List Item rank => 0
  Item ::= VAR '=' VAR rank => 3 action => main::concat
  Item ::= VAR '='     rank => 2 action => main::concat
  Item ::= VAR         rank => 1 action => main::concat
  VAR ~ [\w]+

Shortest highest, version 2

This is the grammar for "shortest highest", in compact form:

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top ::= List action => main::group
  List ::= Item rank => 0
  List ::= List Item rank => 1
  Item ::= VAR '=' VAR rank => 1 action => main::concat
  Item ::= VAR '='     rank => 2 action => main::concat
  Item ::= VAR         rank => 3 action => main::concat
  VAR ~ [\w]+

Reimplementing ranking as pure BNF

It is generally better to write a grammar as "pure BNF", instead of using ranking. The advantage of using pure BNF is that you can more readily determine exactly what language it is that you are parsing: Ranked grammars make look easier to analyze at first glance, but the more you look at them the more tricky you realize they are.

The pure BNF reimplementations below rely on an observation: The parse string becomes easier to analyze when we think in terms of fenceposts, rather than in term of the location of the lexemes. Fenceposts are either initial, final or medial. The initial fencepost is the position before the first lexeme. The final fencepost is the position after the last lexeme. A medial fencepost is the position between two lexemes. We can call a fencepost a VAR-bound if it is a either an initial fencepost, a final fencepost, or a medial fencepost that occurs between two <VAR> lexemes. We can then visualize the input string as a sequence of "VAR-bounded" substrings.

Longest highest as pure BNF

Here is the "longest highest" example, reimplemented as BNF:

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top            ::= Max_Boundeds action => main::group
  Top            ::= Max_Boundeds Unbounded action => main::group
  Top            ::= Unbounded action => main::group
  Max_Boundeds   ::= Max_Bounded+
  Max_Bounded    ::= Eq_Finals Var_Final3
  Max_Bounded    ::= Var_Final
  Unbounded      ::= Eq_Finals
  Eq_Finals      ::= Eq_Final+
  Var_Final      ::= Var_Final3 | Var_Final1
  Var_Final3     ::= VAR '=' VAR action => main::concat
  Eq_Final       ::= VAR '='     action => main::concat
  Var_Final1     ::= VAR         action => main::concat
  VAR ~ [\w]+

Shortest highest as pure BNF

We can also reimplement the "shortest highest" example as BNF. One of the advantages of a BNF (re)implementation, is that it often clarifies the grammar. For example in this case, we note that the DSL rule

  Var_Final3     ::= VAR '=' VAR action => main::concat

is, in fact, never used. We therefore omit it:

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top            ::= Max_Boundeds action => main::group
  Top            ::= Max_Boundeds Unbounded action => main::group
  Top            ::= Unbounded action => main::group
  Max_Boundeds   ::= Max_Bounded+
  Max_Bounded    ::= Eq_Finals Var_Final
  Max_Bounded    ::= Var_Final
  Unbounded      ::= Eq_Finals
  Eq_Finals      ::= Eq_Final+
  Eq_Final       ::= VAR '='     action => main::concat
  Var_Final      ::= VAR         action => main::concat
  VAR ~ [\w]+

Explanations

Bocages

When ranking, the logic traverses each node of a parse forest, which we call a "bocage". Our parse forests closely resemble Elizabeth Scott's SPPF's. See "Scott 2008" in Marpa::R2::Advanced::Bibliography.

Only confirmations are choicepoints

An Earley item (and a choicepoint) is a confirmation if and only if its dot position is not zero. Confirmations are either scanned Earley items or reductions. Non-confirmations (predictions and the start Earley item) are redundant information from the point of of the bocage. No or-nodes are created for non-confirmations and they can never be choicepoints.

Since choicepoints are always either scanned Earley items or reductions, all the choices of every choicepoint have a well-defined mainstem and a well-defined tributary.

Choicepoint mainstem is unique

Since every choicepoint has exactly one dotted rule, exactly one current location, and exactly one origin, every choicepoint has exactly one mainstem. All the choices for that choicepoint share that mainstem. Because every choicepoint has exactly one mainstem, every choicepoint has exactly one predot symbol.

Scanned choicepoints are always trivial

The tributary of a scanned choicepoint is a token, which has a value and a token symbol. The token value plays no role in ranking. The token symbol is always the same as the predot symbol.

The token symbol can be assigned a rank, but this is pointless: Since there is only one predot symbol in a choicepoint, all token symbols will be equal to the predot symbol, and therefore all token symbols of sibling choices will be the same.

Because every symbol in a choice of a scanned choicepoint is the same, the symbolic subkeys of all the sibling choices of a scanned choicepoint must be identical to each other. Since there is no BNF rule for the tributary of a scanned choicepoint, the choices of a scanned choicepoint also have no null-variants. This means that the choice keys of the sibling choices in a scanned choicepoint are always identical to each other. Therefore all choices of scanned choicepoints rank the same, and all scanned choicepoints are trivial.

While a scanned choicepoint is never non-trivial, a scanned choicepoint may be ambiguous if variable length tokens are in use. Usually variable length tokens are not in use, and all scanned choicepoint are both trivial and singletons.

All non-trivial choicepoints are reductions

Since every scanned choicepoint must be trivial, every non-trivial choicepoint will be a reduction. If a choicepoint is a reduction, all of its tributaries will be other choicepoints, and each tributary will have a dotted rule associated with it. The dotted rules of these tributaries will always be completions. The LHS of the rule of these tributaries will always be the predot symbol of the choicepoint. This means that the rules of the tributaries of the sibling choices in a reduction choicepoint must share the same LHS. Rules of the tributaries of choices in a reduction choicepoint can have different RHS's, however, and therefore can differ in symbolic rank.

Ambiguous choicepoints

All non-trival choicepoints are ambiguous, but some trivial choicepoints are also ambiguous. This is because choices at choicepoints may differ not just symbolically, but in the way in which the symbols divide up the input string -- the way in which they "factor" the input. Choices which have the same symbols, but which are factored differently, will have the same rank. For more about symbolic choices (also called symches) and factorings, see "Ambiguity: factoring versus symches" in Marpa::R2::Glade.

Iterating the parses

Every node of a parse tree corresponds to a choice from a choicepoint of the parse forest. An initial parse subtree is the subtree formed from an choicepoint-rooted subforest by taking all the first choices of its choicepoints.

The initial parse tree is the initial parse subtree formed from the subforest whose root in the root of the forest. In other words, the initial parse tree is the tree formed by taking the first choices of the entire forest.

The first parse tree in the iterator of a parse forest is the initial parse tree.

For the second and later parse trees, the next parse tree is found by traversing the current parse tree from the bottom up, left to right. Every node of the parse tree will correspond to a choicepoint of the parse forest, and a choice within that choicepoint. If the current choice of its choicepoint is the last choice of that choicepoint, we say that that node of the parse tree is exhausted. If a parse tree node is not exhausted, we say that is active.

As the traversal of the parse tree encounters exhausted nodes, it prunes them from the tree. The traversal ends when it encounters an active parse tree node. We call that active parse tree the iteration parse tree node.

Once the iteration parse tree node has been found, it is replaced with a new tree node which corresponds to the next choice of the choicepoint of the iteration parse tree node.

The remaining tree will have missing subtrees, due to the replacement of the iteration parse tree node, and due to the pruning of exhausted nodes. These subtrees are replaced with initial parse subtrees.

When the entire parse tree is traversed without finding an active parse tree node, there are no more parse trees. In that case, the parse forest iterator is said to be exhausted.

Motivation

We note that ranking is only by direct tributaries. It might reasonably be asked, why not, at least in the case of a tie, look at tributaries of tributaries? Or why not resolve ties by looking at tributaries of mainstems?

Marpa's built-in rule ranking was chosen as the most powerful system that could be implemented with effectively zero cost. Ranking by direct tributaries uses only information that is quickly and directly available, so that its runtime cost is probably not measurable.

The complexity of the specification was also an issue. If indirect tributaries are taken into account, we would need to specify which tributaries, and under what circumstances they are used. Only tributaries of tributaries? Or tributaries of mainstems? Depth-first, or breadth-first? Only in case of ties, or using a more complex metric? To arbitrary depth, or using a traversal that is cut off at some point?

If we do ranking by direct tributaries only, that makes the answers to the above questions as simple as it can be. When we try to analyze the behavior of grammars that use rankings, the importance of having a specification that is (relatively) simple becomes clear.

There are apps whose requirements justify extra overhand and extra complexity. For these apps, Marpa's ASF's allow full generality in ranking.

Copyright and License

  Copyright 2022 Jeffrey Kegler
  This file is part of Marpa::R2.  Marpa::R2 is free software: you can
  redistribute it and/or modify it under the terms of the GNU Lesser
  General Public License as published by the Free Software Foundation,
  either version 3 of the License, or (at your option) any later version.

  Marpa::R2 is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  Lesser General Public License for more details.

  You should have received a copy of the GNU Lesser
  General Public License along with Marpa::R2.  If not, see
  http://www.gnu.org/licenses/.