How to parse LuaDoc param comments using Haxe's hxparse

danielo515 · March 6, 2023, 10:24am

I’m trying to parse some specific lines of LuaDoc using hxparse.
Those lines look like this: ---@param name type description, where the type can also be a sum type, so in practice it can be string|number with any number of alternatives.
I already stripped out the prefix, so what is left to parse is just name type description, and it is very context-sensitive.
I have defined the following token enums, one for the regular parsing and another specifically for types.

The regular one

enum DocToken {
  Identifier(name:String);
  Description(text:String);
  DocType(type:TypeToken);
  ArrayMod;
  OptionalMod;
  Comma;
  CurlyOpen;
  CurlyClose;
  SquareOpen;
  SquareClose;
  Lparen;
  Rparen;
  TypeOpen;
  TypeClose;
  Pipe;
  Spc;
  EOL;
}

The one specific for types

enum TypeToken {
  Function;
  Number;
  String;
  Table;
  Boolean;
  Nil;
}

The problem arises when I try to switch on the different combinations of DocType(t) with or without pipes.
Here are the lexer rules:

  static var ident = "[a-zA-Z_][a-zA-Z0-9_]*";

  public static var desc = @:rule [
    "[^\n]*" => Description(lexer.current.ltrim()),
    "" => EOL
  ];
  public static var paramDoc = @:rule [
    ident => {final name = lexer.current.ltrim().rtrim(); Identifier(name);},
    "" => EOL,
  ];
  public static var typeDoc = @:rule [
    // " " => Spc,
    " " => lexer.token(typeDoc),
    "," => lexer.token(typeDoc),
    "\\[\\]" => ArrayMod,
    "\\?" => OptionalMod,
    "<" => TypeOpen,
    ">" => TypeClose,
    "{" => CurlyOpen,
    "}" => CurlyClose,
    "\\[" => SquareOpen, // escaped: a bare [ would open a character class
    "\\]" => SquareClose,
    "\\(" => Lparen,
    "\\)" => Rparen,
    "\\|" => Pipe,
    "number" => DocType(TypeToken.Number),
    "string" => DocType(TypeToken.String),
    "table" => DocType(TypeToken.Table),
    "boolean" => DocType(TypeToken.Boolean),
    "function" => DocType(TypeToken.Function),
    "fun" => DocType(TypeToken.Function),
    "nil" => DocType(Nil),
    "" => EOL,
    // ident => throw 'Unknown type "${lexer.current}"',
  ];

My first problem appears when I try to parse the 3 main elements. Because I have 3 different rulesets, I can’t go as blindly as:

      case [Identifier(name), SPC, DocType(t), SPC, Description(d)]:

I have to first match on identifier, then check if the next element is EOL, and in that case return, and then select the next ruleset and continue parsing:

  public function parse() {
    return switch stream {
      case [Identifier(name)]:
        stream.ruleset = LuaDocLexer.typeDoc;
        if (this.peek(1) == EOL)
          return {name: name, type: null, description: null};
        try {
          final t = parseType();
          stream.ruleset = LuaDocLexer.desc;
          final text = parseDesc();
          return {name: name, type: t, description: text};

This is not exactly how I wanted it, but at least it works.
The problem gets even worse later in the parseType method because, as soon as I try to put a Pipe between two types, the compiler complains that the cases below that one are unused. Here:

  public function parseType() {
    return switch stream {
      case [DocType(Table), TypeOpen, t = parseTypeArgs()]:
        'Table<$t>';
      case [DocType(t)]:
        t + "";
      case [DocType(t), Pipe, t2 = parseEither()]: 'Either<$t, $t2>'; // Here it says this case is unused, and it never matches
    }
  }
  public function parseEither() {
    return switch stream {
      case [DocType(t), Pipe, t2 = parseEither()]: 'Either<$t, $t2>';
      case [DocType(t)]: '$t'; // Here also says the case is unused
    };
  }

I am starting to think that I am missing some key concept. I tried including SPC as a token, but I’m not sure how to match with it without having an explosion of cases where I should account for extra spaces.
If I use concrete types, for example like this:

case [DocType(Number), Pipe, t2 = parseEither()]: 'Either<Number, $t2>';

then it is not a problem, but I really want to be able to combine any two types with the Pipe operator.

danielo515 · March 6, 2023, 6:02pm

Ok, I think the key part is that it only uses the first element to differentiate between cases. So each case needs to start with something different, and then you cascade into another switch on the stream or call another parser.

The following works, and allows " name", "name" and "name rest":

      case [SPC]: parse();
      case [Identifier(name)]:
        switch stream {
          case [EOL]:
            return {name: name, type: null, description: null};
          case [SPC]:
            stream.ruleset = LuaDocLexer.typeDoc;
            try {

So for my previous example, it seems that this is the proper (or at least working) way of proceeding:

  public function parseType() {
    return switch stream {
      case [DocType(Table), TypeOpen, t = parseTypeArgs()]:
        'Table<$t>';
      case [DocType(t)]:
        switch stream {
          case [Pipe, t2 = parseEither()]: 'Either<$t, $t2>';
          case _: '$t';
        }
    }
  }

danielo515 · March 10, 2023, 3:50pm

To work correctly with hxparse there are a couple of key concepts that I was missing. Thankfully its author clarified them, so I will share how I am approaching this task now.

First, it’s important to know that hxparse builds recursive descent parsers, so reading about how those work was key to using hxparse properly.
A second key concept is that not every element in a list of matches is considered when choosing a case. Only the first element of each case in a switch stream decides whether that case is taken. All the elements after it are requirements, and the parser will raise an error if they fail to match. So for example, here:

      case [Identifier(name), SPC, DocType(t), SPC, Description(d)]:

Only the first Identifier(name) is considered for the match, and the rest of the elements are a requirement. It’s worth mentioning that every switch stream is processed by a macro that rewrites it, so the above code will run more or less like this:

peek(0) (this does not consume the token)
Match it against the first element of the first case
If it does not match, try the next case; if no case matches, the switch fails
If it matches, consume the token (junk()); from here on, every unmet requirement throws an error

That’s why such a statement fails if the identifier is the only element: the next token will be EOL, not SPC. It will also fail if there is no description, and so on.
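To make the commit-on-first-token behaviour concrete, here is a minimal hand-written sketch of roughly what happens for a case like [Identifier(name), SPC, DocType(t)]. It does not use hxparse at all: the Tok enum, the token array, and the helpers peek, junk, expectSpace and expectType are hypothetical stand-ins for illustration, and the real macro output will differ in its details.

```haxe
enum Tok {
  Identifier(name:String);
  SPC;
  DocType(t:String);
  EOL;
}

class Main {
  static var tokens:Array<Tok>;
  static var pos = 0;

  // Look at a token without consuming it; past the end we keep returning EOL.
  static function peek(n:Int):Tok
    return pos + n < tokens.length ? tokens[pos + n] : EOL;

  // Consume one token.
  static function junk():Void
    pos++;

  // A requirement: anything other than SPC is an error at this point.
  static function expectSpace():Void {
    switch peek(0) {
      case SPC: junk();
      case other: throw 'unexpected $other, wanted SPC';
    }
  }

  // A requirement: anything other than a type is an error at this point.
  static function expectType():String {
    return switch peek(0) {
      case DocType(t): junk(); t;
      case other: throw 'unexpected $other, wanted a type';
    }
  }

  // Approximation of: case [Identifier(name), SPC, DocType(t)]: '$name: $t'
  static function parse():String {
    return switch peek(0) {
      case Identifier(name):
        junk(); // the first token matched, so we are committed to this case
        expectSpace();
        '$name: ${expectType()}';
      case other:
        throw 'no case starts with $other';
    }
  }

  public static function run(input:Array<Tok>):String {
    tokens = input;
    pos = 0;
    return try parse() catch (e:String) 'error: $e';
  }

  static function main() {
    trace(run([Identifier("x"), SPC, DocType("number")])); // x: number
    trace(run([Identifier("x")])); // error: unexpected EOL, wanted SPC
  }
}
```

The second call shows the failure described above: the identifier selects the case, and the missing space then turns into an error instead of falling through to another case.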
The way to handle these optional cases is to first match the mandatory first element, and then, in a nested switch, match the cases you expect to follow it, like this:

    return switch stream {
      case [SPC]: parse(); // discard leading spaces
      case [Identifier(name)]:
        switch stream {
          // The Identifier was matched; now handle the cases that can follow it
          case [EOL]: // no type or description, finish here
            return {name: name, type: "Any", description: ""};
          case [SPC]: // a space means there must be a type after this
            stream.ruleset = LuaDocLexer.typeDoc;
            try {
              Log.print("About to parse types");
              final t = parseType();

The compiler warnings I was getting about cases that match several times against the same item make sense now: the code is rewritten to match only the first element in the list, and the rest is mandatory.
So this code:

    return switch stream {
      case [DocType(t), Pipe, t2 = parseEither()]: 'Either<$t, $t2>';
      case [DocType(t)]: '$t'; // Here also says the case is unused
    };

will be rewritten (conceptually) like this:

    return switch stream {
      case DocType(t): 
          // forcefully look for a Pipe and fail if not
          // t2 = parseEither() run another parser
          'Either<$t, $t2>';
      case DocType(t): '$t'; //this is not  used
    };

Written like that it is now obvious why the compiler was complaining the second case was not being used.
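The usual way around this is to left-factor the grammar: match the shared first token once, then decide on what follows. Below is a minimal self-contained sketch of that idea for the type | type | ... case. It is written by hand over a plain token array instead of hxparse, so the Tok enum and the peek and run helpers are assumptions made only for this illustration.

```haxe
enum Tok {
  DocType(name:String);
  Pipe;
  EOL;
}

class Main {
  static var tokens:Array<Tok>;
  static var pos = 0;

  // Look at the next token without consuming it; past the end we return EOL.
  static function peek():Tok
    return pos < tokens.length ? tokens[pos] : EOL;

  // type := DocType ( Pipe type )?
  static function parseType():String {
    return switch peek() {
      case DocType(t):
        pos++; // the shared first token is matched exactly once
        switch peek() {
          case Pipe:
            pos++;
            'Either<$t, ${parseType()}>'; // recurse for the right-hand side
          case _:
            t; // no pipe: leave the token for the caller
        }
      case other:
        throw 'expected a type, got $other';
    }
  }

  public static function run(input:Array<Tok>):String {
    tokens = input;
    pos = 0;
    return parseType();
  }

  static function main() {
    trace(run([DocType("string")])); // string
    trace(run([DocType("string"), Pipe, DocType("number"), Pipe, DocType("nil")]));
    // Either<string, Either<number, nil>>
  }
}
```

Because the recursion happens only after a Pipe has been seen, a single function covers both the plain type and the sum type, and no case is shadowed by another one starting with the same token.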

Last but not least, this grammar is sensitive to spaces, because they separate the sections. So spaces need to be emitted as a token, and the parser has to handle where a space makes sense and where it does not.

R32 · March 10, 2023, 5:25pm

I don’t know much about parsers, but I’m guessing SPC shouldn’t appear in the parser’s stream matching; SPC should be eliminated by the lexer.

danielo515 · March 13, 2023, 11:51am

If the grammar is sensitive to spaces, a space is just another token to deal with. There is no special rule about which characters you should or should not ignore.
Or can you think of another mechanism to deal with the fact that the parts of a doc comment are separated by spaces? I’m open to any suggestion that improves it.