What is this?

This is a library for building TextMate grammars in a maintainable way. It has been the backbone of the Better C++ Syntax extension, and the beta version of this library is the backbone of Better Docker syntax, Better Perl syntax, and Better Shell syntax.

What makes it more maintainable?

The full power of a programming language
- The first problem is that grammars themselves are static: no variables and no functions, which means code duplication is rampant. The most basic feature of this library is giving you a full programming language (Ruby) to develop grammars with.
- Why Ruby? Textmate regex is based on Ruby regex. (But there are also 100 other reasons Ruby is perfect for this library.)
Truly modular patterns
- This was so much work to implement
- By default, patterns effectively can't be modular/composable, because the only way to give part of a pattern a name is through a number (a capture group number). When you take PatternAAA and put it inside PatternBBB, those numbers change: they're sequential, and PatternBBB has groups of its own that offset every group number in PatternAAA. This library solves the problem by doing all the heavy lifting of calculating/tracking what the capture numbers will be offset by, so that each syntax name is attached to the correct group. You never have to deal with capture group numbers again.
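
To make the offset problem concrete, here is a minimal plain-Ruby sketch. It does not use this library at all; `group_count`, `prefix` and `range` are hypothetical names made up for the illustration, and the bookkeeping shown is only a rough picture of the kind of tracking described above, not the library's actual implementation.

```ruby
# Count numbered capture groups: "<regex>|" always matches the empty string,
# and MatchData#captures has one slot per group (nil when the group is unused).
def group_count(regex)
  Regexp.new("#{regex.source}|").match("").captures.length
end

range  = /(\d+)-(\d+)/   # on its own: start is group 1, end is group 2
prefix = /(\w+):/        # contributes one group of its own

# Putting `range` after `prefix` shifts its groups by group_count(prefix).
combined = Regexp.new(prefix.source + range.source)
offset   = group_count(prefix)

m = combined.match("id:10-20")
range_start = m[1 + offset]   # what used to be group 1 of `range`
range_end   = m[2 + offset]   # what used to be group 2 of `range`
puts "offset=#{offset} start=#{range_start} end=#{range_end}"
```

Doing this by hand for every nested pattern is exactly the chore that gets error-prone once patterns are reused in several places.
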
No more double escaping regex
- Textmate uses regex, which has its own escape patterns.
  But grammars are written in JSON/XML/tmLanguage formats which have a different set of escape patterns.
  Double escaping regex inside a JSON string is an absolute nightmare
- This library fixes that because Ruby has regex literals, and also single-quoted string literals (which behave much like raw strings). So you can finally just write what you mean (and you'll actually have syntax highlighting!)
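
As a small illustration of the double-escaping problem, the sketch below uses only Ruby's standard `json` library (not this library). The `"match"` key mirrors the tmLanguage examples in this document; the variable names are made up for the example.

```ruby
require "json"

# A regex literal: one escaped backslash followed by "any character".
escape = /\\./

# Regexp#source is the pattern text exactly as written between the slashes:
# three characters (backslash, backslash, dot).
source = escape.source

# Serializing to JSON escapes every backslash a second time,
# so the two backslashes in the pattern become four in the file.
json = JSON.generate({ "match" => source })
puts source
puts json
```

In the Ruby file you only ever read and write the first form; the second form is what ends up in the generated grammar.
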
Regex has a more-readable option
- instead of /(?<=thing)otherThing/ you can write lookBehindFor(/thing/).then(/otherThing/)
- instead of /a?/ you can write maybe(/a/)
- instead of /(a|b|c)/ you can write oneOf([ /a/, /b/, /c/ ])
- etc
- see here for some more examples
- readable regex has a LOT of support, including advanced tools like backreferences and atomic capture groups. Sadly we haven't had time to document all of the features yet.
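
For readers less fluent in regex, this plain-Ruby sketch (no library calls) shows what the raw forms on the left-hand side of the list above actually do. The `\A`/`\z` anchors and the sample strings are additions made for the demonstration.

```ruby
# Each entry pairs a description with the result of a plain Ruby regex check.
checks = {
  "lookbehind, preceded by 'thing'" => /(?<=thing)otherThing/.match?("thingotherThing"),
  "lookbehind, nothing before it"   => /(?<=thing)otherThing/.match?("otherThing"),
  "optional 'a' absent"             => /\Aba?\z/.match?("b"),
  "optional 'a' present"            => /\Aba?\z/.match?("ba"),
  "optional 'a' twice"              => /\Aba?\z/.match?("baa"),
  "alternation accepts 'b'"         => /\A(a|b|c)\z/.match?("b"),
  "alternation rejects 'd'"         => /\A(a|b|c)\z/.match?("d"),
}
checks.each { |label, result| puts "#{label}: #{result}" }
```
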

Does this make it better than Tree-sitter parsers?

NO.

Tree-sitter is vastly -- categorically -- superior in parsing capability, runtime efficiency, maintainability, and expressiveness-of-output. Sadly, some editors only support Textmate grammars, and this tool is for them.

How do I use this?

Here's a quick run down.

gem install ruby_grammar_builder

Note: this gem has not been tested with Ruby 3

require "ruby_grammar_builder"

# create a new grammar
grammar = Grammar.new(
    name: "C++",
    scope_name: "source.cpp",
    fileTypes: [
        "cc",
        "cpp",
        "cp",
        "cxx",
        "c++",
        "C",
        "h",
        "hh",
        "hpp",
        "h++"
    ],
    version: "",
    information_for_contributors: [
        "This json file was auto generated by a much-more-readable ruby file",
        "(e.g. don't edit it directly)",
    ],
)

# create a pattern for a keyword
grammar[:incorrect_misc_keyword] = Pattern.new(
    match: /misc/,
    tag_as: "keyword.other.misc",
    # note: don't try matching newlines because Patterns can't match more than one line
    # (this is NOT a library limitation, but a limitation of the Textmate engines that code editors use)
    
    # below is both documentation e.g. "hey this is what the pattern should/shouldn't do"
    # but it is also a unit test that actually makes sure those things are true
    should_fully_match: [ "misc" ],
    should_partial_match: [ "misc.", "= misc", ],
    should_not_partial_match: [ "miscc", "Misc", "_misc" ],
)
# NOTE: ^ that will throw an error because it fails one of the test cases
grammar[:correct_misc_keyword] = Pattern.new(
    match: @word_boundary.then(/misc/).then(@word_boundary),
    # there are a handful of other helpers like @word_boundary, see: https://github.com/jeff-hykin/ruby_grammar_builder/blob/master/documentation/patterns.md
    tag_as: "keyword.other.misc",
    # note: don't try matching newlines because Patterns can't match more than one line
    # (this is NOT a library limitation, but a limitation of the Textmate engines that code editors use)
    
    # below is both documentation e.g. "hey this is what the pattern should/shouldn't do"
    # but it is also a unit test that actually makes sure those things are true
    should_fully_match: [ "misc" ],
    should_partial_match: [ "misc.", "= misc", ],
    should_not_partial_match: [ "miscc", "Misc", "_misc" ],
)

# create a pattern-range for something like string quotes
grammar[:string] = PatternRange.new(
    tag_as: "string.quoted.double",
    start_pattern: Pattern.new(
        match: /"/,
        tag_as: 'punctuation.definition.string'
    ),
    end_pattern: Pattern.new(
        match: /"/,
        tag_as: 'punctuation.definition.string'
    ),
    includes: [
        :escape_pattern,
        # even though we haven't created :escape_pattern yet
        # that is okay, because we're about to
        # (this is a necessary feature for recursive patterns that include themselves)
    ],
)
grammar[:escape_pattern] = Pattern.new(
    match: /\\./,
    tag_as: "constant.character.escape",
)


# 
# add them to "top level"
# 
grammar[:$initial_context] = [
    :correct_misc_keyword, # <- first tries to find the keyword
    :string,               # <- if that fails, it tries to find a string pattern
]


# 
# export to a file
# 
grammar.save_to(
    syntax_name: "demo_syntax",
    syntax_dir: "./syntaxes",
    tag_dir: "./demo_syntax",
)

Here's an example of modular patterns:

quote = Pattern.new(
    match: /"/,
    tag_as: "quote",
)

smalltalk = Pattern.new(
    match: /blah\/blah\/blah/,
    tag_as: "string.smalltalk",
)

phrase = Pattern.new(
    match: Pattern.new(/the man said: /).then(quote).then(smalltalk).then(quote),
    tag_as: "other.phrase",
)

# NOTE: PatternRanges currently can't be put inside of Patterns. The textmate engine doesn't support this, and we have not found a good enough workaround yet

Using the generated grammar

If you want to use the grammar inside VS Code:

- watch a quick tutorial on how to publish a VS Code extension
- and here is an example of what needs to be inside the package.json when publishing the output from this library

What names should I use for `tag_as:`?

Please, please, please don't invent your own names if you don't have to. See this guide for coming up with names. Before that document there was effectively no standard, and it made theme-ing very hard.

Where's a more full example?

There's API documentation below, but searching for examples within the main/main.rb of this project is probably the best way to understand how this library can be used.

API Documentation

If you already know about Textmate Grammars

(So if you happen to be one of the approximately 200 people on earth that have used textmate grammars) Something like this in a tmLanguage.json file

{
    "match": "blah/blah/blah",
    "name": "punctuation.separator.attribute",
    "patterns": [
        {
            "include": "#evaluation_context"
        }
    ]
}

Becomes this inside main.rb

Pattern.new(
    match: /blah\/blah\/blah/,
    tag_as: "punctuation.separator.attribute",
    includes: [
        :evaluation_context,
    ],
)

And things like this

{
    "begin": "\\[\\[",
    "end": "\\]\\]",
    "beginCaptures": {
        "0": {
            "name": "punctuation.section.attribute.begin"
        }
    },
    "endCaptures": {
        "0": {
            "name": "punctuation.section.attribute.end"
        }
    },
    "name": "support.other.attribute",
    "patterns": [
        {
            "include": "#attributes_context"
        }
    ]
}

Become this

PatternRange.new(
    start_pattern: Pattern.new(
            match: /\[\[/,
            tag_as: "punctuation.section.attribute.begin"
        ),
    end_pattern: Pattern.new(
            match: /\]\]/,
            tag_as: "punctuation.section.attribute.end",
        ),
    tag_as: "support.other.attribute",
    # tag_content_as: "support.other.attribute", # <- alternative that doesn't double-tag the start/end
    includes: [
        :attributes_context,
    ]
)

To add something to the grammar's repository just do

grammar[:the_pattern_name] = Pattern.new(/blahblahblah/)

Where this gets really powerful is that you can nest/reuse patterns.

quote = Pattern.new(
    match: /"/,
    tag_as: "punctuation",
)

smalltalk = Pattern.new(
    match: /blah\/blah\/blah/,
    tag_as: "punctuation.separator.attribute",
    includes: [
        :evaluation_context,
    ],
)

phrase = Pattern.new(
    match: Pattern.new(/the man said: /).then(quote).then(smalltalk).then(quote),
    tag_as: "other.phrase",
)

For $base and $self (which I HIGHLY recommend AVOIDING) use includes: [ :$base, :$self ].

Readable Regex Guide

Regex is pretty hard to read, so this repo uses a library to help.

Pattern API Overview

Pattern.new(*attributes) or .then(*attributes) creates a new "shy" group
- example: Pattern.new(/foo/) => /(?:foo)/
.or(*attributes) adds an alternation (|)
- example: Pattern.new(/foo/).or(/bar/) => /foo|(?:bar)/
- please note you may need more shy groups depending on order: Pattern.new(/foo/).or(/bar/).maybe(@spaces) becomes (simplified) /(?:foo|bar)\s*/, NOT /(?:foo|bar\s*)/
maybe(*attributes) or .maybe(*attributes) causes the pattern to match zero or one times (?)
- example maybe(/foo/) => /(?:foo)?/
zeroOrMoreOf(*attributes) or .zeroOrMoreOf(*attributes) causes the pattern to be matched zero or more times (*)
- example zeroOrMoreOf(/foo/) => /(?:foo)*/
oneOrMoreOf(*attributes) or .oneOrMoreOf(*attributes) causes the pattern to be matched one or more times (+)
- example oneOrMoreOf(/foo/) => /(?:foo)+/
lookBehindFor(regex) or .lookBehindFor(regex) adds a positive lookbehind
- example lookBehindFor(/foo/) => /(?<=foo)/
lookBehindToAvoid(regex) or .lookBehindToAvoid(regex) adds a negative lookbehind
- example lookBehindToAvoid(/foo/) => /(?<!foo)/
lookAheadFor(regex) or .lookAheadFor(regex) adds a positive lookahead
- example lookAheadFor(/foo/) => /(?=foo)/
lookAheadToAvoid(regex) or .lookAheadToAvoid(regex) adds a negative lookahead
- example lookAheadToAvoid(/foo/) => /(?!foo)/
recursivelyMatch(reference) or .recursivelyMatch(reference) adds a regex subexpression
- for example here's a pattern that would match (), (()), ((())), etc
- as normal ruby-regex it would look like: /(\(\g<1>?\))/
- Pattern.new(match: Pattern.new( "(" ).recursivelyMatch("foobar").or("").then( ")" ), reference: "foobar")
- NOTE: there is a known (rare) issue of creating a reference in a parent, and then using oneOf([]), and then trying to use recursivelyMatch() inside the oneOf(). Instead of using oneOf([A,B]), just use A.or(B) to avoid this issue.
matchResultOf(reference) or .matchResultOf(reference) adds a backreference
- example Pattern.new(match: /foo|bar/, reference: "foobar").matchResultOf("foobar") => /(foo|bar)\1/
- matches: foofoo and barbar but not foobar
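
The last two helpers map onto regex features that are easy to mix up, so here is a plain-Ruby sketch (no library calls) of the underlying behaviour: a subexpression call re-runs a group's pattern, while a backreference repeats the text a group matched. The anchors and sample strings are additions made for the demonstration.

```ruby
# Subexpression call: \g<1> re-runs group 1 inside itself; the trailing "?"
# lets the recursion stop once the innermost pair is reached.
nested_parens = /\A(\(\g<1>?\))\z/

# Backreference: \1 must repeat whatever text group 1 actually matched.
doubled = /\A(foo|bar)\1\z/

parens_results  = ["()", "(())", "((()))", "(()"].map { |s| [s, nested_parens.match?(s)] }.to_h
doubled_results = ["foofoo", "barbar", "foobar"].map { |s| [s, doubled.match?(s)] }.to_h

puts parens_results.inspect
puts doubled_results.inspect
```
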

Pattern API Details

The *attributes can be:
- A regular expression: Pattern.new(/stuff/)
- Another pattern: Pattern.new(Pattern.new(/blah/))
- Or a bunch of named arguments: Pattern.new({ match: /stuff/, })

Here's a comprehensive list of named arguments (not all can be used together)

Pattern.new(
    # unit tests
    should_partial_match: [ "example text", ],
    should_fully_match:   [ "example text", ],
    should_not_partial_match: [ "example text", ],
    should_not_fully_match:   [ "example text", ],
    # NOTE! `should_not_fully_match` does NOT mean `should_partial_match`
    #       it's just an "IF does full match THEN throw error"
    
    # typical arguments
    match: //,     # regex or another pattern
    tag_as: "",    # string; a textmate scope
    # What's a textmate scope? see https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide#textmate-tokens-and-scopes
    # NOTE! this grammar contains two special scope-tools
    # $match and $reference()
    # for example: 
        # Pattern.new(
        #     match: /and|or|not/, # keyword operator names
        #     tag_as: "keyword.operator.wordlike.$match",
        #     # ^this will be equivalent to:
        #     #   "keyword.operator.wordlike.and"
        #     #   "keyword.operator.wordlike.or"
        #     #   "keyword.operator.wordlike.not"
        #     # depending on which one is matched at runtime
        # )
    # $reference() is very similar to $match
        # Pattern.new(
        #     reference: "blah",
        #     match: /and|or|not/, # keyword operator names
        #     tag_as: "keyword.operator.wordlike.$reference(blah)",
        # )
    includes: [
        :other_pattern_name,
        # alternatively include Pattern.new OR PatternRange.new directly
        PatternRange.new(
            # stuff 
        ),
    ],
    # NOTE! if "includes:" is used then Textmate will ignore any sub-"tag_as"
    #       if "match:" is a plain regex then this is not a problem (there are no sub-"tag_as"'s)
    #       BUT with something like match: Pattern.new(match:/sub-thing1/,).then(match:/sub-thing2/, tag_as: "blah")
    #       the tag_as: "blah" will get ignored
    #       and the included patterns will do all the tagging instead
    
    
    # 
    # repetition arguments
    # 
    at_least: 3.times,
    at_most: 5.times,
    how_many_times?: 5.times, # repeat exactly 5 times
    
    # the following two only work in repeating patterns (like at_least: ... or zeroOrMoreOf(), or oneOrMoreOf())
    as_few_as_possible?: false,
    # equivalent to regex lazy option
    # default value is false
    # see https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions
    # "as_few_as_possible?:" has an equivalent alias "lazy?:"
    dont_back_track?: false,
    # this is equivalent to regex atomic groups (can be efficient)
    # default value is false
    # http://www.rexegg.com/regex-disambiguation.html#atomic
    # "dont_back_track?:" has an equivalent alias "possessive?:"
    
    # 
    # advanced
    # 
    word_cannot_be_any_of: [ "word1" ], # default=[]
    # this is highly useful for matching var-names while not matching builtin-keywords
    # HOWEVER only use this if
    # 1. you're matching the whole word, not a small part of a word
    # 2. the pattern always matches something (the pattern can't match an empty string)
    # 3. what you consider a "word" matches the regex boundary's (\b) definition, meaning:
    #    underscore is not a separator, dash is a separator, etc
    # this has limited use cases but is very useful when needed
    
    reference: "", # creates a name that can be referenced later (for regex backreferences)
    preserve_references?: false, # default=false, setting to true will allow for 
    # reference name conflict. Usually the names are scrambled to prevent name-conflict
    comment: "", # a comment that will show up in the final generated-grammar file (rarely used)
    
    # 
    # internal API (don't use directly, but listed here for comprehensiveness' sake)
    # 
    backreference_key: "reference_name", # use matchResultOf(), which is equivalent to regex backreference
    subroutine_key: "reference_name",    # use recursivelyMatch(), which is equivalent to regex subroutine
    type: :lookAheadFor, # only valid values are :lookAheadFor, :lookAheadToAvoid, :lookBehindFor, :lookBehindToAvoid
                         # just used as a means of implementing lookAheadFor(), lookAheadToAvoid(), etc
    placeholder: "name", # useful for recursive includes or patterns; grammar[:a_pattern] will return a placeholder
                         # if the pattern has not been created yet (e.g. grammar[:a_pattern] = a_pattern)
                         # when a grammar is exported unresolved placeholders will throw an error
    patterns: [ Pattern.new() ], # this is used to implement oneOf()
    adjectives: [ :isAKeyword, ],   # a list of adjectives that describe the pattern, part of an untested grammar.tokenMatching() feature
    pattern_filter: ->(pattern) {}, # part of untested grammar.tokenMatching() feature, only works with placeholders
)
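
The repetition options above lean on regex concepts that are easier to see than to describe. The following plain-Ruby sketch (no library calls) shows greedy vs lazy matching and what an atomic group changes. Mapping `at_least: 3.times, at_most: 5.times` onto a `{3,5}` quantifier is an assumption made for the illustration, not something confirmed from the library's output.

```ruby
text = "<a><b>"

# Greedy (the default): take as much as possible, then back off.
greedy = text[/<.+>/]
# Lazy ("as few as possible"): take as little as possible, then grow.
lazy = text[/<.+?>/]

# A normal group can give characters back so the rest of the regex can match.
with_backtracking = /\Aa+a\z/.match?("aaa")
# An atomic group keeps everything it consumed, so the final "a" has nothing left.
without_backtracking = /\A(?>a+)a\z/.match?("aaa")

# Assumed regex equivalent of "at least 3, at most 5" repetitions.
bounded = /\A(?:ab){3,5}\z/
bounded_results = [2, 3, 5, 6].map { |n| [n, bounded.match?("ab" * n)] }.to_h

puts greedy, lazy, with_backtracking, without_backtracking, bounded_results.inspect
```
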

PatternRange

PatternRange.new is used to create a begin/end pattern rule.

# All available arguments (can't all be used at same time)
PatternRange.new(
    # typical arguments
    tag_as:  "",
    start_pattern: Pattern.new(),
    end_pattern:  Pattern.new(),
    includes: [],
    
    # unit testing arguments
    should_partial_match: [],
    should_not_partial_match: [],
    should_fully_match: [],
    should_not_fully_match: [],
    
    # 
    # advanced options
    # 
    tag_content_as: "", # NOTE: this is an alternative to "tag_as:", not to be used in combination with it
    while_pattern: Pattern.new(),
    # replaces "end_pattern" but the underlying behavior is strange, see: https://github.com/jeff-hykin/fornix/blob/74272281599174dcfc4ef163b770b2d5a1c5dc05/documentation/library/textmate_while.md#L1
    apply_end_pattern_last: false,
    # default=false, rarely used but can be important
    # see https://www.apeth.com/nonblog/stories/textmatebundle.html
    # also has an alias "end_pattern_last:"
    # also (for legacy reasons) has an alias "applyEndPatternLast:"
    
    tag_start_as: "", # not really used, instead just do start_pattern: Pattern.new(match:/blah/, tag_as:"blah")
    tag_end_as:   "", # not really used, instead just do end_pattern: Pattern.new(match:/blah/, tag_as:"blah")
    tag_while_as: "", # not really used, instead just do while_pattern: Pattern.new(match:/blah/, tag_as:"blah")
)

Unit Testing

By supplying one of the unit testing keys to the pattern, you can ensure that pattern only matches what you want it to.

should_partial_match asserts that the pattern matches anywhere in the test strings.
should_not_partial_match asserts that the pattern does not match at all in the test strings.
should_fully_match asserts that the pattern matches all the characters in the test strings.
should_not_fully_match asserts that the pattern does not match all the characters in the test strings.
- note: should_not_fully_match does not imply should_partial_match; that is, a complete failure to match also satisfies should_not_fully_match
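
To show how these four assertions relate, here is a plain-Ruby sketch of the semantics as described above. It is not the library's test runner: `partial_match?` and `full_match?` are hypothetical helpers, and `/\bmisc\b/` assumes that a word boundary behaves like regex `\b`.

```ruby
# "Partial" means the regex matches somewhere in the text.
def partial_match?(regex, text)
  regex.match?(text)
end

# "Full" means the regex can account for every character of the text.
def full_match?(regex, text)
  /\A(?:#{regex.source})\z/.match?(text)
end

naive   = /misc/
bounded = /\bmisc\b/

# Strings that a keyword pattern should NOT match, even partially.
should_not_partial = ["miscc", "Misc", "_misc"]
naive_failures   = should_not_partial.select { |t| partial_match?(naive, t) }
bounded_failures = should_not_partial.select { |t| partial_match?(bounded, t) }

# A string with no match at all is neither a partial nor a full match,
# so it trivially passes a "should not fully match" check.
no_match_at_all = !partial_match?(bounded, "xyz") && !full_match?(bounded, "xyz")

puts naive_failures.inspect, bounded_failures.inspect, no_match_at_all
```

The two strings the naive pattern wrongly accepts are the kind of failure that makes a pattern with these test cases raise an error at build time.
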

Setup for Contributing to the library

Everything is detailed in the documentation/setup.md!
