What is this?
This is a library for building TextMate grammars in a maintainable way. It has been the backbone of the Better C++ Syntax extension, and the beta version of this library is the backbone of Better Docker Syntax, Better Perl Syntax, and Better Shell Syntax.
What makes it more maintainable?
- The full power of a programming language
  - The first problem is that grammars themselves are static: no variables and no functions, which means code duplication is rampant. The most basic feature of this library is providing you with a full/legit programming language (Ruby) to develop grammars with.
  - Why Ruby? TextMate regex is based on Ruby regex. (But there are also 100 other reasons Ruby is perfect for this library.)
- Truly modular patterns
  - This was so much work to implement.
  - By default, patterns effectively couldn't be modular/composable because the only way to give a pattern a name is to give it a number (a capture group number). But when you take PatternAAA and put it inside PatternBBB, the numbers change: they're sequential, and PatternBBB has some groups of its own that offset all the group numbers in PatternAAA. This library solves the problem by doing all the heavy lifting of calculating and tracking how far the capture numbers are offset, so the syntax names are attached to the correct numbers. You never have to deal with capture group numbers again.
- No more double escaping regex
  - TextMate uses regex, which has its own escape patterns. But grammars are written in JSON/XML/tmLanguage formats, which have a different set of escape patterns. Double escaping regex inside a JSON string is an absolute nightmare.
  - This library fixes that because Ruby has regex literals, and also string literals (also called raw strings). So you can finally just write what you mean (and you'll actually have syntax highlighting!)
- Regex has a more-readable option
  - instead of /(?<=thing)otherThing/ you can write lookBehindFor(/thing/).then(/otherThing/)
  - instead of /a?/ you can write maybe(/a/)
  - instead of /(a|b|c)/ you can write oneOf([ /a/, /b/, /c/ ])
  - etc
  - see here for some more examples
  - readable regex has a LOT of support, including advanced tools like backreferences and atomic capture groups. Sadly we haven't had time to document all of the features yet.
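To make the capture-group problem from the "Truly modular patterns" point concrete, here is a minimal sketch in plain Ruby. It does not use this library at all; `pattern_aaa` and `pattern_bbb` are illustrative names, and the regexes are made up for the demonstration.

```ruby
# Plain Ruby, no library involved: group numbers shift when a regex is nested.
pattern_aaa = /(\d+)-(\d+)/   # on its own: group 1 = first number, group 2 = second

standalone       = pattern_aaa.match("10-20")
standalone_first = standalone[1]   # "10"

# pattern_bbb has a group of its own in front of the embedded pattern_aaa
pattern_bbb = /(\w+): #{pattern_aaa.source}/

nested           = pattern_bbb.match("range: 10-20")
nested_group_one = nested[1]   # "range", no longer the first number
nested_first     = nested[2]   # the first number moved from group 1 to group 2
```

Any scope name that was attached to "group 1" of `pattern_aaa` would now land on the wrong text, which is the bookkeeping the library is described as doing for you.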
Does this make it better than Tree-sitter parsers?
NO.
- Tree-sitter is vastly (categorically) superior in parsing capability, runtime efficiency, maintainability, and expressiveness of output. Sadly, some editors only support TextMate grammars, and this tool is for them.
How do I use this?
Here's a quick rundown.
gem install ruby_grammar_builder
Note: this gem has not been tested with Ruby 3
require "ruby_grammar_builder"

# create a new grammar
grammar = Grammar.new(
    name: "C++",
    scope_name: "source.cpp",
    fileTypes: [
        "cc",
        "cpp",
        "cp",
        "cxx",
        "c++",
        "C",
        "h",
        "hh",
        "hpp",
        "h++"
    ],
    version: "",
    information_for_contributors: [
        "This json file was auto generated by a much-more-readable ruby file",
        "(e.g. don't edit it directly)",
    ],
)
# create a pattern for a keyword
grammar[:incorrect_misc_keyword] = Pattern.new(
    match: /misc/,
    tag_as: "keyword.other.misc",
    # note: don't try matching newlines because Patterns can't match more than one line
    # (this is NOT a library limitation, but a limitation of the Textmate engines that code editors use)
    # below is both documentation e.g. "hey this is what the pattern should/shouldn't do"
    # but it is also a unit test that actually makes sure those things are true
    should_fully_match: [ "misc" ],
    should_partial_match: [ "misc.", "= misc", ],
    should_not_partial_match: [ "miscc", "Misc", "_misc" ],
)
# NOTE: ^ that will throw an error because it fails one of the test cases
grammar[:correct_misc_keyword] = Pattern.new(
    match: @word_boundary.then(/misc/).then(@word_boundary),
    # there are a handful of other helpers like @word_boundary, see: https://github.com/jeff-hykin/ruby_grammar_builder/blob/master/documentation/patterns.md
    tag_as: "keyword.other.misc",
    # note: don't try matching newlines because Patterns can't match more than one line
    # (this is NOT a library limitation, but a limitation of the Textmate engines that code editors use)
    # below is both documentation e.g. "hey this is what the pattern should/shouldn't do"
    # but it is also a unit test that actually makes sure those things are true
    should_fully_match: [ "misc" ],
    should_partial_match: [ "misc.", "= misc", ],
    should_not_partial_match: [ "miscc", "Misc", "_misc" ],
)
# create a pattern-range for something like string quotes
grammar[:string] = PatternRange.new(
    tag_as: "string.quoted.double",
    start_pattern: Pattern.new(
        match: /"/,
        tag_as: 'punctuation.definition.string'
    ),
    end_pattern: Pattern.new(
        match: /"/,
        tag_as: 'punctuation.definition.string'
    ),
    includes: [
        :escape_pattern,
        # even though we haven't created :escape_pattern yet
        # that is okay, because we're about to
        # (this is a necessary feature for recursive patterns that include themselves)
    ],
)
grammar[:escape_pattern] = Pattern.new(
    match: /\\./,
    tag_as: "constant.character.escape",
)
#
# add them to "top level"
#
grammar[:$initial_context] = [
    :correct_misc_keyword, # <- first tries to find keyword
    :string, # <- if that fails, it tries to find a string pattern
]
#
# export to a file
#
grammar.save_to(
    syntax_name: "demo_syntax",
    syntax_dir: "./syntaxes",
    tag_dir: "./demo_syntax",
)
Here's an example of modular patterns:
quote = Pattern.new(
    match: /"/,
    tag_as: "quote",
)
smalltalk = Pattern.new(
    match: /blah\/blah\/blah/,
    tag_as: "string.smalltalk",
)
phrase = Pattern.new(
    match: Pattern.new(/the man said: /).then(quote).then(smalltalk).then(quote),
    tag_as: "other.phrase",
)
# NOTE: PatternRanges currently can't be put inside of Patterns. The TextMate engine doesn't support this, and we have not found a good enough workaround yet
Using the generated grammar
If you want to use the grammar inside VS Code
- watch a quick tutorial on how to publish a VS Code extension
- And here is an example of what needs to be inside the package.json when publishing the output from this library
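As a rough illustration of the package.json piece, the sketch below builds the `contributes` section that VS Code reads for a TextMate grammar and prints it as JSON. The language id `cpp`, the listed extensions, and the grammar path `./syntaxes/demo_syntax.tmLanguage.json` are assumptions made for this example (the path in particular is a guess at where the quick-start output would land), so adjust them to match your own project.

```ruby
require "json"

# Assumptions (not confirmed by this README): the language id, the file
# extensions, and the exact file name of the generated grammar.
contributes = {
  "contributes" => {
    "languages" => [
      { "id" => "cpp", "aliases" => ["C++"], "extensions" => [".cpp", ".hpp"] },
    ],
    "grammars" => [
      {
        "language"  => "cpp",
        # must equal the scope_name given to Grammar.new
        "scopeName" => "source.cpp",
        "path"      => "./syntaxes/demo_syntax.tmLanguage.json",
      },
    ],
  },
}

# the text to merge into the extension's package.json
package_json_fragment = JSON.pretty_generate(contributes)
puts package_json_fragment
```

The important link is `scopeName`: it has to match the `scope_name` of the generated grammar, otherwise the editor will not associate the grammar with the language.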
What names should I use for tag_as:?
Please, please, please don't invent your own names if you don't have to. See this guide for coming up with names. Before that document there was effectively no standard, and it made theming very hard.
Where's a more full example?
There's API documentation below, but searching for examples within the main/main.rb of this project is probably the best way to understand how this library can be used.
API Documentation
If you already know about Textmate Grammars
(So if you happen to be one of the approximately 200 people on earth who have used TextMate grammars.) Something like this in a tmLanguage.json file
{
    "match": "blah/blah/blah",
    "name": "punctuation.separator.attribute",
    "patterns": [
        {
            "include": "#evaluation_context"
        }
    ]
}
Becomes this inside main.rb
Pattern.new(
    match: /blah\/blah\/blah/,
    tag_as: "punctuation.separator.attribute",
    includes: [
        :evaluation_context,
    ],
)
And things like this
{
    "begin": "\\[\\[",
    "end": "\\]\\]",
    "beginCaptures": {
        "0": {
            "name": "punctuation.section.attribute.begin"
        }
    },
    "endCaptures": {
        "0": {
            "name": "punctuation.section.attribute.end"
        }
    },
    "name": "support.other.attribute",
    "patterns": [
        {
            "include": "#attributes_context"
        }
    ]
}
Become this
PatternRange.new(
    start_pattern: Pattern.new(
        match: /\[\[/,
        tag_as: "punctuation.section.attribute.begin"
    ),
    end_pattern: Pattern.new(
        match: /\]\]/,
        tag_as: "punctuation.section.attribute.end",
    ),
    tag_as: "support.other.attribute",
    # tag_content_as: "support.other.attribute", # <- alternative that doesn't double-tag the start/end
    includes: [
        :attributes_context,
    ]
)
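The escaping difference between the JSON and Ruby versions can be checked with a few lines of plain Ruby. This is only a sketch using the standard `json` library (it does not call this library); the variable names are illustrative.

```ruby
require "json"

# The pattern as a Ruby regex literal: one level of escaping.
begin_pattern = /\[\[/

# The same pattern as it has to appear inside a JSON grammar file:
# every backslash is escaped a second time.
json_value = begin_pattern.source.to_json

puts begin_pattern.source   # prints \[\[
puts json_value             # prints "\\[\\["
```

Writing the regex literal and letting a tool produce the JSON string means the second layer of escaping never has to be typed by hand.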
To add something to the grammar's repository just do
grammar[:the_pattern_name] = Pattern.new(/blahblahblah/)
Where this gets really powerful is that you can nest/reuse patterns.
quote = Pattern.new(
    match: /"/,
    tag_as: "punctuation",
)
smalltalk = Pattern.new(
    match: /blah\/blah\/blah/,
    tag_as: "punctuation.separator.attribute",
    includes: [
        :evaluation_context,
    ],
)
phrase = Pattern.new(
    match: Pattern.new(/the man said: /).then(quote).then(smalltalk).then(quote),
    tag_as: "other.phrase",
)
For $base and $self (which I HIGHLY recommend AVOIDING) use includes: [ :$base, :$self ].
Readable Regex Guide
Regex is pretty hard to read, so this repo uses a library to help.
Pattern API Overview
- Pattern.new(*attributes) or .then(*attributes) creates a new "shy" group
  - example: Pattern.new(/foo/) => /(?:foo)/
- .or(*attributes) adds an alternation (|)
  - example: Pattern.new(/foo/).or(/bar/) => /foo|(?:bar)/
  - please note you may need more shy groups depending on order: Pattern.new(/foo/).or(/bar/).maybe(@spaces) becomes (simplified) /(?:foo|bar)\s*/ NOT /(?:foo|bar\s*)/
- maybe(*attributes) or .maybe(*attributes) causes the pattern to match zero or one times (?)
  - example: maybe(/foo/) => /(?:foo)?/
- zeroOrMoreOf(*attributes) or .zeroOrMoreOf(*attributes) causes the pattern to be matched zero or more times (*)
  - example: zeroOrMoreOf(/foo/) => /(?:foo)*/
- oneOrMoreOf(*attributes) or .oneOrMoreOf(*attributes) causes the pattern to be matched one or more times (+)
  - example: oneOrMoreOf(/foo/) => /(?:foo)+/
- lookBehindFor(regex) or .lookBehindFor(regex) adds a positive lookbehind
  - example: lookBehindFor(/foo/) => /(?<=foo)/
- lookBehindToAvoid(regex) or .lookBehindToAvoid(regex) adds a negative lookbehind
  - example: lookBehindToAvoid(/foo/) => /(?<!foo)/
- lookAheadFor(regex) or .lookAheadFor(regex) adds a positive lookahead
  - example: lookAheadFor(/foo/) => /(?=foo)/
- lookAheadToAvoid(regex) or .lookAheadToAvoid(regex) adds a negative lookahead
  - example: lookAheadToAvoid(/foo/) => /(?!foo)/
- recursivelyMatch(reference) or .recursivelyMatch(reference) adds a regex subexpression
  - for example, here's a pattern that would match (), (()), ((())), etc
  - as normal ruby-regex it would look like: /(\(\g<1>?\))/ (the recursive call has to be optional, otherwise the recursion never ends)
  - Pattern.new(match: Pattern.new( "(" ).recursivelyMatch("foobar").or("").then( ")" ), reference: "foobar")
  - NOTE: there is a known (rare) issue when creating a reference in a parent, then using oneOf([]), and then trying to use recursivelyMatch() inside the oneOf(). Instead of using oneOf([A,B]), just use A.or(B) to avoid this issue.
- matchResultOf(reference) or .matchResultOf(reference) adds a backreference
  - example: Pattern.new(match: /foo|bar/, reference: "foobar").matchResultOf("foobar") => /(foo|bar)\1/
  - matches: foofoo and barbar but not foobar
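A few of the plain-regex forms listed above can be sanity-checked directly in Ruby. The sketch below uses only Ruby's built-in regex engine, not this library, and the `\A` / `\z` anchors are added just so each check covers the whole string.

```ruby
# Plain Ruby regex only; none of these lines call the library.

# backreference: whatever group 1 matched must repeat exactly
backreference = /\A(foo|bar)\1\z/
foofoo_matches = backreference.match?("foofoo")   # true
foobar_matches = backreference.match?("foobar")   # false

# subexpression call: group 1 may contain itself, so parentheses must nest
nested_parens = /\A(\(\g<1>?\))\z/
balanced_matches   = nested_parens.match?("((()))")   # true
unbalanced_matches = nested_parens.match?("(()")      # false

# alternation binds loosely, so the shy group decides what \s* applies to
grouped_first  = /\A(?:foo|bar)\s*\z/.match?("foo  ")   # true: \s* follows either word
grouped_second = /\A(?:foo|bar\s*)\z/.match?("foo  ")   # false: \s* only follows "bar"
```

The last pair is the reason for the note about shy groups under .or(): where the group closes changes what the trailing part of the pattern attaches to.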
Pattern API Details
- The *attributes can be:
  - A regular expression: Pattern.new(/stuff/)
  - Another pattern: Pattern.new(Pattern.new(/blah/))
  - Or a bunch of named arguments: Pattern.new({ match: /stuff/, })
Here's a comprehensive list of named arguments (not all can be used together)
Pattern.new(
    # unit tests
    should_partial_match: [ "example text", ],
    should_fully_match: [ "example text", ],
    should_not_partial_match: [ "example text", ],
    should_not_fully_match: [ "example text", ],
        # NOTE! `should_not_fully_match` does NOT mean `should_partial_match`
        # it's just an "IF does full match THEN throw error"
    # typical arguments
    match: //, # regex or another pattern
    tag_as: "", # string; a textmate scope
        # What's a textmate scope? see https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide#textmate-tokens-and-scopes
        # NOTE! this grammar contains two special scope-tools
        # $match and $reference()
        # for example:
        #     Pattern.new(
        #         match: /and|or|not/, # keyword operator names
        #         tag_as: "keyword.operator.wordlike.$match",
        #         # ^this will be equivalent to:
        #         #     "keyword.operator.wordlike.and"
        #         #     "keyword.operator.wordlike.or"
        #         #     "keyword.operator.wordlike.not"
        #         # depending on which one is matched at runtime
        #     )
        # $reference() is very similar to $match
        #     Pattern.new(
        #         reference: "blah",
        #         match: /and|or|not/, # keyword operator names
        #         tag_as: "keyword.operator.wordlike.$reference(blah)",
        #     )
    includes: [
        :other_pattern_name,
        # alternatively include Pattern.new OR PatternRange.new directly
        PatternRange.new(
            # stuff
        ),
    ],
        # NOTE! if "includes:" is used then Textmate will ignore any sub-"tag_as"
        # if "match:" is regex then this is not a problem (there are no sub-"tag_as"'s)
        # BUT with something like match: Pattern.new(match:/sub-thing1/,).then(match:/sub-thing2/, tag_as: "blah")
        # the tag_as: "blah" will get ignored
        # and the included patterns will do all the tagging instead
    #
    # repetition arguments
    #
    at_least: 3.times,
    at_most: 5.times,
    how_many_times?: 5.times, # repeat exactly 5 times
    # the following two only work in repeating patterns (like at_least: ... or zeroOrMoreOf(), or oneOrMoreOf())
    as_few_as_possible?: false,
        # equivalent to regex lazy option
        # default value is false
        # see https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions
        # "as_few_as_possible?:" has an equivalent alias "lazy?:"
    dont_back_track?: false,
        # this is equivalent to regex atomic groups (can be efficient)
        # default value is false
        # http://www.rexegg.com/regex-disambiguation.html#atomic
        # "dont_back_track?:" has an equivalent alias "possessive?:"
    #
    # advanced
    #
    word_cannot_be_any_of: [ "word1" ], # default=[]
        # this is highly useful for matching var-names while not matching builtin-keywords
        # HOWEVER only use this if
        # 1. you're matching the whole word, not a small part of a word
        # 2. the pattern always matches something (the pattern can't match an empty string)
        # 3. what you consider a "word" matches the regex boundary's (\b) definition, meaning:
        #    underscore is not a separator, dash is a separator, etc
        # this has limited use cases but is very useful when needed
    reference: "", # to create a name that can be referenced later (for regex backreferences)
    preserve_references?: false, # default=false, setting to true will allow for
        # reference name conflict. Usually the names are scrambled to prevent name-conflict
    comment: "", # a comment that will show up in the final generated-grammar file (rarely used)
    #
    # internal API (don't use directly, but listed here for comprehensiveness' sake)
    #
    backreference_key: "reference_name", # use matchResultOf(), which is equivalent to regex backreference
    subroutine_key: "reference_name", # use recursivelyMatch(), which is equivalent to regex subroutine
    type: :lookAheadFor, # only valid values are :lookAheadFor, :lookAheadToAvoid, :lookBehindFor, :lookBehindToAvoid
        # just used as a means of implementing lookAheadFor(), lookAheadToAvoid(), etc
    placeholder: "name", # useful for recursive includes or patterns; grammar[:a_pattern] will return a placeholder
        # if the pattern has not been created yet (e.g. grammar[:a_pattern] = a_pattern)
        # when a grammar is exported unresolved placeholders will throw an error
    patterns: [ Pattern.new() ], # this is used to implement oneOf()
    adjectives: [ :isAKeyword, ], # a list of adjectives that describe the pattern, part of an untested grammar.tokenMatching() feature
    pattern_filter: ->(pattern) {}, # part of untested grammar.tokenMatching() feature, only works with placeholders
)
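The "lazy" and "atomic" options map onto ordinary regex features, which can be seen in plain Ruby. The sketch below does not call the library; it assumes (without confirmation from the source) that `at_least: 3.times, at_most: 5.times` corresponds to a `{3,5}` quantifier, and only demonstrates the regex behavior itself.

```ruby
text = "aaaaa"

# greedy (the default) takes as many repetitions as allowed
greedy = text[/a{3,5}/]    # "aaaaa"
# lazy takes as few repetitions as allowed
lazy   = text[/a{3,5}?/]   # "aaa"

# atomic group: once (?>a+) has matched, it never gives characters back
backtracking = /\Aa+a\z/.match?("aaa")       # true: a+ gives back one "a"
atomic       = /\A(?>a+)a\z/.match?("aaa")   # false: nothing is left for the final "a"
```

The atomic case shows why the option is opt-in: it can save the engine work, but it also changes what the pattern matches.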
PatternRange
PatternRange.new is used to create a begin/end pattern rule.
# All available arguments (can't all be used at same time)
PatternRange.new(
    # typical arguments
    tag_as: "",
    start_pattern: Pattern.new(),
    end_pattern: Pattern.new(),
    includes: [],
    # unit testing arguments
    should_partial_match: [],
    should_not_partial_match: [],
    should_fully_match: [],
    should_not_fully_match: [],
    #
    # advanced options
    #
    tag_contents_as: "", # NOTE; this is an alternative to "tag_as:" not to be used in combination
    while_pattern: Pattern.new(),
        # replaces "end_pattern" but the underlying behavior is strange, see: https://github.com/jeff-hykin/fornix/blob/74272281599174dcfc4ef163b770b2d5a1c5dc05/documentation/library/textmate_while.md#L1
    apply_end_pattern_last: false,
        # default=false, rarely used but can be important
        # see https://www.apeth.com/nonblog/stories/textmatebundle.html
        # also has an alias "end_pattern_last:"
        # also (for legacy reasons) has an alias "applyEndPatternLast:"
    tag_start_as: "", # not really used, instead just do start_pattern: Pattern.new(match:/blah/, tag_as:"blah")
    tag_end_as: "", # not really used, instead just do end_pattern: Pattern.new(match:/blah/, tag_as:"blah")
    tag_while_as: "", # not really used, instead just do while_pattern: Pattern.new(match:/blah/, tag_as:"blah")
)
Unit Testing
By supplying one of the unit testing keys to the pattern, you can ensure that the pattern only matches what you want it to.
- should_partial_match asserts that the pattern matches anywhere in the test strings.
- should_not_partial_match asserts that the pattern does not match at all in the test strings.
- should_fully_match asserts that the pattern matches all the characters in the test strings.
- should_not_fully_match asserts that the pattern does not match all the characters in the test strings.
  - note: should_not_fully_match does not imply should_partial_match; that is, a failure to match satisfies should_not_fully_match
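The difference between a partial and a full match can be spelled out in plain Ruby. The helpers below are hypothetical (`partial_match?` and `full_match?` are not part of this library and say nothing about how it implements its checks), and `\b` is used on the assumption that it stands in for the @word_boundary helper.

```ruby
# Hypothetical helpers, written only to illustrate the two notions.
def partial_match?(regex, text)
  # true if the regex matches anywhere in the text
  regex.match?(text)
end

def full_match?(regex, text)
  # true only if the regex can cover the text from first to last character
  Regexp.new("\\A(?:#{regex.source})\\z", regex.options).match?(text)
end

loose  = /misc/       # like the quick-start's incorrect keyword pattern
strict = /\bmisc\b/   # like the corrected one, assuming \b for @word_boundary

loose_hits_miscc  = partial_match?(loose, "miscc")    # true: would fail should_not_partial_match
strict_hits_miscc = partial_match?(strict, "miscc")   # false
strict_partial    = partial_match?(strict, "= misc")  # true
strict_full       = full_match?(strict, "misc")       # true
strict_not_full   = full_match?(strict, "misc.")      # false: partial match only
```

Note that `full_match?(strict, "miscc")` is also false, which mirrors the note above: a string that does not match at all still satisfies should_not_fully_match.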
Setup for Contributing to the library
Everything is detailed in the documentation/setup.md!