Reg is a library for pattern matching in ruby data structures. Reg provides
Regexp-like match and match-and-replace for all data structures (particularly
Arrays, Objects, and Hashes), not just Strings.

Reg is best thought of in analogy to regular expressions; Regexps are special
data structures for matching Strings; Regs are special data structures for
matching ANY type of ruby data (Strings included, using Regexps).

If you have any questions, comments, problems, new feature requests, or just
want to figure out how to make it work for what you need to do, contact me:
    reg _at_ inforadical _dot_ net

Reg is a RubyForge project. RubyForge is another good place to send your
bug reports or whatever:
    http://rubyforge.org/projects/reg/

The implementation:
The engine (according to what I can tell from Friedl's book,
_Mastering_Regular_Expressions_) is a traditional NFA, that is, a backtracking
engine, with non-greedy alternation. For performance, I'd like to move to a
more DFA-oriented approach (trying many different alternatives in parallel).

Status:
The only real (public) matching operator implemented thus far is:
    Reg::Reg#=== (and descendants).
It doesn't return a normalized boolean; it will return a false value on no
match or a true value if there was a match, but beyond that, nothing is
guaranteed.

A number of important features are unimplemented at this point, most notably
backreferences and substitutions.

The backtracking engine appears to be completely functional now. Vector
Reg::And doesn't work.

This release should be much faster, for 2 reasons. First, the cursor library
has been dropped in favor of sequence, which is much faster. Second, and more
important, the interpreted backtracking engine has been replaced with a
compiled engine. This means completely new implementations of Reg::Array and
all the vector matchers. (I tried to write compilers for Reg::Hash and
Reg::Object, but they didn't get completed...)
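Because Reg::Reg#=== is only promised (under Status above) to return some
false value or some true value, calling code is probably safest treating the
result as truthy or falsy rather than comparing it with true or false. The
following is a minimal sketch of that idea in plain ruby. PrefixMatcher and
describe are hypothetical names made up for this example; they are not part
of Reg and say nothing about what Reg actually returns.

```ruby
# Hypothetical stand-in for a matcher; NOT part of the Reg library.
# Its === returns nil on no match and a truthy, non-boolean value on a match.
class PrefixMatcher
  def initialize(prefix)
    @prefix = prefix
  end

  # Returns the prefix itself (truthy) or nil (falsy), never true/false.
  def ===(other)
    other.is_a?(String) && other.start_with?(@prefix) ? @prefix : nil
  end
end

# case/when calls matcher === item and only tests truthiness,
# so it works with un-normalized results.
def describe(matcher, item)
  case item
  when matcher then "match"
  else "no match"
  end
end

m = PrefixMatcher.new("foo")
puts describe(m, "foobar")          # the truthy result counts as a match
puts describe(m, 42)                # the nil result counts as no match
puts((m === "foobar") == true)      # comparing against true is unreliable
puts(!!(m === "foobar"))            # normalize explicitly if a boolean is needed
```

The same reasoning applies to if and unless: they only test truthiness, so
they are unaffected by whatever true value a matcher happens to return.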
The majority of my concerns about performance are now resolved, although the
backtracking algorithm is still very simplistic, and could do with a good dose
of fixed match cognizance.

This table compares syntax of Reg and Regexp for various constructs. Keep
in mind that all Regs are ordinary ruby expressions. The special syntax
is achieved by overriding ruby operators.

In the following examples,
re,re1,re2 represent arbitrary regexp subexpressions,
r,r1,r2 represent arbitrary reg subexpressions
s,t represent any single character (perhaps appropriately escaped, if the char is magical)

reg                 regexp             reg class            #description
+[r1,r2,r3]         /re1re2re3/        Reg::Array           #sequence
-[r1,r2]            (re1re2)           Reg::Subseq          #subsequence
r.lit               \re                Reg::Literal         #escaping a magical
regproc{r}          #{re}              (not really named)   #dynamic inclusion
r1|r2 or :OR        (re1|re2) or [st]  Reg::Or              #alternation
~r                  [^s]               Reg::Not             #negation (for scalar r and s)
r.*                 re*                Reg::Repeat          #zero or more matches
r.+                 re+                Reg::Repeat          #one or more matches
r.-                 re?                Reg::Repeat          #zero or one matches
r*n                 re{n}              Reg::Repeat          #exactly n matches
r*(n..m)            re{n,m}            Reg::Repeat          #at least n, at most m matches
r-n                 re{,n}             Reg::Repeat          #at most n matches
r+m                 re{m,}             Reg::Repeat          #at least m matches
OB                  .                  Reg::Any             #a single item
OBS                 .*                 Reg::AnyMultiple     #zero or more items
BR(1,2)             \1,\2              Reg::Backref         #backreference ***
r>>x or sub         sub,gsub           Reg::Transform       #search and replace ***
:a<<r               ()                 Reg::Bound           #capture into a backreference ***

here are features of reg that don't have an equivalent in regexp
r.la                                   Reg::Lookahead       #lookahead ***
~-[]                                   Reg::Lookahead       #subsequence negation w/lookahead ***
& or :AND                              Reg::And             #all alternatives match
^ or :XOR                              Reg::Xor             #exactly one of alternatives matches
+{r1=>r2}                              Reg::Hash            #hash matcher
-{name=>r}                             Reg::Object          #object matcher
obj.reg                                Reg::Fixed           #turn any ruby object into a reg that matches if obj.=== succeeds
/re/.sym                               Reg::Symbol          #a symbol regex
item_that{|x|rcode}                    Reg::ItemThat        #a proc{} that responds to === by invoking the proc's call
OBS as un-anchor                       Reg::AnyMultiple     #opposite of ^ and $ when placed at edges of a reg array (kinda cheesy)
name=r                                 (just a var assign)  #named subexpressions
Reg::var                               Reg::Variable        #recursive matches via Reg::Variable & Reg::Constants
Reg::const                             Reg::Constant

*** = not implemented yet.

"... the effect of drinking a Pan Galactic Gargle Blaster is like having your
brains smashed out by a slice of lemon wrapped round a large gold brick."
    -- Douglas Adams, _Hitchhiker's_Guide_to_the_Galaxy_

Reg is kind of hard to bend your brain around, so here are some examples:

0.4.6 examples:

Matches a single item whose method 'length' returns a Fixnum:
    item_that.length.is_a? Fixnum

There's a new way to match hashes; it looks more-or-less like the old way and
behaves a little differently. The old type of hash matcher (now called an
unordered hash matcher) looked like:
    +{/fo+/=>8, /ba+r/=>9}

The new syntax uses +[] instead of +{} and ** instead of =>. It's called an
ordered hash matcher. The order of filter pairs given in an ordered matcher is
the order comparisons are done in. The same is not true within unordered
matchers, where order is inferred from the nature of the key matchers.
The ordered equivalent of the last example is:
    +[/fo+/**8, /ba+r/**9]

Both match hashes whose keys match /fo+/ with value of 8 or match /ba+r/ with
value of 9 (and nothing else). But if the data looks like: {"foobar"=>8}, then
it is guaranteed to match the second (ordered) matcher, because /fo+/ is
always given a chance first, but might or might not match the first
(unordered) matcher, because the order is unspecified.

Here's an example of a Reg::Knows matcher, which matches objects that have the
#slice method:
    -:slice

0.4.5 examples:

Matches array containing exactly 2 elements; 1st is another array, 2nd is
integer:
    +[Array,Integer]

Like above, but 1st is array of arrays of symbol:
    +[+[+[Symbol+0]+0],Integer]

Matches array of at least 3 consecutive symbols and nothing else:
    +[Symbol+3]

Matches array with at least 3 symbols in it somewhere:
    +[OBS, Symbol+3, OBS]

Matches array of at most 6 strings starting with 'g':
    +[/^g/-6]    #no .reg necessary for regexp

Matches array of between 5 and 9 hashes containing a key :k pointing to
something non-nil:
    +[ +{:k=>~nil.reg}*(5..9) ]

Matches an object with Integer instance variable @k and property (ie method)
foobar that returns a string with 'baz' somewhere in it:
    -{:@k=>Integer, :foobar=>/baz/}

Matches array of 6 hashes with 6 as a value of every key, followed by
18 objects with an attribute @s which is a String:
    +[ +{OB=>6}*6, -{:@s=>String}*18 ]

Api changes since 0.4.5:
Reg::Hash semantics have been changing recently.... Reg::Object may be changed
to suit.

Api changes since 0.4.0:
Reg() makes Reg::Arrays now, not hash matchers; use Rah to make hashes.
Array#reg and Hash#reg no longer return a Reg::Array or Reg::Hash.

In fact the names of most classes have changed; they've been moved into the
Reg namespace (aka module). The previous Reg module is now named Reg::Reg.
The other names have changed in the obvious way. RegArray is now Reg::Array,
etc.
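The renaming just described is ordinary ruby constant handling, which a toy
example can illustrate. Everything below is hypothetical: the Toy namespace,
ToyArray and Widget are invented for this sketch and are not Reg's real
definitions. It only shows why an old class name can be kept as an alias while
the old mixin name cannot, since that constant now names the namespace itself.

```ruby
# Toy namespace for illustration only; NOT Reg's actual code.
module Toy
  # Plays the role of the operator mixin (the module now called Reg::Reg).
  module Toy
    # Unary plus, standing in for one of the overridden operators.
    def +@
      [:matcher, self]
    end
  end

  # Plays the role of a relocated class (RegArray becoming Reg::Array).
  class Array; end
end

# An old top-level class name can simply be aliased to the relocated class.
ToyArray = Toy::Array

# The old mixin name cannot be aliased the same way: the constant Toy is
# already taken by the namespace, so classes must name the inner module.
class Widget
  include Toy::Toy
end

w = Widget.new
tagged = +w                        # unary + comes from the included mixin
puts tagged.first.inspect
puts ToyArray.equal?(Toy::Array)   # the alias and the new name are one class
puts Toy.equal?(Toy::Toy)          # namespace and mixin are distinct modules
```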
For the most part these changes don't affect users (if any) because they leave
the shortest representation (the mini-language) unaffected. The one exception
(where you have to refer to a reg module name) is the name of the module Reg,
which is now Reg::Reg. If anyone has 'include Reg' in their class or module to
get all of Reg's yummy operators, look out: it's changed to 'include Reg::Reg'
instead. Aliases are mostly provided from the new to the old class names...
but an alias from Reg to Reg::Reg obviously creates a conflict.

the api (mostly unimplemented):
r represents a reg
t represents a transform
o represents any object
a represents an array
s represents a string
h represents a hash
scan represents the entire stringscanner interface...
  -(scan,skip,match?,check and their unanchored and backward forms)
c represents a ::Sequence
! implies in-place modification

r===o                #v
r=~o                 #v
ach=~r               #v-
r.match o            #result contains changes
r.match! o

coah.sub!(r[,t])
coah.gsub!(r[,t])
oah.sub(r[,t])       #modifies in result
oah.gsub(r[,t])      #modifies in result
a.scan(r)            #modifies in result

c.index/rindex r     #no modify
c.slice r            #no modify
c.slice! r           #deletes matching elems
c.split r            #no modify
c.find_all r         #like String#scan
c.find r

ho.find_all [r-key,] r-value
ho.find [r-key,] r-value
ho.index r

a.split r
s.find_all r
s.find r
s.delete r
s.delete! r
s.delete_all r
s.delete_all! r

#these require wrapping library methods to also take different args
ac.slice r
ahoc.slice! r
o=~r
oahc[r]
oahc[r]=t
c.scan(r)
a.find_all r
a.find r

#i'd like to have these, but they can't safely be wrapped,
#so i'll have to think of different names.
as.index/rindex r    #=> offset/roffset ...use exist?/existback? instead
s.slice r            #=> rslice
s.slice! r
s.split r            #=> rsplit
s[r]                 #=> s-[r]
s[r]=t               #=> s-[r,t]
s.sub(r[,t])         #=> rsub
s.gsub(r[,t])        #=> grsub
s.sub!(r[,t])        #etc
s.gsub!(r[,t])
s.scan(r)            #=> rscan... note scan only conflicts; the rest of the stringscanner interface
                     #   can be unchanged.
#maybe stuff from Enumerable?

Reg::Progress work list:
phase 1: array only
v fill out backtrack
v import asserts from backtrace=>backtrack
v disable backtrace
  backtrack should respect update_di
v callers of backtrace must use a progress instead
v call backtrack on progress instead of backtrace...
v matchsets unmodified as yet (ok, except repeat and subseq matchsets)
v push_match and push_matchset need to be called in right places in Reg::Array (what else?)
  note which parts of regarray.rb have been obsoleted by regprogress.rb

phase 2:
  eventually, MatchSet#next_match will take a cursor parameter, and return a number of items consumed or progress or nil
x entering some types of subreg creates a subprogress
  arrange for process_deferreds to be called in the right places
  create Reg::Bound (for vars) and Reg::SideEffect, Reg::Undo, Reg::Eventually with sugar
    -Reg#bind, Reg#side_effect, Reg#undo, Reg#eventually
    -and of course Reg::Transform and Reg::Replace
    -Reg::Reg#>>(Reg::Replace) makes a Transform, and certain things can mix in module Replace
  create Reg::BackRef
  should Reg::BackRef be a module?
  should Reg::BackRef be a Deferred?
  Reg::Transform calls Reg::Progress#eventually?
  implicit progress needs to be made when doing standalone compare of
    -Reg::Object, Reg::Hash, Reg::Array, Reg::logicals, Reg::Bound, Reg::Transform, maybe others

these are stubbed at least now:
  Backtrace.clean_result and Backtrace.check_result should operate on progresses instead
v need Reg::Progress#bt_match,last_next_match,to_result,check_result,clean_result
x need Reg::Progress#deep_copy for use in repeat and subseq matchsets
  need MatchSet#clean_result which delegates to the internal Progress, if any
v rewrite repeat and subseq to use progress internally? (in progress only...)
v Reg::(and,repeat,subseq,array) require progress help

varieties of Reg::Replace:
  Reg::Backref and Reg::Bound
  Reg::RepProc
  Reg::ItemThat
  Reg::Fixed
  Object (as if Reg::Fixed)
  Reg::Array and Reg::Subseq?
  Array (as if Reg::Array)
  Reg::Transform?

not implemented yet:
  Reg::Anchor? (or more efficient unanchor?)
  Reg::Backref should be multiple if the items it backreferences to were multiple
  Reg::NumberSet

There are a few optimizations implemented currently, but none of them are
particularly significant. Some things will probably be quite slow. All of the
optimizations Friedl lists for regular expressions are pertinent to Regs as
well. Hopefully, someday, they will be implemented. For the record, they are:
  first item discrimination (special case of match cognizance)
  fixed-sequence check
  simple repetition (some is implemented)
  fixed qualifier reduction (??)
  length cognizance
  match cognizance
  need cognizance
  anchors (edge cognizance)

todo:
v move position_stack into Progress::Context
v move matchfail_todo into Progress::Context
v move matchset_stack into context
v all matchsets should reference a Progress
v all matchsets should reference a Context (except maybe SingleMatch_MatchSet?)
v MatchSet constructors must take a progress
  matchset#next_match's should use @progress/@context instead of passed in arr/start
v replace subprogress calls with newcontext/endcontext
v newcontext/endcontext needs to be used in other contexts too! (Reg::Array, Reg::Object, etc)
v need to backtrack in nested Reg::Array when backup_stacks is called (maybe indirectly)
  in a MatchSet's #next_match, should it affect the
    -@progress or the @context of that MatchSet?
  inspect all uses of position_stack and position_inc_stack for similar problems
  array_like/hash_like/object_like as aliases for +[]/+{}/-{}
  why isn't ArrayGraphPoint ever used? it should be.
  === sometimes can raise an exception! (eg: ("r".."s")===[])
    -make sure all calls to === are protected by appending 'rescue false' to them.
  vector Reg::Proc,Reg::ItemThat,Reg::Block,Reg::Variable,Reg::Constant
  convert mmatch_full to mmatch in another class (or module) in logicals, subseq, repeat, etc?
  performance
  variable binding
  variable tracking...
    keeping each value assigned to a variable during the match in an array
  compare string or file to Reg::Array (lexing)
  rename Reg::Multiple to Reg::Vector
v rename proceq to item_that (& change conventions... item_that will return a CallLater)
? implement Object#[](Reg::Reg) and Object#[]=(Reg::Reg, Reg::Replacement)
  in-place substitutions should not be performed when Reg::Reg#=== or Reg::Reg#match called
    -only when Array#sub or Array#[]=(Reg::Reg,Reg::Replacement)
  perhaps Reg::Reg#match! does substitutions...
  substitutions are applied to result of Array#[], but orig is not modified
v what about =~?
v research Txl and BURGs
  more/better docs
  expand user-level documentation
  document operator, three-letter, and long version of everything
  need an interface for returning a partial match if partial input given
  array matcher should match array-like things like enum or (especially) two-way enum (cursor!)
  How should Reg::Array match cursors?
  arguments (including backref's) in property matchers
  discontinuous number sets (and reg multipliers for them)
v? lookahead (including negated regmultiples)
  lookback
  laziness
  inspect (mostly implemented... but maybe needs another name)
  fix all the warnings
  document sep and splitter
  rdoc output on rubyforge
  other docs on rubyforge
v borrow jim weirich's deferred
  need interface to get all possible matches
  alias +@ to reg in ItemThatLike?
x reg-nature needs be infectious in ItemThat
v or have a reg_that constructor like item_that, which makes an item_that extended by reg?
v reg::hash must not descend from ::hash
  depth-mostly matches via ++[/pathkey/**/pathval/,...]
  need a way to constrain the types of matcher that are allowed
    -in a particular Reg::Array and (some of) its children
    -eg in lex mode, String|Regexp|Integer|:to_s[]|OB|OBS
    - in type mode, Class|Module|Reg::Knows|nil|true|false|OB|OBS
    - in depth-mostly mode, Reg::Pair|Symbol|Symbol::WithArgs|Integer|Reg::Reg
    - in ordered hash mode, Reg::Pair|Symbol|Reg::Reg|String
    - in ordered obj mode, Reg::Pair|Symbol|Symbol::WithArgs
    -Subseq, Not, And, Or, Xor, and the like are allowed in all modes if conforming to the same restrictions
  Pair and Knows::WithArgs need constraint parameterization this way too.
v what is the meaning of :meth[]? no parameters
  for parameterlessness, use +:meth
  all reg classes and matchers need to implement #==, #eql?, and #hash
    -defaults only check object ids, so for instance, currently: +[] != +[]
  Reg::Array should be renamed Reg::Sequence (or something...) it's not just for arrays anymore...
  when extending existing classes, check for func names already existing and chain to them
    -(or just abort if the wrong ones are defined.)
v conflict in Set#&,|,^,+,-
  allow expressions like this in hash and object matchers: +{:foo=>/bar/.-} to mean that the
    -value is optional, but if non-default, must match /bar/.
v potentially confusing name conflict: const vs Const (in regsugar.rb vs regdeferred.rb)
  sugar is too complicated. need to split into many small files in their own
    -directory, ala the nano gem. (makes set piracy easier too.)
  add methods to Module/Class to declare which methods are safe/dangerous
    -then allow only safe methods to be called via item_that/Reg::Object, etc.
  add lots more instrumentation
  remove weird eee stuff in regitem_that.rb
  need an object matcher that takes positional instead of named parameters...
    -more succinct, but slightly more limited than the current form.
  I need ArrayMatchSet (like SubseqMatchSet), Hash/ObjectMatchSet (like AndMatchSet)
    -each of these will have to keep track of how many other matchsets were pushed on
    -the stack while they were being matched.
  AndMatchSet still needs a lot of work.
  need vector analogs to the scalar matchers item_that and reg_that, called items_that and regs_that

infectious modules:
  Multiple infects every container except Array (not allowed in Hash,Object,RestrictHash,Case)
  Undoable infects every container (implies HasCmatch or HasBmatch)
  HasCmatch infects every Multiple container (& infects non-Multiple with HasBmatch)
  HasBmatch infects every container (unless HasCmatch also present)
  HasCmatch_And_Bound infects every container (& infects with HasCmatch too)

known bugs:
  no backreferences
  no substitutions
v vector & and ^ won't work
  explicit duck-typing (on mmatch) is used to distinguish regs and literals... should be is_a? Reg::Reg instead.
  0*Infinity should at least cause a warning
  some test cases are so slow as to be effectively unusable.

reg - the ruby extended grammar
Copyright (C) 2005, 2016 Caleb Clausen

This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA