Ruby's regex literal can take the options i, m, and x, which are documented. Besides those, it accepts a much wider variety of options. Here is an inventory of the options that seem to be allowed:

//e # => //
//i # => //i  ignore case
//m # => //m  multiline
//n # => //n
//o # => //
//s # => //
//u # => //
//x # => //x  extended
  • What do they do? Are some of them related to encoding? What about others?
  • If they indicate encoding, then what happens when more than one encoding is specified?
  • While other options raise an unknown regex options error, the ones listed here do not. If the answer to the previous question is that they do nothing, then why are these particular options allowed?
  • Why is n reflected in the inspection while the others are not? Do the ones whose inspection shows no difference actually differ?

If there is documentation, a link to it would be appreciated.

sawa
  • I can add this, but the others are unknown to me: `o` -> perform #{...} substitutions only once – guido Apr 26 '14 at 14:33
  • From [this reference](http://www.zenspider.com/Languages/Ruby/QuickRef.html#regexen), we could say that `neus` are dedicated to encoding: `none, EUC, UTF-8, SJIS, respectively`. Not sure what EUC and SJIS are... – HamZa Apr 26 '14 at 14:38
  • @HamZa What is the difference between `//n` (encoding: none) and `//`? – sawa Apr 26 '14 at 14:39
  • No idea, but guido's answer does cover that – HamZa Apr 26 '14 at 14:41
  • I'm not sure what makes you say that those are _undocumented_. Did you look at the [documentation](http://www.ruby-doc.org/core-2.1.1/Regexp.html) yet? Specifically, [this](http://www.ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Options) and [this](http://www.ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Encoding). – devnull Apr 26 '14 at 14:51

2 Answers


Regular-expression modifiers:

Regular expression literals may include an optional modifier to control various aspects of matching. The modifier is specified after the second slash character, as shown previously and may be represented by one of these characters:

Modifier    Description
i           Ignore case when matching text.
o           Perform #{} interpolations only once, the first time the regexp literal is evaluated.
x           Ignores whitespace and allows comments in regular expressions
m           Matches multiple lines, recognizing newlines as normal characters
u,e,s,n     Interpret the regexp as Unicode (UTF-8), EUC, SJIS, or ASCII. 
            If none of these modifiers is specified, the regular expression is 
            assumed to use the source encoding.
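
As a rough illustration of the non-encoding modifiers in the table, here is a minimal sketch. The helper `word_pattern` and all variable names are hypothetical, chosen only for this example; the behavior shown is what the descriptions above lead one to expect, not an authoritative specification.

```ruby
# i: ignore case when matching.
case_insensitive = /ruby/i.match?("RUBY")

# m: "." also matches a newline character.
dot_without_m = /a.b/.match?("a\nb")
dot_with_m    = /a.b/m.match?("a\nb")

# x: whitespace and comments inside the pattern are ignored.
version = /
  (\d+)   # major
  \.
  (\d+)   # minor
/x
major_minor = version.match("ruby 2.1")&.captures

# o: the #{} interpolation runs only the first time the literal is evaluated,
# so later calls reuse the pattern built from the first argument.
def word_pattern(word) # hypothetical helper
  /#{word}/o
end
first_source  = word_pattern("cat").source
second_source = word_pattern("dog").source

p case_insensitive, dot_without_m, dot_with_m, major_minor, first_source, second_source
```

Note that with `o`, the second call still yields a pattern built from "cat", which is easy to overlook when the interpolated value is expected to change.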

source

Note: the description above comes with a proviso. See sawa's answer for the details.

sawa
guido
  • To the second: no idea yet. To the fourth: going by this doc, `n` is for ASCII. Wild guess here: what is your local charset? – guido Apr 26 '14 at 14:50
  • From a file (without a magic comment) or from irb, `//.encoding` gives me `#<Encoding:US-ASCII>`. But I use Ruby 2.1, and `"".encoding` gives me `#<Encoding:UTF-8>`, which does not seem to match your description. – sawa Apr 26 '14 at 14:52
  • This information has been added to the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496) under "Modifiers". Since it's copied from another website, I've duplicated the information there, with thanks to this answer. – aliteralmind Apr 28 '14 at 00:09

I found some corrections and additions to guido's answer.

  • When no encoding is specified, the regular expression is assumed to use the source encoding (which is UTF-8 in Ruby 2.0 if there is no magic comment at the beginning of the file), unless the regex consists only of single-byte characters, in which case the regex is converted to US-ASCII.

  • When more than one encoding option is specified, then the last one takes effect.

    //eu.encoding # => UTF-8
    //ue.encoding # => EUC-JP
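
Both points can be checked with a short script. This is only a sketch: the variable names are made up for the example, and the pattern `abc` is an arbitrary ASCII-only pattern. It compares the string literal against `__ENCODING__` rather than a fixed name, since the source encoding depends on how the script is run.

```ruby
# An ASCII-only regexp literal with no encoding option reports US-ASCII,
# whereas a string literal reports the source encoding.
default_regexp_encoding = /abc/.encoding
string_literal_encoding = "abc".encoding
source_encoding         = __ENCODING__

# When several encoding options are given, the last one takes effect.
eu = /abc/eu.encoding
ue = /abc/ue.encoding

puts default_regexp_encoding, string_literal_encoding, source_encoding, eu, ue
```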
    
sawa