
Equivalently, how can I typespec for a "single" UTF8 char?

Within a type definition, I can have generic "any string" or "any utf8 string" with

@type tile :: String.t # matches any string
@type tile :: <<_::8>> # matches any single byte

but it seems I can't match for the first bit to be 0

@type tile :: <<0::1, _::7>>

The type for a single UTF-8 byte sequence would be

@type tile :: <<0::1, _::7>> | 
              <<6::3, _::5, 2::2, _::6>> | 
              <<14::4, _::4, 2::2, _::6, 2::2, _::6>> |
              <<30::5, _::3, 2::2, _::6, 2::2, _::6, 2::2, _::6>>

(these bit patterns match when using pattern matching, for instance

<<14::4, _::4, 2::2, _::6, 2::2, _::6>> = "○"

succeeds.)

But when used in typespecs, the compiler complains greatly with

== Compilation error in file lib/board.ex ==
** (ArgumentError) argument error
    (elixir) lib/kernel/typespec.ex:1000: Kernel.Typespec.typespec/3
    (elixir) lib/kernel/typespec.ex:1127: anonymous fn/4 in Kernel.Typespec.typespec/3
    (elixir) lib/enum.ex:1899: Enum."-reduce/3-lists^foldl/2-0-"/3
    (elixir) lib/kernel/typespec.ex:1127: Kernel.Typespec.typespec/3
    (elixir) lib/kernel/typespec.ex:828: anonymous fn/4 in Kernel.Typespec.typespec/3
    (elixir) lib/enum.ex:1899: Enum."-reduce/3-lists^foldl/2-0-"/3
    (elixir) lib/kernel/typespec.ex:828: Kernel.Typespec.typespec/3
    (elixir) lib/kernel/typespec.ex:470: Kernel.Typespec.translate_type/3

Is there any way to typespec to some bit pattern like this?

Onorio Catenacci
rewritten
  • I'm going to add the Erlang tag (and the dialyzer tag too) to this question because I think it's more of an issue with Dialyzer than it is something specific to Elixir. – Onorio Catenacci Jan 09 '18 at 18:25
  • I don't think it is possible. My best guess is to specify a range `0..127::8` but I don't think it will work. – José Valim Jan 10 '18 at 09:50
  • Given what I saw in the dialyzer docs, it seems that a `char()` type spec would be closest to what you want but that still allows 0..255 (rather than just the 0..127 range). – Onorio Catenacci Jan 10 '18 at 14:58
  • @OnorioCatenacci indeed. I really want to match a "single" UTF8 codepoint, which can vary from 8 to 32 bits with specific bit patterns, so char() won't do. – rewritten Jan 10 '18 at 15:05
  • Can you say more about the particular use case you need that type match for? Perhaps there's a way to make the type-checker "think" it's getting a UTF-8 by faking some process. – גלעד ברקן Jan 12 '18 at 12:05
  • It's a struct that accepts some optional one-char symbols for custom output. Say you prefer to use "x" and "o", or you prefer to use "☯︎" and "☀︎", or even "" and "‍". But it must be a single unicode character. – rewritten Jan 12 '18 at 18:16
  • Ok, so we still might need more context to determine why a simple validation function that can use the pattern-matching, which we know works, wouldn't suffice for your purposes. How would being able to typespec this one utf8 character help the type-checking you are performing? As I understand, the compiler for Erlang and Elixir does not observe type mismatches (other than compilation errors like the one you provided), so I assume the type-checking is for your own internal review. – גלעד ברקן Jan 12 '18 at 22:46
  • Right, it's basically documentation, but still it does make sense to typespec it. It's not like "I want to typespec a sequence of bits of size between 42 and 314 where the ratio of 0s is approximately 0.432". It's "One single unicode point in UTF8". – rewritten Jan 13 '18 at 15:41
  • In some sense, I can't validate because it's just a field of a struct. The only way I know to prevent wrong things to be set as struct fields is to typecheck. I am new to Elixir so I'm probably wrong. – rewritten Jan 13 '18 at 15:44
  • It does seem like a missing feature of bit typespecs, even if there is some good reason behind it. If it's basically documentation, maybe an additional comment would be enough. As for validation, you could always post a question about it, providing a little more context. – גלעד ברקן Jan 14 '18 at 16:12

1 Answer


You cannot write a typespec for a binary pattern; a typespec can only express that the value is a binary (and, at most, its size). Even if you could define such specs, I do not believe Dialyzer is sophisticated enough to find failures in such matches. You are left with implementing this behaviour at runtime using guards and pattern matches, like:

def unicode?(<<0::size(1), _::size(7)>>), do: true
def unicode?(<<6::3, _::5, 2::2, _::6>>), do: true
def unicode?(<<14::4, _::4, 2::2, _::6, 2::2, _::6>>), do: true
def unicode?(<<30::5, _::3, 2::2, _::6, 2::2, _::6, 2::2, _::6>>), do: true
def unicode?(str) when is_binary(str), do: false
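
The same runtime check can also be sketched with the built-in `utf8` segment type, which matches exactly one UTF-8 encoded code point in a binary pattern. The module name `Tile` and the function name `single_codepoint?/1` are made up for illustration and are not part of the question's code:

```elixir
defmodule Tile do
  # Hypothetical helper: true only for a binary that holds exactly one
  # UTF-8 encoded code point, regardless of how many bytes it takes.
  def single_codepoint?(<<_::utf8>>), do: true
  def single_codepoint?(_other), do: false
end

Tile.single_codepoint?("x")
Tile.single_codepoint?("○")
Tile.single_codepoint?("xo")
```

Unlike the hand-written bit patterns, the `utf8` segment also rejects overlong encodings such as `<<0xC0, 0x80>>`, which the two-byte pattern above would accept.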

Unfortunately, as far as I know there is no way to use bit patterns in guards. You can only extract whole bytes using binary_part/3; there is no equivalent function for bits. So the nearest you could get is something like this (untested, so it may not work or even compile, but it gives a general view of what is possible):

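# defguard accepts only plain variables as arguments and a single clause,
# so every byte is extracted with binary_part/3 and compared as a
# one-byte binary instead of being destructured in the head.
defguardp is_valid_utf_part(bin, pos)
          when binary_part(bin, pos, 1) >= <<0b10000000>> and
                 binary_part(bin, pos, 1) <= <<0b10111111>>

# True when the first byte of bin lies between the one-byte binaries min and max.
defguardp is_first_byte_in(bin, min, max)
          when binary_part(bin, 0, 1) >= min and binary_part(bin, 0, 1) <= max

defguard is_unicode(bin)
         when is_binary(bin) and
                ((byte_size(bin) == 1 and bin <= <<0b01111111>>) or
                   (byte_size(bin) == 2 and
                      is_first_byte_in(bin, <<0b11000000>>, <<0b11011111>>) and
                      is_valid_utf_part(bin, 1)) or
                   (byte_size(bin) == 3 and
                      is_first_byte_in(bin, <<0b11100000>>, <<0b11101111>>) and
                      is_valid_utf_part(bin, 1) and is_valid_utf_part(bin, 2)) or
                   (byte_size(bin) == 4 and
                      is_first_byte_in(bin, <<0b11110000>>, <<0b11110111>>) and
                      is_valid_utf_part(bin, 1) and is_valid_utf_part(bin, 2) and
                      is_valid_utf_part(bin, 3)))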
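
Since the typespec can only serve as documentation here, one way to combine it with a runtime check is a constructor function for the struct. This is only a sketch under assumptions: the `Board` module layout, the field names `x` and `o`, and the `new/1` function are invented for illustration, and the check uses the built-in `utf8` binary segment rather than the bit patterns above:

```elixir
defmodule Board do
  @typedoc "A single UTF-8 encoded code point such as \"x\" or \"○\" (not enforced by the type)."
  @type tile :: String.t()

  # Hypothetical fields: the symbols used for custom output.
  defstruct x: "x", o: "o"

  @type t :: %__MODULE__{x: tile, o: tile}

  # Builds a board, rejecting symbols that are not exactly one code point.
  @spec new(keyword) :: {:ok, t} | {:error, {:invalid_tile, term}}
  def new(opts \\ []) do
    x = Keyword.get(opts, :x, "x")
    o = Keyword.get(opts, :o, "o")

    cond do
      not tile?(x) -> {:error, {:invalid_tile, x}}
      not tile?(o) -> {:error, {:invalid_tile, o}}
      true -> {:ok, %__MODULE__{x: x, o: o}}
    end
  end

  # <<_::utf8>> matches a binary holding exactly one UTF-8 code point.
  defp tile?(<<_::utf8>>), do: true
  defp tile?(_other), do: false
end
```

Nothing stops callers from building `%Board{x: "xo"}` directly, so the check only helps if structs are created through `new/1`. A symbol followed by a variation selector is more than one code point and would be rejected here; comparing `String.length/1` to 1 would count graphemes instead.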
Hauleth