NSArray from NSCharacterSet

Question

Currently I can create an array of the alphabet like this:

[[NSArray alloc]initWithObjects:@"A",@"B",@"C",@"D",@"E",@"F",@"G",@"H",@"I",@"J",@"K",@"L",@"M",@"N",@"O",@"P",@"Q",@"R",@"S",@"T",@"U",@"V",@"W",@"X",@"Y",@"Z",nil];

I know that these characters are available from

[NSCharacterSet uppercaseLetterCharacterSet]

How can I make an array out of it?

Why you need this. Or just for fun? If you can tell why you need it in array then it would be good. — Anoop Vaidya, Apr 01 '13 at 10:29
The uppercaseLetterCharacterSet contains a lot more than just A...Z. — CommaToast, Aug 12 '16 at 18:17

Martin R · Accepted Answer · 2016-09-03T11:00:45.847

The following code creates an array containing all characters of a given character set. It also works for characters outside of the "basic multilingual plane" (characters > U+FFFF, e.g. U+10400 DESERET CAPITAL LETTER LONG I).

NSCharacterSet *charset = [NSCharacterSet uppercaseLetterCharacterSet];
NSMutableArray *array = [NSMutableArray array];
for (int plane = 0; plane <= 16; plane++) {
    if ([charset hasMemberInPlane:plane]) {
        UTF32Char c;
        for (c = plane << 16; c < (plane+1) << 16; c++) {
            if ([charset longCharacterIsMember:c]) {
                UTF32Char c1 = OSSwapHostToLittleInt32(c); // To make it byte-order safe
                NSString *s = [[NSString alloc] initWithBytes:&c1 length:4 encoding:NSUTF32LittleEndianStringEncoding];
                [array addObject:s];
            }
        }
    }
}

For the uppercaseLetterCharacterSet this gives an array of 1467 elements. But note that characters > U+FFFF are stored as a UTF-16 surrogate pair in NSString, so for example U+10400 is actually stored in NSString as 2 characters "\uD801\uDC00".
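The surrogate-pair storage can be observed from Swift as well. This is a small sketch using only the standard library; the variable name is illustrative.

```swift
// U+10400 DESERET CAPITAL LETTER LONG I lies outside the BMP.
let deseret = "\u{10400}"

print(deseret.unicodeScalars.count) // 1 Unicode scalar
print(deseret.utf16.count)          // 2 UTF-16 code units (a surrogate pair)

// The two code units are the high and low surrogates.
let units = deseret.utf16.map { String($0, radix: 16) }
print(units) // ["d801", "dc00"]
```

The UTF-16 count is what NSString reports as its length, which is why a single array element can have a length of 2.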

Swift 2 code can be found in other answers to this question. Here is a Swift 3 version, written as an extension method:

extension CharacterSet {
    func allCharacters() -> [Character] {
        var result: [Character] = []
        for plane: UInt8 in 0...16 where self.hasMember(inPlane: plane) {
            for unicode in UInt32(plane) << 16 ..< UInt32(plane + 1) << 16 {
                if let uniChar = UnicodeScalar(unicode), self.contains(uniChar) {
                    result.append(Character(uniChar))
                }
            }
        }
        return result
    }
}

Example:

let charset = CharacterSet.uppercaseLetters
let chars = charset.allCharacters()
print(chars.count) // 1521
print(chars) // ["A", "B", "C", ...]

(Note that some characters may not be present in the font used to display the result.)

Thanks @Martin R I was having trouble with contains (unicodeScalar) bit. this is awesome :) — the Reverend, Sep 03 '16 at 17:17

Cœur · Answer 2 · 2020-01-25T03:54:04.167

Inspired by Satachito's answer, here is a performant way to make an Array from CharacterSet using bitmapRepresentation:

extension CharacterSet {
    func characters() -> [Character] {
        // A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive.
        return codePoints().compactMap { UnicodeScalar($0) }.map { Character($0) }
    }

    func codePoints() -> [Int] {
        var result: [Int] = []
        var plane = 0
        // following documentation at https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
        for (i, w) in bitmapRepresentation.enumerated() {
            let k = i % 0x2001
            if k == 0x2000 {
                // plane index byte
                plane = Int(w) << 13
                continue
            }
            let base = (plane + k) << 3
            for j in 0 ..< 8 where w & 1 << j != 0 {
                result.append(base + j)
            }
        }
        return result
    }
}

Example for uppercaseLetters

let charset = CharacterSet.uppercaseLetters
let chars = charset.characters()
print(chars.count) // 1733
print(chars) // ["A", "B", "C", ...]

Example for discontinuous planes

let charset = CharacterSet(charactersIn: "\u{1D6A8}\u{CC791}")
let codePoints = charset.codePoints()
print(codePoints) // [120488, 837521]

Performance

Very good: built in release mode, this bitmapRepresentation-based solution seems to be 3 to 10 times faster than Martin R's solution with contains or Oliver Atkinson's solution with longCharacterIsMember.
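The bit layout that codePoints() walks can be checked for a single code point. The following is a minimal sketch, assuming the documented layout in which the first 8192 bytes of bitmapRepresentation cover the BMP. The name bmpBitmapContains is a hypothetical helper, not part of Foundation.

```swift
import Foundation

// Hypothetical helper: tests one BMP code point (<= U+FFFF) against the bitmap.
// Byte index is codePoint / 8, bit index is codePoint % 8.
func bmpBitmapContains(_ bitmap: Data, _ codePoint: UInt16) -> Bool {
    let byteIndex = Int(codePoint >> 3)
    let mask = UInt8(1) << UInt8(codePoint & 7)
    return byteIndex < bitmap.count && (bitmap[byteIndex] & mask) != 0
}

let bitmap = CharacterSet(charactersIn: "AZ").bitmapRepresentation
let hasA = bmpBitmapContains(bitmap, 0x41) // "A" is in the set
let hasB = bmpBitmapContains(bitmap, 0x42) // "B" is not
print(hasA, hasB)
```

Reading bytes this way avoids one method call per code point, which is presumably where most of the speedup comes from.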

score 10 · Answer 3 · edited Sep 02 '18 at 06:22


Since characters have a limited, finite (and not too wide) range, you can just test which characters are members of a given character set (brute force):

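// UNICHAR_MAX does not seem to be provided by the system headers, so define it here:
// the number of values a unichar can hold (0x10000 for a 16-bit unichar)
#define UNICHAR_MAX (1ull << (CHAR_BIT * sizeof(unichar)))

NSData *data = [[NSCharacterSet uppercaseLetterCharacterSet] bitmapRepresentation];
const uint8_t *ptr = [data bytes];
NSMutableArray *allCharsInSet = [NSMutableArray array];
// The bit test follows Apple's sample code. The counter must be wider than
// unichar: a 16-bit counter wraps around to 0 and the loop never terminates.
for (uint32_t i = 0; i < UNICHAR_MAX; i++) {
    if (ptr[i >> 3] & (1u << (i & 7))) {
        unichar c = (unichar)i;
        [allCharsInSet addObject:[NSString stringWithCharacters:&c length:1]];
    }
}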

Remark: Due to the size of a unichar and the structure of the additional segments in bitmapRepresentation, this solution only works for characters <= 0xFFFF and is not suitable for higher planes.

edited Sep 02 '18 at 06:22 · Cœur

answered Apr 01 '13 at 10:29

oooppppssssss. To understand this code, we need 50K+ reputations. People will get scared by this code. – Anoop Vaidya Apr 01 '13 at 10:33
@H2CO3, I thought i am just not knowing an existence of a method to call on NSCharacterSet or NSString to do this job with a one line statement. Looks like it is truly not exists. Good to see the possibility from your response. Thanks. – Saran Apr 01 '13 at 11:56

Remark: This works only for characters <= 0xFFFF. The `uppercaseLetterCharacterSet` contains 1467 characters, this method gives only the first 871 characters. – Martin R Apr 01 '13 at 12:02
@MartinR Right, at least as long as `unichar` is two ~~bytes~~ octets long (which it is on iOS and OS X). – Apr 01 '13 at 12:04
@H2CO3: `NSCharacterSet` works also with characters outside the BMP, even if `NSString` uses `unichar` internally. – Martin R Apr 01 '13 at 12:09
Personally I think that the OP doesn't even know what's he' asking because getting a list of all the `Lu` and `Lt` characters doesn't have a real use. – Sulthan Apr 02 '13 at 09:12
@Sulthan Yes, that's quite possible. But anyways, he got what he asked for :) Better be technically correct than make wrong assumptions. – Apr 02 '13 at 09:14
Can't you use `[charSet characterIsMember:]` to check if `unichar` is in the set? – Arc676 Dec 29 '14 at 05:29
Needs a lot of memory! My device runs out of memory! – Abdurrahman Mubeen Ali May 28 '15 at 07:21

felipou · Answer 4 · 2016-01-14T19:18:30.320

4

I created a Swift (v2.1) version of Martin R's algorithm:

let charset = NSCharacterSet.URLPathAllowedCharacterSet()

for plane : UInt8 in 0...16 {
    if charset.hasMemberInPlane( plane ) {
        // Walk every code point of this plane (Swift 2.1 C-style loop)
        for var c : UInt32 = UInt32( plane ) << 16; c < (UInt32(plane)+1) << 16; c++ {
            if charset.longCharacterIsMember(c) {
                var c1 = c.littleEndian // To make it byte-order safe
                let s = NSString(bytes: &c1, length: 4, encoding: NSUTF32LittleEndianStringEncoding)
                NSLog("Char: \(s)")
            }
        }
    }
}

edited Jan 14 '16 at 19:18

answered Nov 25 '15 at 13:24


`c1` is unlikely to work as `let` because of in-out `&`, should probably be `var` – Desmond Hume Jan 13 '16 at 18:13
You're right, I fixed it. But I was sure I had tested this before... Well, anyway, it's correct as of Swift 2.1.1, just tested it (`Apple Swift version 2.1.1 (swiftlang-700.1.101.15 clang-700.1.81)`) – felipou Jan 14 '16 at 19:20
Now that explains it! How could I not see that? Well, thanks for pointing it out @DesmondHume :) – felipou Jan 16 '16 at 13:37
@felipou: I apologize for the confusion. I wanted to add the (Swift equivalent of) OSSwapHostToLittleInt32 and then made some errors. Everything should be correct now. – Martin R Jan 18 '16 at 17:52
No problem @MartinR, I understand, it's much better this way. Thanks for the contribution :) – felipou Jan 19 '16 at 18:41
longCharacterIsMember appears to be gone for Swift 3 – David James Aug 23 '16 at 16:14
Nevermind, just use `(characterSet as NSCharacterSet).longCharacterIsMember(c)` in Swift 3 (Xcode 8 Beta 6) – David James Aug 23 '16 at 16:26

Oliver Atkinson · Answer 5 · 2017-03-20T13:01:01.387

Here is a version that uses more idiomatic Swift:

let characters = NSCharacterSet.uppercaseLetterCharacterSet()
var array      = [String]()

for plane: UInt8 in 0...16 where characters.hasMemberInPlane(plane) {

  for character: UTF32Char in UInt32(plane) << 16..<(UInt32(plane) + 1) << 16 where characters.longCharacterIsMember(character) {

    var endian = character.littleEndian
    let string = NSString(bytes: &endian, length: 4, encoding: NSUTF32LittleEndianStringEncoding) as! String

    array.append(string)

  }

}

print(array)

score 1 · Answer 6 · answered Mar 09 '18 at 13:33

I found Martin R's solution to be too slow for my purposes, so I solved it another way using CharacterSet's bitmapRepresentation property.

This is significantly faster according to my benchmarks:

var ranges = [CountableClosedRange<UInt32>]()
let bitmap: Data = characterSet.bitmapRepresentation
var first: UInt32?, last: UInt32?
var plane = 0, nextPlane = 8192
for (j, byte) in bitmap.enumerated() where byte != 0 {
    if j == nextPlane {
        plane += 1
        nextPlane += 8193
        continue
    }
    for i in 0 ..< 8 where byte & 1 << i != 0 {
        let codePoint = UInt32(j - plane) * 8 + UInt32(i)
        if let _last = last, codePoint == _last + 1 {
            last = codePoint
        } else {
            if let first = first, let last = last {
                ranges.append(first ... last)
            }
            first = codePoint
            last = codePoint
        }
    }
}
if let first = first, let last = last {
    ranges.append(first ... last)
}
return ranges

This solution returns an array of codePoint ranges, but you can easily adapt it to return individual characters or strings, etc.

Actually, there is a significant error in your algorithm: it will not support `CharacterSet(charactersIn: "")` because you do not read the value of the plane index byte (you wrongly assumed they were continous). See https://stackoverflow.com/a/52133647/1033581 for how I did it. — Cœur, Sep 02 '18 at 06:42
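The fix described in the comment above can be sketched as follows. This is an illustrative variant, not the answerer's code: it assumes the documented bitmapRepresentation layout (8192 bytes for the BMP, then one plane index byte followed by 8192 bytes for each further non-empty plane), and codePointRanges is a hypothetical name.

```swift
import Foundation

extension CharacterSet {
    /// Hypothetical helper: contiguous code point ranges contained in the set.
    func codePointRanges() -> [ClosedRange<UInt32>] {
        var ranges: [ClosedRange<UInt32>] = []
        var current: ClosedRange<UInt32>? = nil
        var planeBase: UInt32 = 0
        for (i, byte) in bitmapRepresentation.enumerated() {
            let k = i % 8193
            if k == 8192 {
                // Plane index byte: read it instead of assuming the next plane.
                planeBase = UInt32(byte) << 16
                continue
            }
            guard byte != 0 else { continue }
            for bit in 0 ..< 8 where (byte & (1 << bit)) != 0 {
                let codePoint = planeBase + UInt32(k) * 8 + UInt32(bit)
                if let r = current, codePoint == r.upperBound + 1 {
                    current = r.lowerBound ... codePoint   // extend the open range
                } else {
                    if let r = current { ranges.append(r) } // close the previous range
                    current = codePoint ... codePoint
                }
            }
        }
        if let r = current { ranges.append(r) }
        return ranges
    }
}

let sample = CharacterSet(charactersIn: "ABC\u{1D6A8}\u{CC791}").codePointRanges()
print(sample)
```

With the two supplementary characters from the "discontinuous planes" example, the last two ranges should be the single code points 120488 and 837521.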

score 0 · Answer 7 · answered Aug 12 '16 at 18:20

For just A-Z of the Latin alphabet (nothing with Greek, or diacritical marks, or other things that were not what the guy asked for):

let characters = NSCharacterSet.uppercaseLetterCharacterSet()
var array = [String]()

outer: for plane: UInt8 in 0...16 where characters.hasMemberInPlane(plane) {
    for character: UTF32Char in UInt32(plane) << 16..<(UInt32(plane) + 1) << 16 where characters.longCharacterIsMember(character) {
        var endian = character.littleEndian
        let string = NSString(bytes: &endian, length: 4, encoding: NSUTF32LittleEndianStringEncoding) as! String
        array.append(string)
        // The first 26 members of the set are A...Z, so stop there
        if array.count == 26 {
            break outer
        }
    }
}

If you know there is going to be 26 characters, then you're not working with an arbitrary character set, which means you can optimize it in speed and in length with just `return ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]` — Cœur, Sep 02 '18 at 10:20

score 0 · Answer 8 · answered Nov 02 '17 at 15:45

You should not; this is not the purpose of a character set. An NSCharacterSet is a possibly-infinite set of characters, possibly in not-yet-invented code points. All you want to know is "Is this character or collection of characters in this set?", and to that end it is useful.

Imagine this Swift code:

let asciiCodepoints = Unicode.Scalar(0x00)...Unicode.Scalar(0x7F)
let asciiCharacterSet = CharacterSet(charactersIn: asciiCodepoints)
let nonAsciiCharacterSet = asciiCharacterSet.inverted

Which is analogous to this Objective-C code:

NSRange asciiCodepoints = NSMakeRange(0x00, 0x80); // location 0, length 128: U+0000 through U+007F
NSCharacterSet * asciiCharacterSet = [NSCharacterSet characterSetWithRange:asciiCodepoints];
NSCharacterSet * nonAsciiCharacterSet = asciiCharacterSet.invertedSet;

It's easy to say "loop over all the characters in asciiCharacterSet"; that would just loop over all characters from U+0000 through U+007F. But what does it mean to loop over all the characters in nonAsciiCharacterSet? Do you start at U+0080? Who's to say there won't be negative codepoints in the future? Where do you end? Do you skip non-printable characters? What about extended grapheme clusters? Since it's a set (where order doesn't matter), can your code handle out-of-order codepoints in this loop?

These are questions you don't want to answer here; functionally nonAsciiCharacterSet is infinite, and all you want to use it for is to tell if any given character lies outside the set of ASCII characters.

The question you should really be asking yourself is: "What do I want to accomplish with this array of capital letters?" If (and likely only if) you really need to iterate over it in order, putting the ones you care about into an Array or String (perhaps one read in from a resource file) is probably the best way. If you want to check to see if a character is part of the set of uppercase letters, then you don't care about order or even how many characters are in the set, and should use CharacterSet.uppercaseLetters.contains(foo) (in Objective-C: [NSCharacterSet.uppercaseLetterCharacterSet characterIsMember:foo], or longCharacterIsMember: for code points above U+FFFF).
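As a small illustration of testing membership instead of enumerating the set, here is a sketch that picks the uppercase letters out of a string. The sample text and variable names are made up for the example.

```swift
import Foundation

let upper = CharacterSet.uppercaseLetters
// \u{F6} is ö (lowercase), \u{C4} is Ä (uppercase)
let text = "Hello W\u{F6}rld \u{C4}B"

// Ask the set about each scalar; no list of uppercase letters is ever built.
let capitals = text.unicodeScalars
    .filter { upper.contains($0) }
    .map { Character($0) }

print(String(capitals)) // HWÄB
```

Nothing here depends on how many characters the set contains or in which order they would be enumerated.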

Think, too, about non-Latin characters. CharacterSet.uppercaseLetters covers Unicode General Categories Lu and Lt, which contain A through Z and also things like ǅ and Խ. You don't want to have to think about this. You definitely don't want to issue an update to your app when the Unicode Consortium adds new characters to this list. If what you want to do is decide whether something is upper-case, don't bother hard-coding anything.

A CharacterSet, by its struct definition, is finite: it has at most 17 planes of 8192 endpoints. — Cœur, Sep 02 '18 at 10:24
@Cœur has it always been like that? Will it always be like that? What if an 18th plane is needed? Can you provide official documentation promising all this? — Ben Leggiero, Sep 04 '18 at 17:46

Paul B · Answer 9 · 2019-09-20T11:25:21.513

You can of course create sets of characters and alphabets using CharacterSet like this:

var smallEmojiCharacterSet = CharacterSet(charactersIn:  Unicode.Scalar("")...Unicode.Scalar(""))

The problem is that CharacterSet is NOT a Set (though it conforms to SetAlgebra); it is rather a Unicode character set. This makes it hard to get a sequence of all its characters and to convert it to an Array, Set or String. I have found a solution, but a better one exists. Actually, what you want is to stride from character to character, to have a range "a"..."z". It is not hard to do at the scalar level. At the Character level there are more caveats to consider.

extension Unicode.Scalar: Strideable {
    public typealias Stride = Int

    public func distance(to other: Unicode.Scalar) -> Int {
        return Int(other.value) - Int(self.value)
    }

    public func advanced(by n: Int) -> Unicode.Scalar {
        return Unicode.Scalar(UInt32(Int(value) + n))!
    }
}


let alphabetScalarRange = (Unicode.Scalar("a")...Unicode.Scalar("z"))// ClosedRange<Unicode.Scalar>

let alphabetCharactersArr = Array(alphabetScalarRange.map(Character.init)) // Array of Characters from range
let alphabetStringsArr = Array(alphabetScalarRange.map(String.init)) // Array of Strings from range
let alphabetString = alphabetStringsArr.joined() // String (collection of characters) from range
// or simply
let uppercasedAlphabetString =  (("A" as Unicode.Scalar)..."Z").reduce("") { (r, us) -> String in
    r + String(us)
}

If you think making an extension is overkill:

let alphabetScalarValueRange = (Unicode.Scalar("a").value...Unicode.Scalar("z").value)
let alphabetStringsArr2 = Array(alphabetScalarValueRange.compactMap{ Unicode.Scalar($0)?.escaped(asASCII: false) })
let alphabetString2 = alphabetScalarValueRange.compactMap({ Unicode.Scalar($0)?.escaped(asASCII: false) }).joined(separator: ", ")

But be careful: Characters can consist of several scalars.
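A short sketch of that caveat, using only the standard library: the same visible character can be one scalar or two, so a scalar range does not enumerate every possible Character.

```swift
let composed: Character = "\u{E9}"     // é as a single precomposed scalar
let decomposed: Character = "e\u{301}" // e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)          // true: Characters compare by canonical equivalence
print(composed.unicodeScalars.count)   // 1
print(decomposed.unicodeScalars.count) // 2
```

Mapping a scalar range through Character.init therefore only yields single-scalar characters.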
