I want to create a function that will allow me to convert CamelCase to Title Case. This seems like a good task for regular expressions, but I am not committed to using regular expressions, if you have a better solution.

Here is my first attempt that works in most cases, but there are some issues I will get to in a few lines:

private static Regex camelSplitRegex = new Regex(@"(\S)([A-Z])");
private static String camelReplacement = "$1 $2";

public String SplitCamel(String text){
    return camelSplitRegex.Replace(text, camelReplacement);
}

The regex pattern looks for a non-whitespace character (1st capture) followed by a capital letter (2nd capture). In the function, Regex.Replace is used to insert a space between the 1st and 2nd captures.

This works fine for many examples:

  • SplitCamel("privateField") returns "private Field"
  • SplitCamel("PublicMethod") returns "Public Method"
  • SplitCamel(" LeadingSpace") returns " Leading Space" without inserting an extra space before "Leading", as desired.

The problem I have is when dealing with multiple consecutive capital letters.

  • SplitCamel("NASA") returns "N AS A" not "N A S A"
  • SplitCamel("C3PO") returns "C3 PO" not "C3 P O"
  • SplitCamel("CAPS LOCK FEVER") returns "C AP S L OC K F EV E R" not "C A P S L O C K F E V E R"

In these cases, I believe the issue is that each capital letter is only being captured as either \S or [A-Z], but cannot be \S on one match and [A-Z] on the next match.


My main question is, "Does the .NET regex engine have some way of supporting the same substring being used as different captures on consecutive matches?" Secondarily, is there a better way of splitting camel case?

Blorgbeard
JamesFaix
  • To clarify, you definitely DO want consecutive capitalized letters to still be split (i.e., "NASA" should go to "N A S A") or is there a preference to keep a block of capitalized letters as a block? – CaseyR Mar 07 '16 at 19:15
  • The results should never contain two touching capital letters. – JamesFaix Mar 07 '16 at 19:42

3 Answers

4
private static Regex camelSplitRegex = new Regex(@"(?<=\w)(?=[A-Z])");
private static String camelReplacement = " ";

does the job.

The problem with your pattern is that, given the string "ABCD", (\S) matches A and ([A-Z]) matches B, so the first replacement gives you "A BCD". For the next replacement, B has already been consumed by the pattern and can't be used again.
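To make the consumption visible, here is a minimal sketch that lists where the original pattern matches. The class and method names (ConsumedMatchDemo, DescribeMatches) are made up for illustration; only Regex.Matches, Match.Index and Match.Value are real .NET API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the question's code.
public static class ConsumedMatchDemo
{
    // The pattern from the question: both groups consume one character each.
    private static readonly Regex Pattern = new Regex(@"(\S)([A-Z])");

    // Returns "index:text" for every (non-overlapping) match in the input.
    public static List<string> DescribeMatches(string text)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            result.Add(m.Index + ":" + m.Value);
        }
        return result;
    }
}
```

For "NASA" this should report "0:NA" and "2:SA". The pair "AS" starting at index 1 is never tried, because the A was already consumed by the first match.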

The fix is to use lookarounds (a lookbehind (?<=...) and a lookahead (?=...)), which don't consume characters; they only test the current position in the string. That's why you don't need any group reference in the replacement string: you only need to insert a space at the current position.
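As a rough sketch of what "testing a position" means, the snippet below collects the zero-width match positions and then performs the replacement. LookaroundDemo and BoundaryPositions are illustrative names, not anything from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the lookaround pattern from this answer.
public static class LookaroundDemo
{
    // Matches the empty position between a word character and a capital letter.
    private static readonly Regex Boundary = new Regex(@"(?<=\w)(?=[A-Z])");

    // Every match has length 0, so only its index is interesting.
    public static List<int> BoundaryPositions(string text)
    {
        var positions = new List<int>();
        foreach (Match m in Boundary.Matches(text))
        {
            positions.Add(m.Index);
        }
        return positions;
    }

    // Inserts a space at each boundary position.
    public static string SplitCamel(string text)
    {
        return Boundary.Replace(text, " ");
    }
}
```

For "NASA" the boundary positions should be 1, 2 and 3, and SplitCamel("NASA") should give "N A S A", since no character is used up between matches.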

The \w character class contains Unicode letters, Unicode digits and the underscore. If you want to restrict the search to ASCII digits and letters, use [0-9a-zA-Z] instead.

To be more precise:

  • for Unicode, use (?<=[\p{L}\p{N}])(?=\p{Lu}), which works with accented letters, other alphabets, and their digits.
  • for ASCII, use (?<=[a-zA-Z0-9])(?=[A-Z])
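To see the difference between the two variants, here is a small sketch that applies both to the same input. The class and method names are placeholders, and the sample word is my own example (written with \u escapes so the source file encoding does not matter).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical comparison of the two patterns listed above.
public static class CamelSplitVariants
{
    private static readonly Regex UnicodeBoundary =
        new Regex(@"(?<=[\p{L}\p{N}])(?=\p{Lu})");

    private static readonly Regex AsciiBoundary =
        new Regex(@"(?<=[a-zA-Z0-9])(?=[A-Z])");

    // Splits before any Unicode uppercase letter that follows a letter or digit.
    public static string SplitUnicode(string text)
    {
        return UnicodeBoundary.Replace(text, " ");
    }

    // Splits only before ASCII capitals that follow an ASCII letter or digit.
    public static string SplitAscii(string text)
    {
        return AsciiBoundary.Replace(text, " ");
    }
}
```

With the input "\u00E9cole\u00C9l\u00E9mentaire" (écoleÉlémentaire), SplitUnicode should insert a space before the É, while SplitAscii should leave the string unchanged because É is not in [A-Z].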
Casimir et Hippolyte
2

Here's a non-regular expression way to do that.

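// Requires "using System.Text;" and, as an extension method, must live in a static class.
public static string SplitCamel(this string stuff)
{
    var builder = new StringBuilder();
    char? prev = null; // previous character; null at the start of the string
    foreach (char c in stuff)
    {
        // Insert a space before an ASCII capital unless it starts the string or follows whitespace.
        if (prev.HasValue && !char.IsWhiteSpace(prev.Value) && 'A' <= c && c <= 'Z')
            builder.Append(' ');
        builder.Append(c);
        prev = c;
    }

    return builder.ToString();
}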

The following

Console.WriteLine("'{0}'", "privateField".SplitCamel());
Console.WriteLine("'{0}'", "PublicMethod".SplitCamel());
Console.WriteLine("'{0}'", " LeadingSpace".SplitCamel());
Console.WriteLine("'{0}'", "NASA".SplitCamel());
Console.WriteLine("'{0}'", "C3PO".SplitCamel());
Console.WriteLine("'{0}'", "CAPS LOCK FEVER".SplitCamel());

Prints

'private Field'

'Public Method'

' Leading Space'

'N A S A'

'C3 P O'

'C A P S L O C K F E V E R'

juharr
0

Please consider switching to the value type string instead of the String class. Update to this:

 private static Regex camelSplitRegex = new Regex(@"(^\S)?([A-Z])");
Kentonbmax
  • I'm not sure what your comment means exactly, and I don't think your regex pattern will work in this instance. The preceding non-whitespace character should not be optional, that would allow matches that were preceded by a capital letter. – JamesFaix Mar 08 '16 at 13:06
  • Regarding string http://stackoverflow.com/questions/7074/whats-the-difference-between-string-and-string . Please test it against your examples or provide one that is not working. Please prove it does not work with a concrete example. – Kentonbmax Mar 08 '16 at 13:54
  • While I know that for some reason it is against Microsoft style guidelines, I typically use the CLR primitive type names "String" or "Int32" over "string" or "int" so that 1) all my type names are the same color in source code, 2) the type names used in my variable declarations match the type names that must be used to call static methods of these types ("Int32.Parse(text)"), 3) I can use the same primitive type names when dealing with multiple .NET languages, and 4) there is no confusion as to what size an "int" or "long" is, which varies between non-.NET languages. – JamesFaix Mar 10 '16 at 18:31
  • I cannot think of any advantage of using the C# primitive type keywords, aside from convention and a few areas where the C# keywords are required like Enum type declarations before VS2015 ("enum MyEnum : byte {}"). – JamesFaix Mar 10 '16 at 18:34
  • I actually got the idea from the book "CLR via C#" by Jeffrey Richter. He has a few other reasons for preferring the CLR type names. Check out the beginning of Chapter 5, which is on pg117 of this PDF. http://sd.blackball.lv/library/CLR_via_CSharp_%28Jeffrey_Richter_4th_Edition%29.pdf – JamesFaix Mar 10 '16 at 18:46
  • The advantages are consistency, reliability, and performance. It is best practice to use value types when manipulating values and reference types when invoking functions. Value types are not nullable by default, which prevents bugs. Ultimately it's important to understand the differences and prevent other potential maintainers of your code from introducing new issues. There is also better performance when using value types over reference types. Consider a simple console app that checks the time when converting an int32 to long by type casting vs int. – Kentonbmax Mar 11 '16 at 13:17
  • "String" and "string" compile to identical assembly code and are always reference types. "Int32" and "int" compile identically and are always value types, but can get boxed in various situations. Using the CLR type name will not box them. The aliases "int" and "string" are just syntactic sugar so you don't have to include "using System;" in every file. That's what that thread you posted is all about actually. – JamesFaix Mar 12 '16 at 01:25
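For anyone following this comment thread, the alias claim can be checked with a few typeof comparisons. This is only a sketch; AliasDemo and its method names are invented for illustration.

```csharp
using System;

// Hypothetical checks for the keyword/CLR-name equivalence discussed above.
public static class AliasDemo
{
    // "string" is the C# keyword alias for System.String.
    public static bool StringAliasIsSameType()
    {
        return typeof(string) == typeof(String);
    }

    // System.String is a class, so this reports false under either spelling.
    public static bool StringIsValueType()
    {
        return typeof(string).IsValueType;
    }

    // "int" is the C# keyword alias for System.Int32.
    public static bool IntAliasIsSameType()
    {
        return typeof(int) == typeof(Int32);
    }

    // System.Int32 is a struct, so this reports true under either spelling.
    public static bool IntIsValueType()
    {
        return typeof(int).IsValueType;
    }
}
```

The two "same type" checks should return true, StringIsValueType should return false, and IntIsValueType should return true, so the spelling chosen does not change whether a type is a value or reference type.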