How does this python code convert string to camelCase with regex sub() and group()?

Question

I'm a total newbie, please go easy on me :) This is a solution I found online for a kata from Codewars:

import re
def to_camel_case(text):
    return re.sub('[_-](.)', lambda x: x.group(1).upper(), text)

I looked up about re.sub() and group(), but I still couldn't put it together. I'm not sure how [_-](.) works, how come [_-](w+) doesn't work?
How did he get rid of the hyphen and underscore with sub? Then,
successfully capitalize only the first char of each word except the first word?
I thought x.group(1).upper() would capitalize the entire word, how come group(1) is referring to the first char?

I'm not sure this is a valid question: https://stackoverflow.com/help/how-to-ask. The code is pretty advanced, in my opinion, and might not be a good start for learning Python, as you need to know about modules, functions, regular expressions, lambda notation, etc. It might be helpful to take it slowly and complete series of exercises such as [this one](https://regexone.com/). — sammy, Dec 07 '19 at 23:40
I'm not sure why this wouldn't be a proper question to post on Stack Overflow. It's not even a homework question. I've completed this kata independently with re.split(), but I'm still a newbie to regex. Yes, I've done my research and read the reference and documentation on regex, but I still didn't understand this solution. Sometimes it can be really difficult to be an independent learner whose first language wasn't English and who has a certain level of learning disability. That exercise is great, I will definitely try it out. — kuku, Dec 08 '19 at 00:30

James Mchugh · Answer 1 · 2019-12-08T01:06:27.130

So to understand this block of code, you have to understand a bit of regular expressions and a bit of the re Python module. Let's first look at what re.sub does. From the docs, the signature of the function looks like

re.sub(pattern, repl, string, count=0, flags=0)

Of importance here are the pattern, repl, and string parameters.

pattern is the regular expression to search for
repl is what you want to replace each match with; it can be a string or a function that takes a match object as an argument and returns the replacement string
string is the string you want the replacement to act on

The function is used to find portions of the string that match the regex pattern, and replace those portions with repl.
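As a small illustration of the two kinds of repl, here is a sketch with made-up inputs and patterns (the strings and variable names are only for demonstration, they are not part of the kata):

```python
import re

# repl as a plain string: every match is replaced by the same text.
spaced = re.sub('[_-]', ' ', 'one_two-three')

# repl as a function: it is called once per match with a match object,
# and whatever it returns replaces that match.
shouted = re.sub('[aeiou]', lambda m: m.group(0).upper(), 'one_two')

print(spaced)   # one two three
print(shouted)  # OnE_twO
```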

Now let's go into the regular expression used: [_-](.).

[_-] matches any of the characters within the square brackets (_ or -)
. matches any character (except a newline, by default)
(.) captures that single character in a capture group

Now let's put it all together. The full pattern matches two characters: the first is a _ or -, and the second can be any character (other than a newline). In effect, the following portions of these strings will be matched.

one_two (matches _t)
test_3 (matches _3)
nomatchhere- (no match)
thiswill_Match (matches _M)
NoMatchHereEither_ (no match)
need_more_creative_examples- (matches _m, _c, and _e)

The important part here is that the (.) portion of the regex matches any character and stores it in a capture group. This allows us to reference that matched character in the repl argument.
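To see the difference between the whole match and the captured character, the example strings above can be run through re.finditer. This is only an exploratory sketch; the names samples and found are mine:

```python
import re

samples = ['one_two', 'test_3', 'nomatchhere-', 'thiswill_Match',
           'NoMatchHereEither_', 'need_more_creative_examples-']

# For each match, group(0) is the whole two-character match and
# group(1) is only the character captured by (.).
found = {
    s: [(m.group(0), m.group(1)) for m in re.finditer('[_-](.)', s)]
    for s in samples
}

for s, pairs in found.items():
    print(s, pairs)
```

A string with an empty list has no match, which is the case whenever the underscore or hyphen is the last character.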

Let's get into what repl is doing here. In this case, repl is a lambda function.

lambda x: x.group(1).upper()

A lambda is really not too much different than a normal Python function. You define arguments before the colon, and then you define the return expression after the colon. The lambda above takes x as an argument, and it assumes that x is a match object. Match objects have a group method that allows you to reference the groups matched by the regex pattern (remember (.) from before?). It grabs the first group matched, and uppercases it using the str object's builtin upper method. It then returns that uppercased string, and that is what replaces the matched pattern.
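If the lambda is the confusing part, the same thing can be written with an ordinary named function. This is just an equivalent sketch; the helper name uppercase_captured is made up for illustration:

```python
import re

def uppercase_captured(match):
    # match.group(1) is the text captured by (.), i.e. the character
    # right after the underscore or hyphen.
    return match.group(1).upper()

def to_camel_case(text):
    # Passing the function itself (no parentheses) lets re.sub call it
    # once for every match.
    return re.sub('[_-](.)', uppercase_captured, text)

print(to_camel_case('the_stealth-warrior'))  # theStealthWarrior
```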

All together now:

import re
def to_camel_case(text):
    return re.sub('[_-](.)', lambda x: x.group(1).upper(), text)

The pattern is [_-](.) which matches any underscore or dash followed by any character. That character is captured and uppercased using the repl lambda function. The portion of string that matched that pattern is then replaced with that uppercased character.

In conclusion, I think the above answers most of your questions, but to summarize:

I looked up about re.sub() and group(), but I still couldn't put it together. I'm not sure how [_-](.) works, how come [_-](w+) doesn't work?

I will assume that you meant to use the \w character set, instead of just w. The \w character set matches all alphanumeric characters and underscores. This pattern would work if the + operator was not used. The + matches characters greedily, so it will cause all characters that belong to the \w set that follow an underscore or hyphen to be captured. This causes two issues: it will capitalize all captured characters (which could be a whole word) and it will capture underscores, causing later underscores to not be properly replaced.
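Both issues show up if you try the three patterns side by side. A quick sketch, assuming an input string of my own choosing:

```python
import re

def upper_group(m):
    # Uppercase whatever the first capture group holds.
    return m.group(1).upper()

text = 'need_more_creative-examples'

single_any = re.sub(r'[_-](.)', upper_group, text)     # original pattern
single_word = re.sub(r'[_-](\w)', upper_group, text)   # \w without +
greedy_word = re.sub(r'[_-](\w+)', upper_group, text)  # \w with +

print(single_any)   # needMoreCreativeExamples
print(single_word)  # needMoreCreativeExamples
print(greedy_word)  # needMORE_CREATIVEEXAMPLES
```

With \w+ the first match swallows more_creative in one go (the underscore counts as a \w character), so the inner underscore is never treated as a separator and everything captured is uppercased.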

How did he get rid of the hyphen and underscore with sub?

The function given to repl returns only the uppercased version of the first capture group. In the pattern [_-](.), only the character following the hyphen or underscore is captured, but the whole match (separator plus character) is what gets replaced. In effect, each match of [_-](.) is replaced with the uppercased character matched by (.). This is why the hyphen/underscore is removed.
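One way to convince yourself of this is to return the whole match instead of just the group and watch the separator come back. A sketch with an example string of my own:

```python
import re

text = 'one_two-three'

# Returning only group(1): the separator is part of the match but not part
# of the replacement, so it disappears.
dropped = re.sub('[_-](.)', lambda m: m.group(1).upper(), text)

# Returning group(0), the whole match: the separator is put back.
kept = re.sub('[_-](.)', lambda m: m.group(0).upper(), text)

print(dropped)  # oneTwoThree
print(kept)     # one_Two-Three
```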

Successfully capitalize only the first char of each word except the first word? I thought x.group(1).upper() would capitalize the entire word, how come group(1) is referring to the first char?

The capture group only matches the first character following the underscore or hyphen, so that is what is uppercased.

score 1 · Accepted Answer · answered Dec 08 '19 at 00:01

I'll try to walk through the solution in layman's terms.

So firstly, re.sub() searches for occurrences of the pattern specified '[_-](.)' which will match any substrings where a hyphen '-' or an underscore '_' is immediately before another character. The re.sub() function then runs these matches through the anonymous function (lambda function) individually.

Regex grouping in Python uses parentheses () to collect a sub-expression for later use in the program. The lambda function takes in the match object that re.sub() produces for each match of the pattern in text, and returns x.group(1).upper(). We can see from the regular expression that the grouped element is the single character that follows the hyphen or underscore, so its uppercased form is what is returned and substituted by the function.

Now, to answer your dotpoints:

Why doesn't [_-](\w+) work? This is because, when it finds a hyphen, it will select all of the alphanumeric characters (and underscores) that follow it, so it will capitalise the entirety of the next word.

How did he get rid of the hyphen and underscore with sub? This is easily answered. The re.sub() function replaces the entire match, not just the grouped element, and in the lambda, he only returns the grouped element as uppercase, not the hyphen as well.

Successfully capitalise only the first char of each word except the first word? When the regex pattern is searched for, it is looking for characters that immediately follow a hyphen or an underscore, and the first word does not have either of those characters before it. If you were to feed the function something like '-hello-there' it would yield: 'HelloThere'
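Here is that case next to a few others. The last two inputs are edge cases I picked myself, not ones from the kata:

```python
import re

def to_camel_case(text):
    return re.sub('[_-](.)', lambda x: x.group(1).upper(), text)

inputs = ['hello-there', '-hello-there', 'hello-', 'a--b']
results = {s: to_camel_case(s) for s in inputs}

for s, out in results.items():
    print(repr(s), '->', repr(out))
# 'hello-there' -> 'helloThere'
# '-hello-there' -> 'HelloThere'
# 'hello-' -> 'hello-'
# 'a--b' -> 'a-b'
```

A trailing separator is left alone because there is no character after it to match, and in 'a--b' the second hyphen is the character captured by (.), so one hyphen survives.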

I thought x.group(1).upper() would capitalize the entire word, how come group(1) is referring to the first char? This is down to the pattern: because the pattern is '[_-](.)' and not '[_-](.+)', the group only captures a single character.

I hope this has helped you in some way

That's a hard one to choose. I will accept yours as it answered exactly what I asked with layman's terms. — kuku, Dec 08 '19 at 00:33
