
Assuming I have a dataframe called df and regex as follows:

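import scala.util.matching.Regex

var df2 = df
val regex = new Regex("_(.)")
for (col <- df.columns) {
  // Replace each "_x" in the column name with "X", e.g. "user_id" -> "userId"
  df2 = df2.withColumnRenamed(col, regex.replaceAllIn(col, { M => M.group(1).toUpperCase }))
}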

I know that this code renames the columns of df2, so that a column called "user_id" becomes "userId".

I understand what the withColumnRenamed and replaceAllIn functions do. What I do not understand is this part: { M => M.group(1).toUpperCase }

What is M? What is group(1)?

I can guess what is happening because I know that the expected output is userId, but I do not fully understand how it happens.

Could someone help me understand this? Would really appreciate it.

Thanks!

laughedelic
activelearner

2 Answers


The signature of the replaceAllIn method is

replaceAllIn(target: CharSequence, replacer: (Match) ⇒ String): String

So M is a Match, and it has a group method, which returns:

The matched string in group i, or null if nothing was matched

A group in a regex is whatever is matched by the (sub)regex in parentheses: here that is (.), i.e. one character in your case. You can have several capturing groups, and you can name them or refer to them by index. You can read more about capturing groups here and in the Scala API docs for Regex.

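So { M => M.group(1).toUpperCase } means that every match (an underscore plus the character after it) is replaced with that character converted to upper case. The underscore itself is part of the match but not part of group 1, so it disappears.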
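To make the role of M more visible, here is a minimal sketch that gives the anonymous function a name and an explicit parameter type. The names CamelCaseDemo, upperCaseGroup1 and toCamelCase are made up for illustration and are not part of the question's code or of the Scala API; it also assumes plain column names without "$" or "\" characters, which replaceAllIn treats specially in the replacement string.

```scala
import scala.util.matching.Regex
import scala.util.matching.Regex.Match

object CamelCaseDemo {
  // An underscore followed by any single character;
  // the parentheses capture that character as group 1.
  val regex: Regex = new Regex("_(.)")

  // Same logic as { M => M.group(1).toUpperCase }, written as a named
  // method so the Match parameter type is explicit (hypothetical helper).
  def upperCaseGroup1(m: Match): String = m.group(1).toUpperCase

  // replaceAllIn calls the function once per match and substitutes
  // the returned string for the whole match (underscore included).
  def toCamelCase(name: String): String =
    regex.replaceAllIn(name, (m: Match) => upperCaseGroup1(m))

  def main(args: Array[String]): Unit = {
    println(toCamelCase("user_id"))        // userId
    println(toCamelCase("first_name_raw")) // firstNameRaw
  }
}
```

For "user_id" the only match is "_i", group 1 is "i", and the function returns "I", which replaces the whole "_i".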

laughedelic

M just stands for the match, and group(1) refers to the first group captured by the regex. Consider this example:

World Cup

If you want to match the example above with a regex, you could write something like \w+\s\w+. However, you can also make use of groups and write it this way:

(\w+)\s(\w+)

Parentheses in a regex are used to indicate groups. In the example above, the first (\w+) is group 1, which matches World. The second (\w+) is group 2, which matches Cup. If you want the whole matched text, use group 0, which always refers to the entire match.
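The same group numbering can be checked from Scala. This is a small sketch under the assumption that you only care about the first match in the input; GroupDemo and groups are hypothetical names chosen for this example.

```scala
import scala.util.matching.Regex

object GroupDemo {
  // Two capturing groups separated by a single whitespace character.
  val pattern: Regex = """(\w+)\s(\w+)""".r

  // Returns (group 0, group 1, group 2) for the first match, if any.
  def groups(input: String): Option[(String, String, String)] =
    pattern.findFirstMatchIn(input).map(m => (m.group(0), m.group(1), m.group(2)))

  def main(args: Array[String]): Unit = {
    println(groups("World Cup")) // Some((World Cup,World,Cup))
  }
}
```

Group 0 is the full matched text, while groups 1 and 2 are the two parenthesized parts, counted by their opening parenthesis from left to right.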

See the groups in action here on the right side: https://regex101.com/r/v0Ybsv/1

Ibrahim