I need to convert certain words to lower case. I am working with a list of movie titles, where prepositions and articles are normally lower case if they are not the first word in the title. If I have the vector:

movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')

What I need is this:

movies_updated = c('The Kings of Summer', 'The Words', 'Out of the Furnace', 'Me and Earl and the Dying Girl')

Is there an elegant way to do this without using a long series of gsub(), as in:

movies_updated = gsub(' In ', ' in ', movies)
movies_updated = gsub(' In', ' in', movies_updated)
movies_updated = gsub(' Of ', ' of ', movies_updated)
movies_updated = gsub(' Of', ' of', movies_updated)
movies_updated = gsub(' The ', ' the ', movies_updated)
movies_updated = gsub(' The', ' the', movies_updated)

And so on.

tsouchlarakis

3 Answers


In effect, it appears that you are interested in converting your text to title case. The stringi package offers stri_trans_totitle, but note that it capitalizes the first letter of every word, articles and prepositions included:

> stringi::stri_trans_totitle(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings Of Summer" "The Words"           "Out Of The Furnace"

An alternative is the toTitleCase function from the tools package, which keeps articles and prepositions in lower case:

> tools::toTitleCase(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings of Summer" "The Words"           "Out of the Furnace" 
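The calls above start from titles that are already in the desired form, so it may be more telling to run toTitleCase on the original vector from the question. A minimal sketch; the variable name `titled` is introduced here purely for illustration:

```r
movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')

# toTitleCase() lives in the tools package, which ships with R,
# so nothing extra needs to be installed.
titled <- tools::toTitleCase(movies)
titled
```

toTitleCase applies several other rules as well (see its help page), so it is worth checking the result against your own titles.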
Konrad
  • Interestingly enough, the answer from `tools` worked for me, but `stringi` did not. When I paste `stringi::stri_trans_totitle(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))` into the command prompt it yields `[1] "The Kings Of Summer" "The Words" [3] "Out Of The Furnace" `. – tsouchlarakis Apr 03 '17 at 07:14
  • @andoni34: Because what ICU based [`totitle`](https://cran.r-project.org/web/packages/stringi/stringi.pdf) does is *capitalizing the first letter of each word or sentence*. The `toTitleCase` from `tools` package is a kind of a "black box", see [its description here](https://stat.ethz.ch/R-manual/R-devel/library/tools/html/toTitleCase.html). – Wiktor Stribiżew Apr 03 '17 at 08:03

Though I like @Konrad's answer for its succinctness, I'll offer an alternative that is more literal and manual.

movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
           'Me And Earl And The Dying Girl')

gr <- gregexpr("(?<!^)\\b(of|in|the)\\b", movies, ignore.case = TRUE, perl = TRUE)
mat <- regmatches(movies, gr)
regmatches(movies, gr) <- lapply(mat, tolower)
movies
# [1] "The Kings of Summer"            "The Words"                     
# [3] "Out of the Furnace"             "Me And Earl And the Dying Girl"

The tricks of the regular expression:

  • `(?<!^)` ensures we don't match a word at the beginning of a string. Without this, the first `The` of movies 1 and 2 will be down-cased.
  • `\\b` sets up word-boundaries, such that the `in` in the middle of `Dying` will not match. This is slightly more robust than your use of space, since hyphens, commas, etc, will not be spaces but do indicate the beginning/end of a word.
  • `(of|in|the)` matches any one of `of`, `in`, or `the`. More patterns can be added with separating pipes `|`.

Once identified, it's as simple as replacing them with down-cased versions.
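To make the word list easier to extend, the same steps can be wrapped in a small helper. This is only a sketch: the function name `lower_small_words` and its `words` argument are made up for illustration and do not come from any package.

```r
# Hypothetical helper: lower-case the given words unless they start the title.
lower_small_words <- function(titles, words = c("of", "in", "the", "and")) {
  # Build "(?<!^)\\b(of|in|the|and)\\b" from the word list
  pattern <- paste0("(?<!^)\\b(", paste(words, collapse = "|"), ")\\b")
  m <- gregexpr(pattern, titles, ignore.case = TRUE, perl = TRUE)
  # Replace every match with its lower-cased version
  regmatches(titles, m) <- lapply(regmatches(titles, m), tolower)
  titles
}

movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')
result <- lower_small_words(movies)
result
# [1] "The Kings of Summer"            "The Words"
# [3] "Out of the Furnace"             "Me and Earl and the Dying Girl"
```

Passing a different `words` vector changes which words are down-cased without touching the regex by hand.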

r2evans
  • Beautiful, I like this answer for using base R and allowing flexibility with choosing which words to be lower case'd. This could work in a variety of different situations. Thank you very much. – tsouchlarakis Apr 03 '17 at 07:15

Another example of how to turn certain words to lower case with gsub (with a PCRE regex):

movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
gsub("(?!^)\\b(Of|In|The)\\b", "\\L\\1", movies, perl=TRUE)

Details:

  • (?!^) - not at the start of the string (it does not matter if we use a lookahead or lookbehind here since the pattern inside is a zero-width assertion)
  • \\b - find leading word boundary
  • (Of|In|The) - capture Of or In or The into Group 1
  • \\b - assure there is a trailing word boundary.

The replacement contains the lowercasing operator \L that turns all the chars in the first backreference value (the text captured into Group 1) to lower case.
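As a quick illustration of the case-conversion operators, here is a small sketch using the `\\L`, `\\U` and `\\E` escapes that R's `gsub` documents for `perl = TRUE`. The variable names and the two extra input strings are invented for the example.

```r
movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')

# \\L lower-cases everything that follows it in the replacement
lowered <- gsub("(?!^)\\b(Of|In|The)\\b", "\\L\\1", movies, perl = TRUE)
lowered
# [1] "The Kings of Summer"            "The Words"
# [3] "Out of the Furnace"             "Me And Earl And the Dying Girl"

# \\U is the upper-case counterpart
shouted <- gsub("\\b(summer)\\b", "\\U\\1", "The Kings of Summer",
                ignore.case = TRUE, perl = TRUE)
shouted
# [1] "The Kings of SUMMER"

# \\E ends the conversion, so only the first group is upper-cased here
partial <- gsub("^(\\w+) (\\w+)", "\\U\\1\\E \\2", "out of the furnace", perl = TRUE)
partial
# [1] "OUT of the furnace"
```

Note that `And` stays capitalized in the first result because it is not among the alternatives in the pattern.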

Note that this can turn out to be a more flexible approach than using tools::toTitleCase. The part of the toTitleCase source code that keeps specific words in lower case is:

## These should be lower case except at the beginning (and after :)
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"

If you only need to apply lowercasing and do not care about the other logic in the function, it might be enough to add these alternatives (do not use ^ and $ anchors) to the regex at the top of the post.
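A sketch of that idea, under a few assumptions: the names `small_words` and `pattern` are made up here, only the plain-word alternatives from `lpat` are used (the `v.` and `vs.` abbreviations are left out for simplicity), and `ignore.case = TRUE` is added so that the lower-case alternatives also match capitalized words.

```r
movies <- c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
            'Me And Earl And The Dying Girl')

# Plain-word alternatives from lpat, without the ^ and $ anchors
small_words <- paste0("a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|",
                      "of|on|or|per|so|the|to|via|from|into|than|that|with")
pattern <- paste0("(?!^)\\b(", small_words, ")\\b")

# ignore.case = TRUE lets 'of' match 'Of', 'and' match 'And', and so on
movies_updated <- gsub(pattern, "\\L\\1", movies, ignore.case = TRUE, perl = TRUE)
movies_updated
# [1] "The Kings of Summer"            "The Words"
# [3] "Out of the Furnace"             "Me and Earl and the Dying Girl"
```

Unlike toTitleCase, this does not keep a word capitalized after a colon, so titles with subtitles may need extra handling.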

Wiktor Stribiżew
  • I knew it existed but was trying with lower-case `\\l`, thanks for straightening me out. This is what my answer *should* have been if I had spent more time on the perl-specific case-change step. – r2evans Apr 03 '17 at 07:30
  • Yeah, `\l` only turns the *first* char of the replacement value (that stands immediately next to it) to lower case. – Wiktor Stribiżew Apr 03 '17 at 07:30
  • That explains soooo much :-) – r2evans Apr 03 '17 at 07:31