Imagine I have the following tibble df:

id    doc                    doc_word_count
-------------------------------------------
1     Lorem ipsum dolor...   1439
2     Lorem ipsum dolor...   10234 
3     Lorem ipsum dolor...   2000 
4     Lorem ipsum dolor...   15034 
5     Lorem ipsum dolor...   11000

where doc_word_count is the number of words in doc. I would like to split the doc column into chunks of 500 words per row (the number 500 is arbitrary). The new tibble df_split should look something like this:

id    doc                    doc_word_count
-------------------------------------------
1     Lorem ipsum dolor...   500
1     labore et dolore...    500
1     totam rem aperiam...   439
2     ...                    500
...   ...                    500
...   ...                    ...

If fewer than 500 words remain for the last chunk, it should simply hold however many words are left. I have looked at str_split and this StackOverflow post, but neither seems relevant here because I am not splitting on a pattern or a fixed character width.

1 Answer

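You can use tidytext::unnest_tokens(), which extracts the words from a string and reshapes the data frame to one word per row. From there, integer division with the %/% operator on the row number within each id assigns every word to a chunk, and the words in each chunk can be pasted back into a single string. Note that unnest_tokens() lowercases the text and strips punctuation by default, which is why the recombined doc values in the output differ from the input in case and punctuation.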
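To see why integer division produces the chunk groupings, here is a small base R sketch. It uses a chunk size of 3 and 7 items purely for illustration; these numbers are assumptions, not taken from the data above.

```r
# Illustrative values only: 7 "words" split into chunks of 3
chunk_size <- 3
idx <- seq_len(7)

# Subtracting 1 makes the index 0-based, so items 1-3 land in group 0,
# items 4-6 in group 1, and the leftover item 7 in group 2
grp <- (idx - 1) %/% chunk_size
grp
#> [1] 0 0 0 1 1 1 2

# Size of each group: the last chunk simply holds the remainder
chunk_sizes <- as.vector(table(grp))
chunk_sizes
#> [1] 3 3 1
```

Without the `- 1`, the first group would contain only `chunk_size - 1` items, because item 3 would already fall into group 1.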

suppressPackageStartupMessages({
  library(dplyr)
  library(tidytext)
  library(stringi)
  library(stringr)
})

df <- tibble::tribble(~'id', ~'doc', ~'doc_word_count',
                      1, stringr::word(paste0(stringi::stri_rand_lipsum(1000), collapse = ' '), start = 1, end = 1439), 1439,
                      2, stringr::word(paste0(stringi::stri_rand_lipsum(1000), collapse = ' '), start = 1, end = 10234), 10234,
                      3, stringr::word(paste0(stringi::stri_rand_lipsum(1000), collapse = ' '), start = 1, end = 2000), 2000)

head(df)
#> # A tibble: 3 x 3
#>      id doc                                                       doc_word_count
#>   <dbl> <chr>                                                              <dbl>
#> 1     1 Lorem ipsum dolor sit amet, litora sollicitudin enim eu.~           1439
#> 2     2 Lorem ipsum dolor sit amet, sed viverra amet velit ut ve~          10234
#> 3     3 Lorem ipsum dolor sit amet, auctor convallis tristique v~           2000

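df_split <- df %>% 
  # one row per word (lowercased, punctuation stripped by default)
  tidytext::unnest_tokens(word, doc) %>% 
  dplyr::group_by(id) %>% 
  # 0-based chunk index within each id: words 1-500 -> 0, 501-1000 -> 1, ...
  dplyr::mutate(new_grp = (row_number() - 1) %/% 500) %>% 
  dplyr::group_by(id, new_grp) %>% 
  # collapse each chunk back into one string and count its words
  dplyr::summarize(doc_word_count = n(),
                   doc = paste0(word, collapse = ' ')) %>% 
  dplyr::ungroup() %>% 
  dplyr::select(id, doc, doc_word_count)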
#> `summarise()` regrouping output by 'id' (override with `.groups` argument)

head(df_split)
#> # A tibble: 6 x 3
#>      id doc                                                       doc_word_count
#>   <dbl> <chr>                                                              <int>
#> 1     1 lorem ipsum dolor sit amet litora sollicitudin enim eu i~            500
#> 2     1 semper ullamcorper fames congue metus elementum condimen~            500
#> 3     1 tincidunt magnis vehicula amet elementum quisque eu vita~            439
#> 4     2 lorem ipsum dolor sit amet sed viverra amet velit ut vel~            500
#> 5     2 non arcu netus aptent imperdiet lobortis eros in nulla i~            500
#> 6     2 sem amet mattis sed feugiat ut arcu amet sed pellentesqu~            500
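
If the original capitalization and punctuation need to be preserved, one possible alternative is to split on whitespace instead of tokenizing. The following is only a minimal base R sketch under that assumption (a "word" is any run of non-whitespace characters); `chunk_words()`, `df_small`, and `df_chunks` are hypothetical names introduced here for illustration, and the chunk size of 2 is chosen just to keep the example short.

```r
# Hypothetical helper: split one string into chunks of at most `size`
# whitespace-separated words, keeping case and punctuation intact
chunk_words <- function(text, size = 500) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) == 0) return(character(0))
  grp <- (seq_along(words) - 1) %/% size   # 0-based chunk index per word
  unname(vapply(split(words, grp), paste, character(1), collapse = " "))
}

# Toy data standing in for the real tibble
df_small <- data.frame(
  id  = c(1, 2),
  doc = c("One two, three four five.", "Six seven"),
  stringsAsFactors = FALSE
)

# One character vector of chunks per document
chunks <- lapply(df_small$doc, chunk_words, size = 2)

# Repeat each id once per chunk and stack the chunks into rows
df_chunks <- data.frame(
  id  = rep(df_small$id, lengths(chunks)),
  doc = unlist(chunks),
  stringsAsFactors = FALSE
)
df_chunks$doc_word_count <- lengths(strsplit(df_chunks$doc, "\\s+"))
```

With this approach the word counts reflect whitespace-delimited tokens, so they may differ slightly from the counts produced by a tokenizer.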
bradisbrad