Imagine I have the following tibble df:
id doc doc_word_count
-------------------------------------------
1 Lorem ipsum dolor... 1439
2 Lorem ipsum dolor... 10234
3 Lorem ipsum dolor... 2000
4 Lorem ipsum dolor... 15034
5 Lorem ipsum dolor... 11000
where doc_word_count measures the number of words in doc. What I would like to do is split the doc column into chunks of 500 words per row (the number 500 is arbitrary). The new tibble df_split should look something like this:
id doc doc_word_count
-------------------------------------------
1 Lorem ipsum dolor... 500
1 labore et dolore... 500
1 totam rem aperiam... 439
2 ... 500
... ... 500
... ... ...
If fewer than 500 words remain for the last chunk, it should just hold however many words are left. I have looked at str_split and this StackOverflow post, but neither seems relevant here because I am not splitting on a pattern or on a fixed character width.
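
To make the desired behaviour concrete, here is a minimal sketch of word-based chunking, not a definitive solution. It assumes words are separated by whitespace and that dplyr, tidyr, and stringr are available. The helper name chunk_words, the toy data, and the chunk size of 2 (used instead of 500 so the result is easy to read) are all hypothetical and only for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical helper: split one string into chunks of at most `n` words.
# Assumes words are separated by whitespace.
chunk_words <- function(text, n = 500) {
  # Collapse repeated whitespace, then split into individual words
  words <- str_split(str_squish(text), " ")[[1]]
  # Assign each word to a chunk: words 1..n -> 1, n+1..2n -> 2, and so on
  groups <- ceiling(seq_along(words) / n)
  # Paste the words of each chunk back together into one string
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

# Toy stand-in for df (made-up data, not the real documents)
df <- tibble(
  id  = 1:2,
  doc = c("a b c d e", "f g h")
)

df_split <- df %>%
  # Replace each document with a list of its chunks
  mutate(doc = lapply(doc, chunk_words, n = 2)) %>%
  # One row per chunk; id is repeated for every chunk of a document
  unnest(doc) %>%
  # Recompute the word count per chunk
  mutate(doc_word_count = str_count(doc, "\\S+"))
```

With n = 500 the same pipeline would give one row per 500-word chunk, and the last chunk of each document would simply hold the remaining words.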