I am trying to extract the text from a large number of docx files (1500) stored in one folder, using readtext (after building the list of files with list.files).
You can find similar examples here: https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html
I am getting errors with some files (examples below). The problem is that when such an error occurs, the whole extraction process stops. I can identify the problematic file by setting verbosity = 3, but then I have to restart the extraction process to find the next problematic file(s).
My question: is there a way to keep the process running when an error is encountered, so that the problematic files are skipped instead?
I tried setting ignore_missing_files = TRUE, but this did not fix the problem.
Examples of the errors encountered:
write error in extracting from zip file
Error: 'C:\Users--- c/word/document.xml' does not exist.
Sorry for not posting a reproducible example; I do not know how to share one that involves large docx files. This is the code I am using:
library(readtext)
data_files <- list.files(path = "PATH", full.names = TRUE, recursive = TRUE) # PATH = the path to the folder where the documents are located
extracted_texts <- readtext(data_files, docvarsfrom = "filepaths", dvsep = "/", verbosity = 3, ignore_missing_files = TRUE) # extract the text from the files
write.csv2(extracted_texts, file = "data/text_extracts.csv", fileEncoding = "UTF-8") # export the extracted texts to csv
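One workaround I am considering, as a minimal sketch rather than a confirmed solution: read the files one at a time and wrap each call in tryCatch, so that a corrupt file is logged and skipped instead of stopping the run. The helper name read_each_safely and its reader argument are hypothetical (not part of readtext), and the final rbind assumes that every successful result has the same columns, which may not hold if docvarsfrom = "filepaths" is used on files at different folder depths.

```r
# Hypothetical helper: read files one by one so a single bad file cannot
# abort the whole extraction. `reader` defaults to readtext::readtext but
# can be any function that accepts a single file path.
read_each_safely <- function(files, reader = readtext::readtext, ...) {
  results <- list()        # successful results, in input order
  failed <- character(0)   # paths of the files that raised an error

  for (f in files) {
    res <- tryCatch(
      reader(f, ...),
      error = function(e) {
        # Report the problem and signal failure with NULL
        message("Skipping ", f, ": ", conditionMessage(e))
        NULL
      }
    )
    if (is.null(res)) {
      failed <- c(failed, f)
    } else {
      results[[length(results) + 1]] <- res
    }
  }

  # Stack the successful results; assumes they all share the same columns
  list(texts = do.call(rbind, results), failed = failed)
}
```

It could then be called as out <- read_each_safely(data_files, verbosity = 0), after which out$texts holds the extracted texts and out$failed lists the files that need to be inspected by hand.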