
I am trying to extract the text from a large number of docx files (1,500) placed in one folder, using readtext (after building the file list with list.files).

You can find similar examples here: https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html

I am getting errors with some files (examples below), and when such an error occurs the whole extraction process stops. I can identify the problematic file by setting verbosity = 3, but then I have to restart the extraction from scratch to find the next problematic file(s).

My question is whether there is a way to keep the process running when an error is encountered.

I changed ignore_missing_files = TRUE, but this did not fix the problem.

Examples of the errors encountered:

write error in extracting from zip file
Error: 'C:\Users--- c/word/document.xml' does not exist.

Sorry for not posting a reproducible example; I do not know how to share an example with large docx files. But this is the code:

library(readtext)
 
data_files <- list.files(path = "PATH", full.names = TRUE, recursive = TRUE)  # PATH = the path to the folder where the documents are located
extracted_texts <- readtext(data_files, docvarsfrom = "filepaths", dvsep = "/", verbosity = 3, ignore_missing_files = TRUE)  # extract the text from the files

write.csv2(extracted_texts, file = "data/text_extracts.csv", fileEncoding = "UTF-8")  # export the extracted texts to csv

Bahi8482
  • Please provide your code. – jsb Aug 07 '20 at 16:05
  • To continue processing in the presence of an error, use `tryCatch(readtext(...), error = function(e) {...})` (with some logic, perhaps as simple as returning the error for looking at later). – r2evans Aug 07 '20 at 16:22
  • The `document.xml` is the part of a docx file that actually contains text. If it is not there, the file is probably corrupted or has an incorrect file ending. – JBGruber Aug 07 '20 at 17:00
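
Following up on the last comment, one way to pre-screen the folder before running readtext is sketched below; it is only a minimal sketch using base R's unzip() (with PATH as in the question), and it simply checks whether word/document.xml is present in each archive:

# A docx file is a zip archive; a valid one contains "word/document.xml".
# Flag files where that entry is missing or the archive cannot be read at all.
files <- list.files(path = "PATH", pattern = "\\.docx$", full.names = TRUE, recursive = TRUE)
is_broken <- vapply(files, function(f) {
  contents <- tryCatch(unzip(f, list = TRUE)$Name, error = function(e) character(0))
  !"word/document.xml" %in% contents
}, logical(1))
files[is_broken]  # likely candidates for the "does not exist" error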

1 Answer


Let's first put together a reproducible example:

download.file("https://file-examples-com.github.io/uploads/2017/02/file-sample_1MB.docx", "test1.docx")
writeLines("", "test2.docx")

The first file produced here should be a proper docx file; the second one is rubbish.

I would wrap readtext in a small function that deals with the errors and warnings:

readtext_safe <- function(f) {
  # try to read the file; on an error or warning, return the sentinel "fail"
  out <- tryCatch(readtext::readtext(f),
                  error = function(e) "fail",
                  warning = function(e) "fail")
  if (isTRUE("fail" == out)) {
    # log the problematic file and return NULL
    write(f, "errored_files.txt", append = TRUE)
  } else {
    return(out)
  }
}

Note that I treat errors and warnings the same, which might not be what you actually want. We can use this function to loop through your files:

files <- list.files(pattern = "\\.docx$", ignore.case = TRUE, full.names = TRUE)

x <- lapply(files, readtext_safe)
x
#> [[1]]
#> readtext object consisting of 1 document and 0 docvars.
#> # Description: df[,2] [1 × 2]
#>   doc_id     text               
#>   <chr>      <chr>              
#> 1 test1.docx "\"Lorem ipsu\"..."
#> 
#> [[2]]
#> NULL

In the resulting list, failed files simply have a NULL entry, as nothing is returned for them. I like to keep a record of these errored files, and the function above writes a txt file that looks like this:

readLines("errored_files.txt")
#> [1] "./test2.docx"
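
To get from this list to a single csv as in the question, a minimal sketch (assuming all successful readtext results share the same columns, here doc_id and text) is to drop the NULL entries, row-bind the rest, and export; dropping the NULLs first avoids row-count mismatches when the list is turned into a data frame:

ok <- Filter(Negate(is.null), x)       # drop the failed (NULL) entries
extracted_texts <- do.call(rbind, ok)  # readtext objects are data frames, so rbind works
write.csv2(extracted_texts, file = "text_extracts.csv", fileEncoding = "UTF-8")
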
JBGruber
  • Thanks for the detailed answer. I have been looking for a fix for a while and this worked. I am trying to create a csv or excel file where each report will be in a cell, so I can use it to extract certain strings. I use this code: `write.csv2(x, file = "data/text_extracts.csv", fileEncoding = "UTF-8")` but I get this error: `Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 0, 1`. Do you have suggestions? I also tried converting "x" to a data.frame but got the same error. – Bahi8482 Aug 07 '20 at 19:17
  • If this works, please accept the answer and possibly upvote it if you found it useful. I'm not sure what this error means, but it seems the best way to export this to Excel would be `x_df – JBGruber Aug 07 '20 at 20:01