
I have a question about setting a maximum running time in Python. I am using pdfminer to convert PDF files to .txt. The problem is that, quite often, a file cannot be decoded and the conversion takes an extremely long time. So I want to use time.time() to limit the conversion time for each file to 20 seconds. In addition, I am running on Windows, so I cannot use the signal function.

The conversion itself works with pdfminer.convert_pdf_to_txt() (the module is "c" in my code), but I could not integrate time.time() into the while loop. In the following code, the while loop and time.time() do not seem to have any effect.

In summary, I want to:

  1. Convert the PDF file to a .txt file

  2. The time limit for each conversion is 20 seconds. If it runs out of time, throw an exception and save an empty file

  3. Save all the txt files under the same folder

  4. If there are any exceptions/errors, still save the file, but with empty content.

Here is the current code:

import converter as c
import os
import timeit
import time

yourpath = 'D:/hh/'

for root, dirs, files in os.walk(yourpath, topdown=False):

    for name in files:

        t_end = time.time() + 20

        try:
            while time.time() < t_end:

                c.convert_pdf_to_txt(os.path.join(root, name))

                t = os.path.split(os.path.dirname(os.path.join(root, name)))[1]
                a = str(os.path.split(os.path.dirname(os.path.join(root, name)))[0])

                g = str(a.split("\\")[1])
                with open("D:/f/" + g + "&" + t + "&" + name + ".txt", mode="w") as newfile:
                    newfile.write(c.convert_pdf_to_txt(os.path.join(root, name)))
                    print "yes"

            if time.time() > t_end:

                print "no"

                with open("D:/f/" + g + "&" + t + "&" + name + ".txt", mode="w") as newfile:
                    newfile.write("")

        except KeyboardInterrupt:
           raise

        except:
            for name in files:
                t = os.path.split(os.path.dirname(os.path.join(root, name)))[1]
                a = str(os.path.split(os.path.dirname(os.path.join(root, name)))[0])

                g = str(a.split("\\")[1])
                with open("D:/f/" + g + "&" + t + "&" + name + ".txt", mode="w") as newfile:
                    newfile.write("")
Peter Mortensen
SXC88
  • a helpful link. http://stackoverflow.com/questions/13293269/how-would-i-stop-a-while-loop-after-n-amount-of-time – Stormvirux Nov 22 '16 at 14:29
  • @Stormvirux Yes I read this post before completing the above code...I still could not figure out how to integrate in my code ;( – SXC88 Nov 22 '16 at 14:30
  • @SXC88 - Just finished my answer, hope it helps! – linusg Nov 22 '16 at 14:53
  • No version of this will work since there is nothing in here that will interrupt an ongoing conversion that takes longer than 20s. – pvg Nov 22 '16 at 14:55
  • @pvg - What do you mean? – linusg Nov 22 '16 at 14:58
  • @pvg could you suggest some solutions using time.time() for example?? – SXC88 Nov 22 '16 at 15:04
  • @SXC88 - Done, give it a try! – linusg Nov 22 '16 at 15:45
  • Oh since it is an embedded function thread.interrupt_main, I understand that probably you have no way to get around it! ;// – SXC88 Nov 22 '16 at 15:59
  • @SXC88 I don't think you can do this with time.time() which simply measures time. The design itself (20 seconds to convert) seems pretty questionable to begin with, do you really need to do that? What are you actually trying to accomplish? – pvg Nov 22 '16 at 16:44
  • http://stackoverflow.com/questions/40748555/python-threading-timer-set-time-limit – SXC88 Nov 22 '16 at 17:56

1 Answer


You have the wrong approach.

You define the end time and then immediately enter the while loop, whose condition (current timestamp lower than the end timestamp) is always True at that point. So the loop body is entered and you get stuck inside the conversion function: the condition is not checked again until that call returns.
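To make this concrete, here is a minimal sketch of the effect. `slow_call` is a hypothetical stand-in for the conversion function, and the durations are shortened so the example runs quickly:

```python
import time


def slow_call(seconds):
    # Hypothetical stand-in for a conversion that blocks for a while.
    time.sleep(seconds)
    return "converted"


start = time.time()
t_end = start + 0.1            # intended limit: 0.1 seconds
calls = 0
while time.time() < t_end:     # only evaluated before each call
    slow_call(0.3)             # blocks for 0.3 s; nothing interrupts it
    calls += 1
elapsed = time.time() - start

print("calls: %d, elapsed: %.2f s" % (calls, elapsed))
```

The loop body runs once and takes the full duration of the call, even though the limit was passed long before it returned.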

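On Unix, I would suggest the signal module, which is included in Python's standard library: with signal.SIGALRM you can abort a function after n seconds. A basic example can be seen in this Stack Overflow answer. SIGALRM is not available on Windows, though, so there a threading.Timer that calls thread.interrupt_main is a possible substitute.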
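For reference, a rough sketch of the signal-based idea. It only works on Unix and only in the main thread. `ConversionTimeout`, `timeout_handler` and `call_with_alarm` are names made up for this illustration, and `time.sleep` stands in for the slow conversion:

```python
import signal
import time

HAS_SIGALRM = hasattr(signal, "SIGALRM")  # False on Windows


class ConversionTimeout(Exception):
    """Raised by the alarm handler when the time limit is reached."""


def timeout_handler(signum, frame):
    raise ConversionTimeout()


def call_with_alarm(func, args=(), seconds=20.0, default=""):
    """Return func(*args), or default if it runs longer than `seconds`."""
    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # SIGALRM after `seconds`
    try:
        return func(*args)
    except ConversionTimeout:
        return default
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)    # cancel a pending alarm
        signal.signal(signal.SIGALRM, old_handler)


if HAS_SIGALRM:
    # A 1-second "conversion" with a 0.1-second limit is cut short.
    result = call_with_alarm(time.sleep, args=(1.0,), seconds=0.1)
    fast = call_with_alarm(len, args=("abc",), seconds=1.0)
else:
    result = fast = None  # no SIGALRM here, e.g. on Windows
```

On a system with SIGALRM, `result` is the empty default and `fast` is the normal return value of `len`.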

Using threading.Timer and thread.interrupt_main, your code would look like this:

import converter as c
import os
import threading
import thread  # Python 2 module name (renamed to _thread in Python 3)

yourpath = 'D:/hh/'
TIME_LIMIT = 20.0  # seconds allowed per file, as asked in the question

for root, dirs, files in os.walk(yourpath, topdown=False):
    for name in files:
        path = os.path.join(root, name)

        # Build the output name first so that every branch below can use it.
        t = os.path.split(os.path.dirname(path))[1]
        a = str(os.path.split(os.path.dirname(path))[0])
        g = str(a.split("\\")[1])
        outfile = "D:/f/" + g + "&" + t + "&" + name + ".txt"

        try:
            # Request a KeyboardInterrupt in the main thread after TIME_LIMIT.
            timer = threading.Timer(TIME_LIMIT, thread.interrupt_main)
            timer.start()  # without this call the timer never fires
            try:
                text = c.convert_pdf_to_txt(path)  # convert only once
            except KeyboardInterrupt:
                # Out of time: save an empty file.
                print("no")
                text = ""
            else:
                print("yes")
            finally:
                timer.cancel()

            with open(outfile, mode="w") as newfile:
                newfile.write(text)

        except KeyboardInterrupt:
            raise

        except:
            # Any other error: still save the file, but with empty content.
            with open(outfile, mode="w") as newfile:
                newfile.write("")

Just for the future: use four spaces for indentation and not too much whitespace ;)
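If the interrupt does not get through on Windows, another option is to run the conversion in a worker thread and simply stop waiting for it after the time limit. This is only a sketch under assumptions: `run_with_timeout` and `fake_convert` are hypothetical names, and `fake_convert` stands in for `c.convert_pdf_to_txt`. Note that the worker is abandoned, not stopped, so a stuck conversion keeps running in the background.

```python
import threading
import time


def run_with_timeout(func, args=(), timeout=20.0, default=""):
    """Run func(*args) in a worker thread; return default if it is not done in time."""
    result = [default]  # a list, so the worker can store its return value

    def worker():
        try:
            result[0] = func(*args)
        except Exception:
            result[0] = default  # treat errors like a timeout: empty content

    thread = threading.Thread(target=worker)
    thread.daemon = True    # a stuck worker must not keep the program alive
    thread.start()
    thread.join(timeout)    # wait at most `timeout` seconds
    if thread.is_alive():
        return default      # still running: give up on this file
    return result[0]


def fake_convert(delay):
    # Hypothetical stand-in for c.convert_pdf_to_txt(path).
    time.sleep(delay)
    return "text"


fast = run_with_timeout(fake_convert, args=(0.05,), timeout=1.0)
slow = run_with_timeout(fake_convert, args=(1.0,), timeout=0.1)
```

In the loop, the returned string (possibly empty) would be written to the output file in one place. Python threads cannot be killed from outside, so really stopping a conversion would take a separate process that can be terminated, for example via multiprocessing.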

Peter Mortensen
linusg
  • Hi, thank you for your answer! The problem is that I use the pdfminer module, which can only be used under Python 2.x, so I don't think the signal function is available here (it always throws an error, actually). In addition, with your code an error message pops up indicating a syntax error at "except KeyboardInterrupt"... I am so confused... ;(( – SXC88 Nov 22 '16 at 15:04
  • @SXC88 - Sorry, I had some wrong indentation. It is fixed now. The ``signal`` module is available under Python 2 too: https://docs.python.org/2/library/signal.html. Please run it again and tell me whether the syntax error is gone and what error you get about the signal module. – linusg Nov 22 '16 at 15:08
  • Thank you! But it always gives me this message: "line 14, in signal.signal(signal.SIGALRM, timeout_handler) AttributeError: 'module' object has no attribute 'SIGALRM'" ... – SXC88 Nov 22 '16 at 15:11
  • Ah, OK. So the problem is not that the signal module is unavailable, but that the attribute does not exist. Wait a sec, I'll fix that. – linusg Nov 22 '16 at 15:14
  • @SXC88 - I'm very sorry, but I guess there's no quick solution for this. The ``signal`` module has some Unix-only parts, and I'm running Ubuntu. I assume you have Windows, so I can't test it for you :( However, hopefully you now know what was wrong with your code, and I could help at least a bit! – linusg Nov 22 '16 at 15:20
  • Yes that is also what I understood ;(...Thank you very much anyways! – SXC88 Nov 22 '16 at 15:21
  • hey I found this post which is quite similar to my problem, maybe you have an idea about how to integrate the thread function into your above code? Thx! http://stackoverflow.com/questions/2933399/how-to-set-time-limit-on-input – SXC88 Nov 22 '16 at 15:27
  • Sure, that's quite similar to what we have for now. Just one minute! – linusg Nov 22 '16 at 15:34
  • Hey, I think we are almost there... The only problem is that each time it runs out of time, I need to press Ctrl+C manually to continue the program (and it prints "no"). Is there a way to avoid this and just save an empty file and move on to the next one once it runs out of time? – SXC88 Nov 22 '16 at 15:54
  • @SXC88 - try replacing the ``raise`` in the second ``KeyboardInterrupt`` by ``continue`` (Just a guess as I can't test the code without the pdf library) – linusg Nov 22 '16 at 16:05
  • it is the same situation . In fact I think the problem is located in the first KeyboardInterrupt...The second one just enable me to stop the program when I need. – SXC88 Nov 22 '16 at 16:11
  • @SXC88 Then try placing the continue there. It should actually move on to the next iteration in the for loop. Report if it works :) – linusg Nov 22 '16 at 16:17
  • It still doesn't work...But thank you anyway! You've been very helpful! – SXC88 Nov 22 '16 at 16:21
  • @SXC88 - You can just play around with it a little more and comment here on your progress. Just consider accepting or upvoting my answer if it was helpful to you :) – linusg Nov 22 '16 at 16:29
  • hey I just post my new code, maybe you want to have a look :) http://stackoverflow.com/questions/40748555/python-threading-timer-set-time-limit – SXC88 Nov 22 '16 at 17:56