How to detect and remove unwanted lines from a string?

Question

I am working on a project in which I have to extract text data from a PDF.

I am able to extract text from the PDF, but the extracted text sometimes contains lines that I would like to strip out.

Here's an example of unwanted lines:

ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue

Page 1 / 94

And here's an example of good lines (which I'd like to keep):

Dusk was falling as the boy arrived with his herd at an abandoned church.

I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago

Different PDFs can produce different unwanted lines.

How can I detect them?

is there a common pattern to the lines you don't want? if so, use a regular expression to find them. — Cruiser, Aug 02 '16 at 13:11
If you have the content line per line, you can determine the rules of a bad line. (Like a REGEX), and then use the String.matches to determine whether or not the bad REGEX is matched. — Dylan Meeus, Aug 02 '16 at 13:11
I have the content line by line. There are some common patterns like "Page 12/90". But some i don't even know like "ISBN 0-7225-3293-8.", "= CONTENTS = ", "Part One ", "Part Two ", "Epilogue " — Ashish Gogna, Aug 02 '16 at 13:15
Javascript for the backend (Node.js) and Java for Android. I can do this detection work in the backend or the android app. — Ashish Gogna, Aug 02 '16 at 13:19
Similar solution is found in below link http://stackoverflow.com/questions/18098400/how-to-get-raw-text-from-pdf-file-using-java — Jitendra Kumar. Balla, Aug 02 '16 at 13:40
@JitendraKumar.Balla, this doesn't exactly solve my problem... — Ashish Gogna, Aug 02 '16 at 13:51
since you are able to read the text from pdf, it should basically be String. If not, convert text to String and you may use String class methods like .startsWith("ISBN") || .startsWith("Page") etc. so on — JavaHopper, Aug 02 '16 at 15:16

Luke · Answer 1 · 2016-08-02T16:46:34.730

Option 1 - Give the computer a rule: If you can narrow down what content you would like to keep, you can filter your results on that basis. The obvious criterion that sticks out to me is the absence of special characters.

So let's say you agree that all "good lines" will be free of special characters ('/', '-', and '='), for example. If a line DOES contain one of these characters, you know you can remove it from the content you are keeping. This could be done in a for loop containing an if condition that looks something like this:

// Assumption: `text` holds the extracted PDF text as a single string
var lineArray = text.split("\n");

for (var cnt = 0; cnt < lineArray.length; cnt++)
{
    var line = lineArray[cnt];
    // Blank out any line that contains one of the special characters
    if (line.includes("/") || line.includes("-") || line.includes("="))
        lineArray[cnt] = "";
}

At the end of this code you could simply join all the text within the array, and it would no longer contain the unwanted lines. However, if there are unwanted lines that are virtually indistinguishable from good ones by characters, length, positioning, etc., this approach begins to break down.

This is because there is no rule you can give the computer to distinguish between the good and the bad without giving it a brain such as yours, one that recognizes parts of speech and sentence structure. In that case you might consider Option 2, which is just that.
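Before moving on, note that the rule in Option 1 can be made more targeted by matching whole patterns instead of single characters, as the comments on the question suggest. The following is only a sketch: the pattern list and the names UNWANTED_PATTERNS, isUnwanted and removeUnwantedLines are my own, written against the sample lines in the question, and other PDFs would need their own patterns.

```javascript
// Hypothetical rule list: each pattern describes one kind of unwanted line.
// These are assumptions based on the question's samples, not a complete set.
const UNWANTED_PATTERNS = [
  /^\s*Page\s+\d+\s*\/\s*\d+\s*$/i, // "Page 1 / 94", "Page 12/90"
  /\bISBN\b[\s:]*[\dXx-]{10,}/,     // "ISBN 0-7225-3293-8"
  /=\s*[A-Z][A-Z ]*=/,              // "= CONTENTS ="
];

// True if the line matches at least one of the unwanted patterns.
function isUnwanted(line) {
  return UNWANTED_PATTERNS.some((pattern) => pattern.test(line));
}

// Splits the text into lines, drops the unwanted ones, and joins the rest.
function removeUnwantedLines(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => !isUnwanted(line))
    .join("\n");
}
```

Unlike the single-character test, this keeps an ordinary sentence that happens to contain a hyphen, but it still only catches the kinds of lines you have already seen and written a pattern for.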

Option 2 - Give the computer a brain: Given that the text you want to remove is, based on what you have shown us, more or less incoherent documentation, an open source (or purchased) natural language processor may be what you are looking for.

I found a good beginner's intro at http://myreaders.info/10_Natural_Language_Processing.pdf with some information that might be of use to you. From the source,

"Linguistics is the science of language. Its study includes:

sounds (phonology),
word formation (morphology),
sentence structure (syntax),
meaning (semantics), and understanding (pragmatics) etc.

Syntactic Analysis : Here the analysis is of words in a sentence to know the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each others. Some word sequences may be rejected if they violate the rules of the language for how words may be combined. Example: An English syntactic analyzer would reject the sentence say : 'Boy the go the to store.' "

Using some sort of NLP, you can discover whether a given section of text contains a sentence or some incoherent rambling. This test could then be used as a filter in your program for what you would like to keep or remove.
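To show where such a test would plug in, here is a rough stand-in. To be clear, this is NOT a natural language processor: it is a crude heuristic I am assuming for illustration, which treats a line as sentence-like when enough of its words are common English function words and it contains few digits. The word list, the thresholds and the name looksLikeSentence are all hypothetical; a real NLP library would replace the body of the function.

```javascript
// A small set of common English function words (illustrative, not exhaustive).
const FUNCTION_WORDS = new Set([
  "a", "an", "the", "and", "but", "or", "of", "to", "in", "at", "as", "with",
  "he", "she", "it", "they", "i", "we", "you", "was", "were", "is", "are",
  "had", "has", "have", "his", "her", "their", "that", "this", "for", "on",
]);

// Heuristic stand-in for an NLP check; thresholds are guesses to be tuned.
function looksLikeSentence(line) {
  const trimmed = line.trim();
  const words = trimmed.toLowerCase().match(/[a-z']+/g) || [];
  if (words.length < 4) return false; // too short to judge

  const functionWordCount = words.filter((w) => FUNCTION_WORDS.has(w)).length;
  const functionWordRatio = functionWordCount / words.length;

  const digitCount = (trimmed.match(/\d/g) || []).length;
  const digitRatio = digitCount / trimmed.length;

  return functionWordRatio >= 0.2 && digitRatio < 0.1;
}
```

You would then keep only the lines for which looksLikeSentence returns true. Note that very short lines are rejected outright, so the thresholds would need tuning against your own PDFs.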

Side note - As your sample text appears to be not just sentences but literature, characters will sometimes speak in sentence fragments as part of the nature the author gave them. In this case, you could add a separate condition: if the text is enclosed in quotation marks and has no special characters, keep it regardless.
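That extra condition might be sketched as follows. The names isQuotedDialogue and shouldKeep are hypothetical, the special characters are the three from Option 1, and the sentence test is passed in as a parameter because whichever check you end up using is up to you.

```javascript
// A line fully wrapped in straight or curly double quotes.
const QUOTED_LINE = /^\s*["\u201C][^"\u201C\u201D]+["\u201D]\s*$/;
// The special characters assumed in Option 1: '/', '-' and '='.
const SPECIAL_CHARS = /[\/\-=]/;

// True for quoted text that contains none of the special characters.
function isQuotedDialogue(line) {
  return QUOTED_LINE.test(line) && !SPECIAL_CHARS.test(line);
}

// Keeps quoted dialogue regardless; otherwise defers to the supplied
// sentence test (a hypothetical callback taking a line, returning a boolean).
function shouldKeep(line, sentenceTest) {
  return isQuotedDialogue(line) || sentenceTest(line);
}
```

This only handles a line that is a single quoted fragment from start to finish; dialogue mixed with narration on the same line would still go through the sentence test.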

In the end NLP may be more work than you require or that you want to do, in which case Option 1 is likely going to be your best bet. On the other hand, it may be just the thing you are looking for. Whatever the case or if you decide you need some combination of the two, best of luck! I hope this answer helps.
