18

I'm writing a program in Java and one of the things that I need to do is to create a set of every valid location for a shortest path problem. The locations are defined in a .txt file that follows a strict pattern (one entry per line, no extra whitespace) and is perfect for using .nextLine to get the data. My problem is that 241 lines into the file (out of 432) the scanner stops working 3/4 of the way through an entry and doesn't recognize any new lines.

My code:

    //initialize state space
private static Set<String> posible(String posLoc) throws FileNotFoundException {
    Scanner s = new Scanner(new File(posLoc));
    Set<String> result = new TreeSet<String>();
    String availalbe;
    while(s.hasNextLine()) {
        availalbe = s.nextLine();
        result.add(availalbe);
    }
    s.close();
    return result;
}

The Data

Shenlong Gundam
Altron Gundam
Tallgee[scanner stops reading here]se
Tallgeese II
Leo (Ground)
Leo (Space)

Of course, "scanner stops reading here" is not in the data, I'm just marking where scanner stops reading the file. This is 3068 bytes into the file, but that shouldn't affect anything because in the same program, with nearly identical code, I'm reading a 261-line, 14KB .txt file that encodes the paths. Any help would be appreciated.

Thank you.

Fizzmaister
  • 1
    Could you upload the actual data file somewhere where we could take a look at it? – NPE Nov 30 '11 at 18:07
  • 1
    Are there any exceptions thrown? Are there any empty catch blocks? – Hovercraft Full Of Eels Nov 30 '11 at 18:10
  • I hope pastebin works for everyone. [data](http://pastebin.com/rt3mbXtD) – Fizzmaister Nov 30 '11 at 18:18
  • Oh, and no exceptions are thrown. I'm not using try catch because I'm lazy and I can guarantee the location of the file because only I'm using it and no one else. – Fizzmaister Nov 30 '11 at 18:25
  • What happens if you put a println inside of your while loop? – Bryan Nov 30 '11 at 18:31
  • It prints everything just like it should, until it reaches "Tallgeese" and then it only prints "Tallgee" and then it continues to the UI loop. – Fizzmaister Nov 30 '11 at 18:36
  • I'm sorry, Hovercraft, I don't understand the second question. – Fizzmaister Nov 30 '11 at 18:41
  • Hmm, I would try deleting a few entries from the beginning of the file... then see if it stops at the same line or a different line. – Bryan Nov 30 '11 at 18:44
  • Is the uploaded file the exact one you are using? I downloaded it (raw) and it ran fine with your code. The only strangeness I can see is that "Turn A Gundam" and "Turn A Gundam (True Power)" have a strange character in front. – Roger Lindsjö Nov 30 '11 at 18:47
  • I've tried removing from the beginning and near the problem spot (including the problem spot), both times it ended the same distance into the file (just under 3KB) and in the middle of an entry (a different one in each scenario). – Fizzmaister Nov 30 '11 at 18:49
  • yes, that's the exact file (admittedly copy & paste, not upload), and thanks for catching the odd character, it didn't show up in notepad++. – Fizzmaister Nov 30 '11 at 18:53
  • What about copying and pasting the contents of your current file into a new file and trying it on that. Very strange indeed... – Bryan Nov 30 '11 at 19:10
  • Gah! Copy and paste also fixed the problem. I guess there was just something weird with the original file. – Fizzmaister Nov 30 '11 at 19:28
  • Thanks to everyone for reporting the problem and solutions. I ran into this problem and Scanner just silently dropped the rest of the file. There was a non-UTF character in the file but Scanner died later in the file. I replaced the line where it died reading with characters I typed in from the keyboard and it kept dying the same number of characters into that line. Weird, but fixing my perl code to write out UTF-8 fixed my problem. – Sol Feb 26 '15 at 04:57

8 Answers

20

There's a problem with Scanner reading your file but I'm not sure what it is. It mistakenly believes that it's reached the end of file when it has not, possibly due to some funky String encoding. Try using a BufferedReader object that wraps a FileReader object instead.

e.g.,

   private static Set<String> posible2(String posLoc) {
      Set<String> result = new TreeSet<String>();
      BufferedReader br = null;
      try {
         br = new BufferedReader(new FileReader(new File(posLoc)));
         String availalbe;
         while((availalbe = br.readLine()) != null) {
             result.add(availalbe);            
         }
      } catch (FileNotFoundException e) {
         e.printStackTrace();
      } catch (IOException e) {
         e.printStackTrace();
      } finally {
         if (br != null) {
            try {
               br.close();
            } catch (IOException e) {
               e.printStackTrace();
            }
         }
      }
      return result;
  }

Edit
I tried reducing your problem to its bare minimum, and just this was enough to elicit the problem:

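   public static void main(String[] args) {
      // FILE_POS is the path to the data file; its definition is not shown here
      try {
         Scanner scanner = new Scanner(new File(FILE_POS));
         int count = 0;
         while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            System.out.printf("%3d: %s %n", count, line );
            count++;
         }
         scanner.close();
      } catch (FileNotFoundException e) {
         e.printStackTrace();
      }
   }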

I checked the Scanner object with a printf:

System.out.printf("Str: %-35s size%5d; Has next line? %b%n", availalbe, result.size(), s.hasNextLine());

and it showed that Scanner thought the file had ended. I was in the process of progressively deleting lines from the data file to see which line(s) caused the problem, but will leave that to you.
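One way to check whether Scanner really hit the end of the file or gave up early is its `ioException()` method. The Scanner documentation says that when the underlying source throws an `IOException`, the scanner assumes the end of input has been reached, and the most recent exception can be retrieved with `ioException()`. A decoding failure on bytes that are invalid in the expected charset appears to be one such case. The sketch below assumes the file is expected to be UTF-8; the class name `ScannerDiagnosis` and its `Result` holder are made up for illustration.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, not part of the original program.
final class ScannerDiagnosis {

    /** The lines Scanner returned, plus any I/O error it swallowed along the way. */
    static final class Result {
        final List<String> lines = new ArrayList<>();
        IOException swallowed; // stays null if Scanner reached the real end of the file
    }

    static Result readLines(File file) {
        Result result = new Result();
        // Assumption: the file is meant to be UTF-8.
        try (Scanner s = new Scanner(file, StandardCharsets.UTF_8.name())) {
            while (s.hasNextLine()) {
                result.lines.add(s.nextLine());
            }
            // Scanner does not rethrow read errors; it only remembers the last one.
            result.swallowed = s.ioException();
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = readLines(new File(args[0]));
        System.out.println("Lines read: " + r.lines.size());
        if (r.swallowed != null) {
            System.out.println("Scanner stopped early because of: " + r.swallowed);
        }
    }
}
```

If `swallowed` is non-null after the loop, the "end of file" was really a read or decoding error, which points at the file's encoding rather than at the line where the output happened to stop.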

Hovercraft Full Of Eels
  • Thanks, it worked. I have no idea what's wrong with the scanner, but that reads everything. – Fizzmaister Nov 30 '11 at 19:15
  • 3
    And so we will never know. :/ – Bryan Nov 30 '11 at 19:19
  • @Bryan: hopefully Fizzmaister will find the problem and report back. I would, but I'm behind on office work! :o – Hovercraft Full Of Eels Nov 30 '11 at 19:21
  • This is actually somewhat embarrassing, but I can't reproduce the error now. I commented out my old method and added this method (no problem), copy pasted the data into a new file and tried using the old method (no problem), switched back to the first file and old method (still no problem). This seems like it's going to be a mystery for the ages. – Fizzmaister Nov 30 '11 at 19:41
  • 2
    Copy and paste of file content may have changed the encoding. You'd be surprised what some text editors will do automagically. – rfeak Nov 30 '11 at 21:02
8

I encountered the same problem and this is what I did to fix it:

1. Saved the file I was reading from as UTF-8.
2. Created a new Scanner as shown below, specifying the encoding:


   Scanner scanner = new Scanner(new File("C:/IDSBRIEF/GuidData/"+sFileName),"UTF-8");   
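The same idea (naming the charset explicitly) also works without Scanner. The sketch below uses `Files.newBufferedReader`, whose documentation states that its read methods throw an `IOException` when a malformed or unmappable byte sequence is read, so a wrong encoding fails loudly instead of silently truncating the result. The class name `ExplicitCharsetRead` is made up for illustration, and the `TreeSet` mirrors the question's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the original answer.
final class ExplicitCharsetRead {

    /** Reads every line of the file into a sorted set, decoding with the given charset. */
    static Set<String> readLines(Path file, Charset charset) {
        Set<String> result = new TreeSet<>();
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = br.readLine()) != null) {
                result.add(line);
            }
        } catch (IOException e) {
            // Includes decoding errors when the bytes do not match the charset.
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

A call such as `ExplicitCharsetRead.readLines(path, StandardCharsets.UTF_8)` either returns all lines or throws, which makes an encoding mismatch much easier to notice.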
RNJ
Learner123
  • This solved my problem. Basically my scanner was assuming one encoding, while Notepad++ was assuming another. When I specified the same encoding in both places, my problem was solved. – Martin Jan 27 '14 at 22:18
5

I was having the same problem. The scanner would not read to the end of a file, actually stopping right in the middle of a word. I thought it was a problem with some limit set on the scanner, but I took note of the comment from rfeak about character encoding.

I re-saved the .txt file I was reading as UTF-8, and that solved the problem. It turns out that Notepad had defaulted to ANSI.

frogatto
The Aa of Ron
1

My case:

  • in my main program (A) it always reads 16384 bytes from a 41021 byte file. The character where it stops is in the middle of a line with normal printable text
  • if I create a small separate program (B) with only the Scanner and print lines, it reads the whole file
  • specifying "UTF-8" in (A) still reads 16384
  • specifying "ASCII" in (A) still reads 16384
  • specifying "Cp1252" in (A) reads the whole file
  • my input txt files are sent by users and I can't be sure that they will write them in any particular encoding

Conclusions

  • Scanner seems to read the file block by block and writes the correctly read data into the return String, but when it finds a block with a different encoding than it is expecting, it exits silently (ouch) and returns the partial string
  • the txt file I'm trying to read is Cp1252, my (A) source file is UTF-8 and my (B) source file is Cp1252 so that's why (B) worked without specifying an encoding

Solution

  • forget about Scanner and use

String fullFileContents = new String(Files.readAllBytes(myFile.toPath()));

Of course, non-ASCII characters can't be reliably read like this, because you don't know the encoding and `new String(byte[])` decodes with the platform's default charset, but the ASCII characters will be read for sure. Use it if you only need the ASCII characters in the file and the non-ASCII part can be discarded.
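A variant that does not depend on the platform's default charset is sketched below. It decodes with ISO-8859-1, which maps every byte to exactly one character and therefore cannot fail, and then keeps only the characters in the ASCII range. This is a swapped-in technique rather than the code used above, and the class name `AsciiOnlyRead` is made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original answer.
final class AsciiOnlyRead {

    /** Reads the whole file and returns only its ASCII characters (code points below 0x80). */
    static String readAsciiOnly(Path file) {
        byte[] bytes;
        try {
            bytes = Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // ISO-8859-1 is a one-byte-to-one-char mapping, so no byte is rejected or merged.
        String raw = new String(bytes, StandardCharsets.ISO_8859_1);
        StringBuilder ascii = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c < 0x80) {
                ascii.append(c); // drop every byte outside the ASCII range
            }
        }
        return ascii.toString();
    }
}
```

This only makes sense when the input encoding is ASCII-compatible (for example ISO-8859-1, Cp1252, or UTF-8); for something like UTF-16 the result would be garbage.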

golimar
  • 1
    ok, forget about Scanner, but don't forget about Charset! Without knowing the input charset, you cannot reliably convert bytes to String. I have been burnt many times by these kind of errors. Even an end of line can casually be interpreted the wrong way. The worst thing is that it can take years of everyday use for these bugs to show up. You have been warned. – Marcello Nuccio Oct 25 '18 at 16:30
  • in my case it was the Scanner method that took years of everyday use to show up ;) (the files are supposed to be ascii-only until someone managed to add some strange character and write the file in an encoding that was not the same as the Java source file encoding...) – golimar Oct 25 '18 at 22:43
  • The thing that I am not understanding is: why don't you specify the charset? The platform's default charset is a random value if you are not *really* careful to always set it properly. If you want to use the ASCII encoding, then why do not use `new String(bytes, StandardCharsets.US_ASCII)` instead of `new String(bytes)`? Side note: ASCII is a seven-bit encoding, maybe ISO_8859_1 is a better bet. – Marcello Nuccio Oct 26 '18 at 08:35
  • I want only the ascii part (they are system commands) and discard the possibly-non-ascii part (user comments, encoded sometimes in ISO_8859_1 and sometimes UTF-8 or even any other encoding, depending on the user and the programs they used to create, transfer or copy-paste the files). So the important thing for me is to make sure the whole file is read – golimar Oct 26 '18 at 08:50
  • But why don't you specify the charset? – Marcello Nuccio Oct 26 '18 at 10:30
  • Because the charset of the input files is unknown – golimar Oct 26 '18 at 11:59
  • Leaving the charset unspecified, does NOT mean that the [charset is autodetected](https://github.com/albfernandez/juniversalchardet), it only means that the platform's default charset is used. But: ["The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system."](https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html#defaultCharset--) In other words: you don't know it. Then you are reading an unknown input with an unknown charset. – Marcello Nuccio Oct 26 '18 at 15:46
  • Of course. But I am discarding the non-ascii part in my specific case – golimar Oct 27 '18 at 09:16
0

I had a txt file in which Scanner stopped reading at line 862, which was a weird problem. To try to replicate it, I created a different file. I first gave it fewer than 862 lines, then more than 862, and it worked fine both times.

So I believe that the problem was that on my previous file, at line 862, there was something wrong, like some character or symbol that could have misled Scanner to finish reading early.

In conclusion: based on this experience, I recommend finding the exact line where Scanner stops reading in order to solve this kind of problem.
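Since Scanner may stop somewhere other than at the offending character, it can help to look for the bad bytes directly. The sketch below decodes the raw bytes with a `CharsetDecoder` set to report errors and returns the byte offset of the first sequence that is invalid in the assumed charset, plus a helper that turns the offset into a line number. The class and method names are made up for illustration, and the line count assumes an ASCII-compatible encoding with `\n` line endings.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original answer.
final class BadByteLocator {

    /** Returns the byte offset of the first undecodable sequence, or -1 if everything decodes. */
    static int firstBadOffset(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // Large enough for the whole input, so the decoder never reports overflow.
        int capacity = (int) Math.ceil(data.length * (double) decoder.maxCharsPerByte()) + 1;
        CharBuffer out = CharBuffer.allocate(capacity);
        CoderResult result = decoder.decode(in, out, true);
        // On an error result the input position is left at the start of the bad bytes.
        return result.isError() ? in.position() : -1;
    }

    /** Returns the 1-based line number that contains the given byte offset. */
    static int lineOf(byte[] data, int offset) {
        int line = 1;
        for (int i = 0; i < offset; i++) {
            if (data[i] == '\n') {
                line++;
            }
        }
        return line;
    }
}
```

A typical use would be `firstBadOffset(Files.readAllBytes(path), StandardCharsets.UTF_8)` followed by `lineOf` on the result when it is not -1.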

evaldeslacasa
0

I also had a similar issue on my Linux server, and the code below finally worked for me.

Scanner scanner = new Scanner(new File(filename),"UTF-8");

0

I had the same problem with a CSV file: it worked on Windows but not on Linux.

Open the file with Notepad++ and change the encoding: choose "Encode in UTF-8 (with BOM)". That solved the problem in my case.
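One side effect of saving with a BOM is worth knowing: Java's UTF-8 decoder does not strip the byte order mark, so the first line read from such a file starts with the character U+FEFF. A minimal sketch of reading the file as UTF-8 and dropping that leading character follows; the class name `BomAwareRead` is made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
final class BomAwareRead {

    /** Reads all lines as UTF-8 and removes a leading byte order mark, if present. */
    static List<String> readLines(Path file) {
        try {
            List<String> lines = new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
            // The BOM, if any, shows up as U+FEFF at the start of the first line.
            if (!lines.isEmpty() && lines.get(0).startsWith("\uFEFF")) {
                lines.set(0, lines.get(0).substring(1));
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without this step, comparing the first field of the first CSV row against an expected header name can fail even though the two look identical when printed.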

anakin59490
-3

You should use this:

Scanner scanner = new Scanner(fileObj).useDelimiter("\z");
System.out.println(scanner.next());

radu florescu
  • 2
    This does not even compile, and if you corrected it to compile, it does not come close to solving this problem. – wvdz Apr 21 '15 at 10:23