14

I am seeing odd behavior with Scanner. With a particular set of files, reading works when I use the Scanner(FileInputStream) constructor, but not when I use the Scanner(File) constructor.

Case 1: Scanner(File)

Scanner s = new Scanner(new File("file"));
while(s.hasNextLine()) {
    System.out.println(s.nextLine());
}

Result: no output

Case 2: Scanner(FileInputStream)

Scanner s = new Scanner(new FileInputStream(new File("file")));
while(s.hasNextLine()) {
    System.out.println(s.nextLine());
}

Result: the file content outputs to the console.

The input file is a java file containing a single class.

I double checked programmatically (in Java) that:

  • the file exists,
  • is readable,
  • and has a non-zero filesize.

Scanner(File) normally works for me in this situation; I am not sure why it doesn't now.

haylem
kashiko
  • Is that the only code, or are there other things happening around it? This snippet seems incomplete, as there would be at least some exception handling taking place. Could you provide us with the whole code? – haylem Feb 29 '12 at 02:19
  • Interesting question. Please post your actual code and a pastebin with your file. Also, what is the output of `Charset.defaultCharset()` on your system? – Perception Feb 29 '12 at 02:41
  • @Perception: I thought of that as well, but the source of Scanner seems to hint that both cases use the default charset, unless a constructor that specifies it explicitly is used. – haylem Feb 29 '12 at 02:49
  • @kashiko: Ah, another **very important** follow-up question: what's the size of the file? – haylem Feb 29 '12 at 02:50
  • I have updated my original post to have code copied from my source file. Just as a test I am reading the file and outputting it to the terminal. The file is a Java source file from an open source project. My character set is UTF-8. The size of the file is 18357 bytes. – kashiko Feb 29 '12 at 02:54
  • size does not matter, look at my answer below (i found out how it happens, not why actually) – guido Feb 29 '12 at 03:37
  • Wow, I was just having the opposite problem (works with `File`, not with `FileInputStream`). I don't know if it's related but +1 nonetheless. Wasted a good hour on this. – rath Jan 15 '16 at 10:10

2 Answers

7

hasNextLine() calls findWithinHorizon(), which in turn calls findPatternInBuffer(), searching for a match of the line-terminator pattern defined as .*(\r\n|[\n\r\u2028\u2029\u0085])|.+$
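As a rough illustration of what that pattern accepts, here is a minimal sketch that applies the same regular expression directly with java.util.regex. The class name LineTerminatorPatternDemo and the helper firstMatch are made up for this example; it only exercises the pattern quoted above and does not go through Scanner itself.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineTerminatorPatternDemo {
    // The pattern quoted above: a (possibly empty) run of characters ending in a
    // line terminator, or otherwise a non-empty run up to the end of the input.
    static final Pattern LINE =
            Pattern.compile(".*(\r\n|[\n\r\\u2028\\u2029\\u0085])|.+$");

    // Hypothetical helper: returns the text matched at the start of the input,
    // or null when nothing matches there.
    static String firstMatch(String input) {
        Matcher m = LINE.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Escape control characters so the matched text is visible on the console.
        String[] samples = {"a\nb", "abc", ""};
        for (String s : samples) {
            String match = firstMatch(s);
            System.out.println(match == null ? "no match" : match.replace("\n", "\\n"));
        }
    }
}

The point is that the pattern needs at least some decoded characters to work with; if the underlying reader hands nothing back, there is nothing to match and hasNextLine() has no line to report.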

The strange thing is this: if the file contains only ASCII, for instance an 0x0A line terminator, findPatternInBuffer returns a positive match regardless of file size and regardless of how the Scanner was constructed (from a FileInputStream or from a File). But if the file contains a byte outside the ASCII range (i.e. greater than 0x7F), hasNextLine() returns true with FileInputStream and false with File.

Very simple test case:

Create a file which contains just the character "a" followed by a line feed:

# hexedit file     
00000000   61 0A                                                a.

# java Test
using File: true
using FileInputStream: true

Now edit the file with hexedit so that it reads:

# hexedit file
00000000   61 0A 80                                             a..

# java Test
using File: false
using FileInputStream: true

The Java test code contains nothing beyond what is already in the question:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

public class Test {
    public static void main(String[] args) {
        File file1 = new File("file");
        // Same file, two constructors; try-with-resources closes both scanners.
        try (Scanner s1 = new Scanner(file1);
             Scanner s2 = new Scanner(new FileInputStream(file1))) {
            System.out.println("using File: " + s1.hasNextLine());
            System.out.println("using FileInputStream: " + s2.hasNextLine());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

So it turns out this is a charset issue. In fact, changing the test to:

 Scanner s1 = new Scanner(file1, "latin1");

we get:

# java Test 
using File: true
using FileInputStream: true
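
For anyone who wants to reproduce this without a hex editor, the following is a minimal sketch. The class name ScannerCharsetDemo, the helper names, and the temporary file are assumptions made for illustration, and the charset is passed explicitly as "UTF-8" (the original test relied on the platform default, which was UTF-8 on the asker's machine). It also prints Scanner.ioException(), a standard method that returns the last IOException thrown by the scanner's underlying source, which may show why hasNextLine() gives up.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Scanner;

public class ScannerCharsetDemo {
    // Bytes from the test case above: 'a', line feed, then 0x80,
    // which is not a valid UTF-8 sequence on its own.
    static final byte[] DATA = {0x61, 0x0A, (byte) 0x80};

    // Hypothetical helper: writes the sample bytes to a temporary file.
    static File writeSample() throws IOException {
        File f = File.createTempFile("scanner-demo", ".txt");
        f.deleteOnExit();
        Files.write(f.toPath(), DATA);
        return f;
    }

    // hasNextLine() as seen through Scanner(File, charsetName).
    static boolean viaFile(File f, String charsetName) throws IOException {
        try (Scanner s = new Scanner(f, charsetName)) {
            boolean result = s.hasNextLine();
            // Any decoding error swallowed by the scanner is reported here.
            System.out.println("  ioException (File): " + s.ioException());
            return result;
        }
    }

    // hasNextLine() as seen through Scanner(InputStream, charsetName).
    static boolean viaStream(File f, String charsetName) throws IOException {
        try (Scanner s = new Scanner(new FileInputStream(f), charsetName)) {
            boolean result = s.hasNextLine();
            System.out.println("  ioException (FileInputStream): " + s.ioException());
            return result;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = writeSample();
        for (String cs : new String[] {"UTF-8", "ISO-8859-1"}) {
            System.out.println(cs + ":");
            System.out.println("using File: " + viaFile(f, cs));
            System.out.println("using FileInputStream: " + viaStream(f, cs));
        }
    }
}

If the cause is what this answer suggests, the practical fix is to pass the file's real encoding to the Scanner constructor instead of relying on the platform default.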
guido
  • Interesting. When looking at the `Scanner` constructors, they all seem to assume the default charset if none is specified, yet there's a difference at runtime, as you point out. Maybe the channel used internally forces a different one, one level deeper? I'm wondering... Will try to check when I get a chance. – haylem Feb 29 '12 at 13:42
5

From looking at the Oracle/Sun JDK 1.6.0_23 implementation of Scanner, the Scanner(File) constructor creates a FileInputStream, which is meant for raw binary data.

This points to a difference in buffering and parsing technique used when invoking one constructor or another, which will directly impact your code on the call to hasNextLine().

Scanner(InputStream) uses an InputStreamReader, while Scanner(File) uses an InputStream passed to a ByteChannel (and, in your case, probably reads the whole file in one go, thus advancing the cursor).
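
To show how those two kinds of reader can treat the same bytes differently, here is a minimal sketch that bypasses Scanner and builds both readers by hand. The class name DecoderPolicyDemo and the helpers are made up for this example, and the idea that Scanner wires its constructors in exactly this way is an assumption based on the reading of the JDK source described above. The calls themselves (InputStreamReader, Channels.newReader, Charset.newDecoder) are standard library API.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class DecoderPolicyDemo {
    // 'a', line feed, then 0x80, which is malformed as UTF-8.
    static final byte[] DATA = {0x61, 0x0A, (byte) 0x80};

    // Drains a reader into a String.
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[64];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // InputStreamReader substitutes U+FFFD for malformed input.
    static String viaInputStreamReader() throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(DATA), StandardCharsets.UTF_8)) {
            return readAll(r);
        }
    }

    // A channel reader built on a fresh decoder reports malformed input
    // by throwing, because a new CharsetDecoder defaults to REPORT.
    static String viaChannelReader() throws IOException {
        try (Reader r = Channels.newReader(
                Channels.newChannel(new ByteArrayInputStream(DATA)),
                StandardCharsets.UTF_8.newDecoder(), -1)) {
            return readAll(r);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("InputStreamReader read "
                + viaInputStreamReader().length() + " chars");
        try {
            System.out.println("Channel reader read "
                    + viaChannelReader().length() + " chars");
        } catch (CharacterCodingException e) {
            System.out.println("Channel reader failed: " + e);
        }
    }
}

A reader that throws on the first malformed byte would leave a Scanner with no input at all, which would be consistent with hasNextLine() returning false for the Scanner(File) case.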

haylem
  • 2
    The contracts for Scanner(File) and Scanner(FileInputStream) read the same, though, so they should produce the same behavior from the API user's point of view. I have used Scanner(File) before without this issue. – kashiko Feb 29 '12 at 02:59
  • Yanick: Thanks, this is an interesting question. But there seems to be more to this... (Still, the stuff you can dig up from the JDK's code sometimes... I had a "What??" moment when I noticed there are multiple definitions of `ArrayList`, for instance, and no, they aren't exactly identical.) – haylem Feb 29 '12 at 03:05