15

I am trying to execute the code below using Eclipse (with a Maven configuration) on a cluster with 2 workers, each with 2 cores. I have also tried it with spark-submit.

import java.io.Serializable;
import java.util.List;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingWorkCount implements Serializable {

    public static void main(String[] args) {
        Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
        // Streaming context against the standalone master, with a 1-second batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(
                "spark://192.168.1.19:7077", "JavaWordCount",
                new Duration(1000));
        // Monitor the directory for new text files
        JavaDStream<String> trainingData = jssc.textFileStream(
                "/home/bdi-user/kaushal-drive/spark/data/training").cache();
        // For every batch, collect the RDD's lines to the driver and print them
        trainingData.foreach(new Function<JavaRDD<String>, Void>() {

            public Void call(JavaRDD<String> rdd) throws Exception {
                List<String> output = rdd.collect();
                System.out.println("Sentences Collected from files " + output);
                return null;
            }
        });

        trainingData.print();
        jssc.start();
        jssc.awaitTermination();
    }
}

And here is the log output from that run:

15/01/22 21:57:13 INFO FileInputDStream: New files at time 1421944033000 ms:

15/01/22 21:57:13 INFO JobScheduler: Added jobs for time 1421944033000 ms
15/01/22 21:57:13 INFO JobScheduler: Starting job streaming job 1421944033000 ms.0 from job set of time 1421944033000 ms
15/01/22 21:57:13 INFO SparkContext: Starting job: foreach at StreamingKMean.java:33
15/01/22 21:57:13 INFO DAGScheduler: Job 3 finished: foreach at StreamingKMean.java:33, took 0.000094 s
Sentences Collected from files []
-------------------------------------------
15/01/22 21:57:13 INFO JobScheduler: Finished job streaming job 1421944033000 ms.0 from job set of time 1421944033000 ms
Time: 1421944033000 ms
-------------------------------------------
15/01/22 21:57:13 INFO JobScheduler: Starting job streaming job 1421944033000 ms.1 from job set of time 1421944033000 ms


15/01/22 21:57:13 INFO JobScheduler: Finished job streaming job 1421944033000 ms.1 from job set of time 1421944033000 ms
15/01/22 21:57:13 INFO JobScheduler: Total delay: 0.028 s for time 1421944033000 ms (execution: 0.013 s)
15/01/22 21:57:13 INFO MappedRDD: Removing RDD 5 from persistence list
15/01/22 21:57:13 INFO BlockManager: Removing RDD 5
15/01/22 21:57:13 INFO FileInputDStream: Cleared 0 old files that were older than 1421943973000 ms: 
15/01/22 21:57:13 INFO FileInputDStream: Cleared 0 old files that were older than 1421943973000 ms: 
15/01/22 21:57:13 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()

The problem is that I am not getting any data from the files that are in that directory. Please help me.

Kaushal

7 Answers

13

Try it with another directory, and then copy the files into that directory while the job is running.
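
A minimal sketch of that workflow (the file names and the source directory here are hypothetical, and the streaming job from the question is assumed to be already running against the training directory):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class CopyIntoMonitoredDir {
    public static void main(String[] args) throws Exception {
        // Copy an already-written data file into the directory that
        // textFileStream is monitoring, while the streaming job is running.
        Files.copy(
                Paths.get("/home/bdi-user/kaushal-drive/spark/data/incoming/part-00000.txt"),
                Paths.get("/home/bdi-user/kaushal-drive/spark/data/training/part-00000.txt"),
                StandardCopyOption.REPLACE_EXISTING);
    }
}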

pzecevic
5

I had the same problem. Here is my code:

lines = jssc.textFileStream("file:///Users/projects/spark/test/data");

textFileStream is very sensitive; what I ended up doing was:

1. Run the Spark program
2. touch datafile
3. mv datafile datafile2
4. mv datafile2 /Users/projects/spark/test/data

and that did it.

matthieu lieber
  • I am using Windows: lines = jssc.textFileStream("file:///c:/data"); lines.foreachRDD(file => { file.foreach(fc => { println(fc) }) }). I am not getting output. How do I resolve this? – Gnana Mar 26 '18 at 01:30
1

I think you need to add the scheme, i.e. file:// or hdfs://, in front of your path.


It is in fact file:// or hdfs:// that needs to be added in front of the path, so the full path becomes file:///tmp/file.txt or hdfs:///user/data. If there is no NameNode set in the configuration, the latter needs to be hdfs://host:port/user/data.
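
As a rough sketch (the directory paths and the NameNode host/port are placeholders, and jssc is the JavaStreamingContext from the question):

// Local filesystem: the directory must be visible to every executor, not just the driver.
JavaDStream<String> localLines = jssc.textFileStream("file:///tmp/streaming-input");

// HDFS with an explicit NameNode host and port.
JavaDStream<String> hdfsLines = jssc.textFileStream("hdfs://namenode-host:8020/user/data");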

tgpfeiffer
  • Using HDFS it works, but when I use the local file system with the file:/// prefix (Spark does not support file://), it is not working. – Kaushal Jan 23 '15 at 08:21
  • That may be because you are using a cluster and the path specified must be accessible by all Spark executors, i.e. it is not enough if the Spark driver can access it. – tgpfeiffer Jan 26 '15 at 01:06
0

The JavaDoc suggests that this function only streams new files.

Ref: https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/streaming/api/java/JavaStreamingContext.html#textFileStream(java.lang.String)

Create an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using key as LongWritable, value as Text and input format as TextInputFormat). Files must be written to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored.
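
A minimal sketch of that "moving into the monitored directory" pattern, assuming both paths are on the same filesystem (all file and directory names here are hypothetical):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MoveIntoMonitoredDir {
    public static void main(String[] args) throws Exception {
        // The file is fully written in a staging directory first...
        Path staged = Paths.get("/data/staging/batch-001.txt");
        // ...and then renamed into the directory passed to textFileStream,
        // so the stream sees it appear as a single, complete file.
        Path monitored = Paths.get("/data/training/batch-001.txt");
        Files.move(staged, monitored, StandardCopyOption.ATOMIC_MOVE);
    }
}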

Anil G
0

textFileStream only picks up files in the monitored folder as they are being added or updated.

If you just want to read existing files once, you can use SparkContext.textFile instead.
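
For example, a rough batch-mode sketch via the Java API's JavaSparkContext (the master URL and directory are taken from the question; the app name is made up):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class BatchTextFile {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("BatchWordCount")
                .setMaster("spark://192.168.1.19:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One-off read of all files already present in the directory.
        JavaRDD<String> lines = sc.textFile("/home/bdi-user/kaushal-drive/spark/data/training");
        System.out.println("Lines read: " + lines.count());

        sc.stop();
    }
}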

wzktravel
0

You have to take into account that Spark Streaming will only read new files in the directory, not updated ones (once they are already in the directory), and they must all have the same format.

Source

froblesmartin
0

I've been scratching my head for hours, and what worked for me is

Adelin