I'm using Spark on Google Cloud Platform. I'm trying to read a file from gs://<bucket>/dir/file, but the log output shows:

FileNotFoundException: gs:/bucket/dir/file (No such file or dir exist)

The missing / is obviously the problem. How can I solve this?

This is my code:

val files = Array(("call 1","gs://<bucket>/google-cloud-dataproc-metainfo/test/123.wav"))
val splitAudioFiles = sc.parallelize(files.map(x => splitAudio(x, 5, sc)))

def splitAudio(path: (String, String), interval: Int, sc: SparkContext): (String, Seq[(String,Int)]) = {
   val stopWords = sc.broadcast(loadTxtAsSet("gs://<bucket>/google-cloud-dataproc-metainfo/test/stopword.txt")).value
   val keyWords = sc.broadcast(loadTxtAsSet("gs://<bucket>/google-cloud-dataproc-metainfo/test/KeywordList.txt")).value

   val file = new File((path._2))
   val audioTitle = path._1
   val fileFormat: AudioFileFormat = AudioSystem.getAudioFileFormat(file)
   val format = fileFormat.getFormat

1 Answer

It appears you're handing the gs:// location to AudioSystem.getAudioFileFormat as a java.io.File (or URL). Neither understands gs:// URIs, and File treats the string as a local path, which is why the log shows gs:/ with a single slash. Instead, you'll need to use the Hadoop FileSystem interface to acquire an InputStream for the file and call AudioSystem.getAudioFileFormat(InputStream). I think something like this should work:

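import java.io.BufferedInputStream
import java.net.URL
import javax.sound.sampled.{AudioFileFormat, AudioSystem}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ...
val urls: RDD[URL] = ...
val formats: RDD[AudioFileFormat] = urls.map { url =>
  val hadoopPath = new Path(url.toURI)
  // Hadoop picks the FileSystem implementation from the path's scheme (e.g. gs://)
  val hadoopFs = hadoopPath.getFileSystem(new Configuration())
  // AudioSystem may need mark/reset on the stream, so wrap it in a buffer
  val inputStream = new BufferedInputStream(hadoopFs.open(hadoopPath))
  try AudioSystem.getAudioFileFormat(inputStream)
  finally inputStream.close()
}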
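
To see where the slash goes, here is a small standalone sketch that compares how java.io.File and java.net.URI treat the same string. The bucket name `my-bucket` and the object name `SlashDemo` are hypothetical placeholders, and the File result shown in the comment is what I'd expect on Linux/macOS (Windows normalizes differently).

```scala
import java.io.File
import java.net.URI

object SlashDemo {
  // "my-bucket" is a hypothetical placeholder bucket name.
  val location = "gs://my-bucket/dir/file"

  // java.io.File treats the string as a local path and normalizes it,
  // collapsing repeated separators.
  def asLocalPath(s: String): String = new File(s).getPath

  // java.net.URI keeps scheme, authority (the bucket) and path apart.
  def parts(s: String): (String, String, String) = {
    val uri = new URI(s)
    (uri.getScheme, uri.getAuthority, uri.getPath)
  }

  def main(args: Array[String]): Unit = {
    println(asLocalPath(location)) // on Linux/macOS: gs:/my-bucket/dir/file
    println(parts(location))       // (gs,my-bucket,/dir/file)
  }
}
```

So the string itself is fine; it only gets mangled once it is wrapped in a File, which is why going through Hadoop's Path avoids the problem.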
Angus Davis
  • As an addendum, if you don't have easy access to `sc.hadoopConfiguration` (for example, if you're opening the files from inside worker tasks) then it's also fine to just do `hadoopPath.getFileSystem(new Configuration())`, since `Configuration` loads resources appropriately based on the various classpaths configured. – Dennis Huo Mar 05 '16 at 01:07
  • Good catch, Dennis. Updated the answer to use new Configuration() – Angus Davis Mar 05 '16 at 21:17
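
Following up on the comments above, here is a minimal sketch of a helper that takes the path as a plain String and accepts the Configuration as a parameter, so the same code can be called on the driver with `sc.hadoopConfiguration` or inside a task with the default `new Configuration()`. The names `AudioFormats` and `audioFileFormat` are made up for illustration, and it assumes the GCS connector is on the classpath (as it normally is on Dataproc).

```scala
import java.io.BufferedInputStream
import javax.sound.sampled.{AudioFileFormat, AudioSystem}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object AudioFormats {
  // Hypothetical helper: reads the audio header of any path Hadoop can open
  // (gs://, hdfs://, file:/ ...), depending on the configured FileSystems.
  def audioFileFormat(pathString: String,
                      conf: Configuration = new Configuration()): AudioFileFormat = {
    val hadoopPath = new Path(pathString)
    // The FileSystem implementation is chosen from the path's scheme.
    val fs = hadoopPath.getFileSystem(conf)
    // Buffer the stream because AudioSystem may need mark/reset support.
    val in = new BufferedInputStream(fs.open(hadoopPath))
    try AudioSystem.getAudioFileFormat(in)
    finally in.close()
  }
}
```

With the `files` array from the question, a call could look like `sc.parallelize(files).map { case (title, p) => (title, AudioFormats.audioFileFormat(p).getFormat.getSampleRate) }`. Returning a plain number rather than the AudioFileFormat itself is deliberate, since the javax.sound classes are not Serializable and would fail if Spark had to ship them between nodes.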