6

Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file that contains a single CSV record spanning two physical lines. I am trying to make Apache Beam read that record as one element, but cannot get it to work.

import apache_beam

def print_each_line(line):
    print(line)

path = './input/testfile.csv'
# Here are the contents of testfile.csv
# foo,bar,"blah blah
# more blah blah",baz

p = apache_beam.Pipeline()

(p
 | 'ReadFromFile' >> apache_beam.io.ReadFromText(path)
 | 'PrintEachLine' >> apache_beam.FlatMap(lambda line: print_each_line(line))
 )

p.run()

# Here is the output:
# foo,bar,"blah blah
# more blah blah",baz

The above code parses the input as two lines even though the standard for multi-line csv files is to wrap multi-line elements within double-quotes.

Brandon

3 Answers

2

ReadFromText doesn't parse CSV; it only splits the input on newlines, without regard to quoting. You can, however, use Python's csv.reader. Here's an example:

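import csv
import io

import apache_beam

def print_each_line(line):
  print(line)

p = apache_beam.Pipeline()

# FileSystems.open returns a binary stream; in Python 3, csv.reader needs
# text, so the stream is wrapped in io.TextIOWrapper.
(p
 | apache_beam.Create(["test.csv"])
 | apache_beam.FlatMap(lambda filename:
     csv.reader(io.TextIOWrapper(
         apache_beam.io.filesystems.FileSystems.open(filename), newline='')))
 | apache_beam.FlatMap(print_each_line))

p.run()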

Output:

['foo', 'bar', 'blah blah\nmore blah blah', 'baz']
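To see why csv.reader gives one row where a line-oriented reader gives two, here is a minimal standard-library sketch with no Beam involved. The SAMPLE string is an assumption: it mirrors the testfile.csv contents from the question.

```python
import csv
import io

# Assumed to match the contents of testfile.csv from the question.
SAMPLE = 'foo,bar,"blah blah\nmore blah blah",baz\n'

# Splitting on newlines, as a line-oriented reader does, breaks the record.
naive_lines = SAMPLE.splitlines()

# csv.reader tracks quoting, so the embedded newline stays inside one field.
# newline='' leaves line endings untouched for the csv module to interpret.
rows = list(csv.reader(io.StringIO(SAMPLE, newline='')))

print(len(naive_lines))
print(rows)
```

The naive split yields two fragments, while csv.reader yields a single four-field row.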
Udi Meiri
0

ReadFromText parses a text file as newline-delimited elements. So ReadFromText treats two lines as two elements. If you would like to have the contents of the file as a single element, you could do the following:

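import apache_beam as beam

# path is the file path from the question, e.g. './input/testfile.csv'.
with open(path) as f:
    contents = [f.read()]  # the whole file becomes a single element

p = beam.Pipeline()
p | beam.Create(contents)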
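Note that this reads the entire file into memory while the pipeline is being constructed, so it only suits small files. The single string element still has to be split into records. A possible follow-up step, sketched here with only the standard library, is a generator that could be passed to beam.FlatMap; parse_csv_text is a hypothetical helper name, not part of Beam.

```python
import csv
import io

def parse_csv_text(contents):
    """Yield one list of fields per CSV record found in a whole-file string.

    Hypothetical helper: it assumes the full file content arrives as a
    single str element, as produced by reading the file in one go.
    """
    yield from csv.reader(io.StringIO(contents, newline=''))

# Two records; the first contains a quoted, embedded newline.
rows = list(parse_csv_text('foo,bar,"blah blah\nmore blah blah",baz\nx,y,z,w\n'))
print(rows)
```

Under these assumptions, something like p | beam.Create(contents) | beam.FlatMap(parse_csv_text) would emit one element per CSV record.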
Arjun Kay
0

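None of the other answers worked for me, but this did. The difference is that the file handle is wrapped in io.TextIOWrapper, because FileSystems.open returns a binary stream and csv.reader needs text in Python 3: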

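import csv
import io

import apache_beam as beam
from apache_beam.io import WriteToText

# known_args is assumed to come from argparse, with input and output paths.
(
  p
  | beam.Create([known_args.input])
  | beam.FlatMap(lambda filename:
    csv.reader(io.TextIOWrapper(beam.io.filesystems.FileSystems.open(filename))))
  | "Take only name" >> beam.Map(lambda x: x[0])
  | WriteToText(known_args.output)
)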
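The role of io.TextIOWrapper can be checked in isolation. In the sketch below, io.BytesIO is a stand-in for the binary handle that FileSystems.open returns, rows_from_binary is a hypothetical helper name, and UTF-8 is an assumed encoding.

```python
import csv
import io

# Assumed sample bytes, mirroring the file from the question.
RAW = b'foo,bar,"blah blah\nmore blah blah",baz\n'

def rows_from_binary(stream, encoding='utf-8'):
    """Decode a binary stream and parse it as CSV (hypothetical helper)."""
    text = io.TextIOWrapper(stream, encoding=encoding, newline='')
    return list(csv.reader(text))

# Feeding bytes straight to csv.reader fails in Python 3.
try:
    next(csv.reader(io.BytesIO(RAW)))
    bytes_error = None
except csv.Error as exc:
    bytes_error = exc

# Wrapping the binary stream first gives csv.reader the text it expects.
rows = rows_from_binary(io.BytesIO(RAW))
print(bytes_error)
print(rows)
```

This would explain why the earlier answer, written for Python 2, can fail on Python 3 until the wrapper is added.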
Juan Acevedo