
For a university project, I'm implementing a Question Answering system (currently bAbI dataset Task 5, see https://research.fb.com/downloads/babi/) with neural nets in TensorFlow, and I want to use TFRecords for my input pipeline.

My idea is that one Example, in TFRecords terms, should consist of the context for the question, the question itself, the answer, and the supporting sentence number (an int pointing to the sentence in the context that is most important for answering the question). Here is how I've defined the function:

import tensorflow as tf

def make_example(context, question, answer, support):
    ex = tf.train.SequenceExample()

    # Sequence features: one int64 feature per token
    fl_context = ex.feature_lists.feature_list["context"]
    fl_question = ex.feature_lists.feature_list["question"]
    fl_answer = ex.feature_lists.feature_list["answer"]
    # The supporting sentence number is a single int, so it goes into
    # the (non-sequence) context of the SequenceExample
    ex.context.feature["support"].int64_list.value.append(support)

    for token in context:
        fl_context.feature.add().int64_list.value.append(token)
    for qWord in question:
        fl_question.feature.add().int64_list.value.append(qWord)
    for ansWord in answer:
        fl_answer.feature.add().int64_list.value.append(ansWord)

    return ex

However, before passing the context, question, and answer, I want to embed the words and represent them by their GloVe vectors, i.e. by an (m, d) matrix, where m is the number of tokens in the sentence and d is the number of dimensions of each word vector. My make_example function doesn't seem to handle this, as I get:

TypeError: (array([[ -9.58490000e-01,   1.73210000e-01,   2.51650000e-01,
                    -5.61450000e-01,  -1.21440000e-01,   1.54350000e+00,
                    ...
                    -3.64690000e-01,   1.5 has type <class 'tuple'>, but expected one of: (<class 'int'>,)

The error points to the fl_context.feature.add().int64_list.value.append(token) line above. Could someone point out where I've misunderstood the concept of TFRecords and give me advice on how to approach the problem?

I've searched a lot for learning materials, but the TFRecords examples I find usually deal with image data. So far my references are https://medium.com/@TalPerry/getting-text-into-tensorflow-with-the-dataset-api-ffb832c8bec6 and http://web.stanford.edu/class/cs20si/lectures/notes_09.pdf.

Thanks a lot in advance!

SimonAda

1 Answer


The solution to my question can be found here: https://github.com/simonada/q-and-a-tensorflow/blob/master/src/Q%26A%20with%20TF-%20TFRecords%20and%20Eager%20Execution.ipynb

My approach is as follows (note that instead of writing the embedded GloVe vectors into the TFRecords, I store token ids and leave the embedding lookup to the model):

  1. Store the texts in a csv file, one (context, question, answer) triple per row (a minimal writer sketch follows this list).

  2. Define a function to convert a sequence to a tf_example; the vectorize helper it relies on is sketched after this list. In my case:

    import tensorflow as tf

    def sequence_to_tf_example(context, question, answer):
        # Convert each text into a list of vocabulary ids
        context_ids = vectorize(context, False, word_to_index)
        question_ids = vectorize(question, False, word_to_index)
        answer_ids = vectorize(answer, True, word_to_index)
        ex = tf.train.SequenceExample()

        context_tokens = ex.feature_lists.feature_list["context"]
        question_tokens = ex.feature_lists.feature_list["question"]
        answer_tokens = ex.feature_lists.feature_list["answer"]

        for token in context_ids:
            context_tokens.feature.add().int64_list.value.append(token)
        for token in question_ids:
            question_tokens.feature.add().int64_list.value.append(token)
        for token in answer_ids:
            answer_tokens.feature.add().int64_list.value.append(token)

        return ex
    
  3. Define write functions

    import csv

    def write_example_to_tfrecord(context, question, answer, tfrecord_file, writer):
        example = sequence_to_tf_example(context, question, answer)
        writer.write(example.SerializeToString())

    def write_data_to_tf_record(filename):
        file_csv = filename + '.csv'
        file_tfrecords = filename + '.tfrecords'
        with open(file_csv) as csvfile:
            readCSV = csv.reader(csvfile, delimiter=',')
            next(readCSV)  # skip header
            writer = tf.python_io.TFRecordWriter(file_tfrecords)
            for row in readCSV:
                write_example_to_tfrecord(row[0], row[1], row[2], file_tfrecords, writer)
            writer.close()
    
  4. Define read functions

    def read_from_tfrecord(ex):
        sequence_features = {
            "context": tf.FixedLenSequenceFeature([], dtype=tf.int64),
            "question": tf.FixedLenSequenceFeature([], dtype=tf.int64),
            "answer": tf.FixedLenSequenceFeature([], dtype=tf.int64)
        }

        # Parse the example (returns a dictionary of tensors)
        _, sequence_parsed = tf.parse_single_sequence_example(
            serialized=ex,
            sequence_features=sequence_features
        )

        return {"context": sequence_parsed['context'],
                "question": sequence_parsed['question'],
                "answer": sequence_parsed['answer']}
    
  5. Create the dataset (an iteration sketch follows this list)

    def make_dataset(path, batch_size=128):
        '''
        Makes a TensorFlow dataset that is shuffled, batched and parsed.
        '''
        # Read a tfrecord file. This makes a dataset of raw TFRecords
        dataset = tf.data.TFRecordDataset([path])
        # Apply/map the parse function to every record. Now the dataset
        # is a bunch of dictionaries of Tensors
        dataset = dataset.map(read_from_tfrecord)
        # Shuffle the dataset
        dataset = dataset.shuffle(buffer_size=10000)

        # Specify padding for each tensor separately
        dataset = dataset.padded_batch(batch_size, padded_shapes={
            "context": tf.TensorShape([None]),
            "question": tf.TensorShape([None]),
            "answer": tf.TensorShape([None])
        })

        return dataset
    
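For step 1, a minimal sketch of how the csv file could be produced, assuming the bAbI examples are already available as (context, question, answer) string triples; the write_texts_to_csv name and the rows argument are illustrative, not from the notebook:

    import csv

    def write_texts_to_csv(filename, rows):
        # rows: an iterable of (context, question, answer) string triples
        with open(filename + '.csv', 'w', newline='') as csvfile:
            writer = csv.writer(csvfile, delimiter=',')
            # Header row, which write_data_to_tf_record skips on reading
            writer.writerow(['context', 'question', 'answer'])
            writer.writerows(rows)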
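The vectorize and word_to_index names used in step 2 are defined in the linked notebook; conceptually, vectorize turns a text into a list of vocabulary ids. A minimal sketch, assuming word_to_index is a dict mapping token to id with an '<unk>' entry (the whitespace tokenization and the unused flag are simplifications):

    def vectorize(text, is_answer, word_to_index):
        # Whitespace-tokenize and look up each token's id;
        # unknown tokens fall back to the '<unk>' id.
        # is_answer only mirrors the call signature above; the notebook
        # may treat answers differently.
        tokens = text.strip().split()
        return [word_to_index.get(token, word_to_index['<unk>']) for token in tokens]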
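With eager execution enabled (as in the linked notebook), the dataset from step 5 can be iterated directly. A usage sketch, assuming TensorFlow 1.x and a train.tfrecords file produced by the write functions above (the file name is illustrative):

    import tensorflow as tf
    tf.enable_eager_execution()  # must run before any other TF ops

    dataset = make_dataset('train.tfrecords')
    for batch in dataset:
        # Each batch is a dict of padded int64 tensors of shape (batch_size, max_len)
        print(batch['context'].shape, batch['question'].shape, batch['answer'].shape)
        break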
SimonAda