My bigram language model works fine when given one word as input, but when I give two words to my trigram model, it behaves strangely and predicts 'unknown' as the next word. My code:

def get_unigram_probability(word):
  if word not in unigram:
      return 0
  return unigram[word] / total_words
    
def get_bigram_probability(words):
  if words not in bigram:
      return 0
  return bigram[words] / unigram[words[0]]
    
V = len(vocabulary)

def get_trigram_probability(words):
  if words not in trigram:
      return 0
  # add-one (Laplace) smoothing; parentheses keep numerator and denominator grouped
  return (trigram[words] + 1) / (bigram[words[:2]] + V)
  

For bigram next-word prediction:

def find_next_word_bigram(words):
  candidate_list = []

  # Calculate probability for each word by looping through them
  for word in vocabulary:
    p2 = get_bigram_probability((words[-1], word))
    candidate_list.append((word, p2))
    
  # sort the candidates so the most probable words come first
  candidate_list.sort(key=lambda x: x[1], reverse=True)
  # print(candidate_list)
  return candidate_list[0]

For trigram next-word prediction:

def find_next_word_trigram(words):
  candidate_list = []

  # Calculate probability for each word by looping through them
  for word in vocabulary:
    p3 = get_trigram_probability((words[-2], words[-1], word)) if len(words) >= 3 else 0
    candidate_list.append((word, p3))
    
  # sort the candidates so the most probable words come first
  candidate_list.sort(key=lambda x: x[1], reverse=True)
  # print(candidate_list)
  return candidate_list[0]

I just want to know where in the code I should make changes so that the trigram model predicts the next word when given an input of two words.

mtekin

1 Answer

When you build your trigrams, use a special BOS (beginning of sentence) token so you can handle short sequences. Before each sentence, add BOS twice, like so:

I like cheese
BOS BOS I like cheese

This way, when you take input from the user, you can prepend BOS BOS to it and complete even short sequences.
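A minimal sketch of the padding idea, not the asker's actual training code: it assumes sentences are whitespace-tokenized strings, that the literal string "BOS" never occurs in the corpus, and that add-one smoothing over the vocabulary size is wanted. The names `build_counts`, `context`, and `corpus` are made up for illustration.

```python
from collections import Counter

BOS = "BOS"  # assumed padding token; any string absent from the corpus works


def build_counts(sentences):
    """Count trigrams and their two-word contexts, padding each sentence with BOS BOS."""
    trigram, context = Counter(), Counter()
    vocabulary = set()
    for sentence in sentences:
        words = sentence.split()
        vocabulary.update(words)
        tokens = [BOS, BOS] + words
        for i in range(len(tokens) - 2):
            trigram[tuple(tokens[i:i + 3])] += 1
            context[tuple(tokens[i:i + 2])] += 1
    return trigram, context, vocabulary


def find_next_word_trigram(words, trigram, context, vocabulary):
    """Return the (word, probability) pair with the highest add-one smoothed probability."""
    # Prepending BOS BOS guarantees a two-token context even for 0 or 1 input words.
    padded = [BOS, BOS] + list(words)
    w1, w2 = padded[-2], padded[-1]
    V = len(vocabulary)
    scored = [
        (word, (trigram[(w1, w2, word)] + 1) / (context[(w1, w2)] + V))
        for word in sorted(vocabulary)
    ]
    return max(scored, key=lambda pair: pair[1])


# Toy corpus, invented for this example.
corpus = ["I like cheese", "I like bread", "I like cheese a lot"]
trigram, context, vocabulary = build_counts(corpus)

print(find_next_word_trigram(["I", "like"], trigram, context, vocabulary))
print(find_next_word_trigram(["I"], trigram, context, vocabulary))
print(find_next_word_trigram([], trigram, context, vocabulary))
```

Because the padded input always has at least two tokens, the context is always available, so a length check such as the `len(words) >= 3` condition in the question is no longer needed.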

polm23