Here is an example of doing sequence classification using a model to determine if two sequences are paraphrases of each other. The two examples give two different results. Can you help me explain why tokenizer.encode
and tokenizer.encode_plus
give different results?
Example 1 (with .encode_plus()
):
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(**paraphrase)[0]
not_paraphrase_classification_logits = model(**not_paraphrase)[0]
Example 2 (with .encode()
):
paraphrase = tokenizer.encode(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]