Mapping similar text strings between two pandas dataframes

Question

I have a dataset named data_feed that contains feedback, given as:

feedback                                                    
Fast Delivery. Always before time.Thanks                     
I have order brown shoe .And I got olive green shoe          
Delivery guy is a decent nd friendly guy                     
Its really good .. my daughter loves it                      
One t shirt was fully crushed rest everything is good        
Superfast delivery! I'm impressed.                           
.........................                                    .
........................                                     .
so on

and another dataset named reference:

refer_feedback               sub-category           category   sentiment
The delivery was on time.   delivery speed          delivery   positive
he was polite enough        delivery man behaviour  delivery   positive
worst products              product quality         general    negative

Now I want to extend the dataset data_feed with these columns:

feedback  sub-category   category   sentiment

How can I match similar feedback? That is, I want to match the column feedback in dataframe data_feed with the column refer_feedback in dataframe reference and assign the corresponding sub-category, category and sentiment labels.

For example, the first feedback in data_feed is quite similar to the first feedback in reference, so the first observation for data_feed would be:

feedback                                  subcategory     category     sentiment                                   
Fast Delivery. Always before time.Thanks  delivery speed  delivery   positive

The scope of this problem is huge - much larger than an SO answer IMHO. — Ami Tavory, May 13 '18 at 19:47
@James I wrote about an approach below. If you have any questions, let me know. — Nathan, May 14 '18 at 12:41

Nathan · Answer 1 · 2018-05-13T22:12:47.740

One strategy you could use is to analyze the feedback with LDA to discover common topics. You could then use the topics to map like to like between the two tables.

LDA analyzes what is referred to as a 'corpus' of documents. 'Document' is used abstractly here: each example of refer_feedback or feedback is one document, and the collection of examples from either column could form a corpus.

Two different approaches could work:

Corpus from `refer_feedback`

Each example of refer_feedback will be a document in your corpus for this approach. The number of topics you are looking for is equal to the count of unique subcategories.

Use nltk to remove stop words and perform lemmatisation. Use gensim to perform LDA on the results to get your topic model. Use this topic model to classify feedback as it comes in.

Corpus from `feedback`

If you do not have enough refer_feedback examples or you try the first approach and it does not work, try building a corpus from a large set of feedback examples. In this approach, the number of topics is not as easy to determine but it would be valuable to start with something close to the number of subcategories that you have.

Use nltk again to remove stop words and perform lemmatisation. Build the LDA model.

Next, you will need to manually map the topics generated by the model to subcategories. Save this mapping.

When future feedback comes in, use the LDA model to discover its most probable topics, then use your mapping of topic to subcategory to assign the appropriate fields.
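The lookup step can be sketched in plain Python. The topic ids in `TOPIC_TO_LABELS`, the `label_feedback` helper, and the `min_prob` cut-off are all hypothetical; the real mapping would come from inspecting the topics your model actually produces. The input is assumed to be a list of (topic_id, probability) pairs, which is the shape gensim's `LdaModel.get_document_topics` returns.

```python
# Hypothetical hand-made mapping from LDA topic id to the label columns.
# The topic ids are invented for illustration; the label values come from
# the `reference` table in the question.
TOPIC_TO_LABELS = {
    0: {"sub-category": "delivery speed", "category": "delivery", "sentiment": "positive"},
    1: {"sub-category": "delivery man behaviour", "category": "delivery", "sentiment": "positive"},
    2: {"sub-category": "product quality", "category": "general", "sentiment": "negative"},
}


def label_feedback(topic_probs, mapping=TOPIC_TO_LABELS, min_prob=0.5):
    """Return the labels of the most probable topic.

    Returns None when there are no topics, when the best topic is weaker
    than `min_prob` (an assumed cut-off), or when it has no mapping entry.
    """
    if not topic_probs:
        return None
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    if prob < min_prob or topic_id not in mapping:
        return None
    return mapping[topic_id]


# Made-up topic distribution for one piece of feedback.
example_probs = [(0, 0.82), (1, 0.10), (2, 0.08)]
labels = label_feedback(example_probs)
```

Returning None for weak matches leaves those rows unlabeled so they can be reviewed by hand instead of being forced into the nearest subcategory. The resulting dictionaries can be written back to data_feed as the new sub-category, category and sentiment columns.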
