
I am new to LDA and I have three questions. I would like to classify my text (tags) with LDA. First I filter out words that have been used by only one user, machine tags, tags consisting only of digits, and tags with a frequency of less than 3. Then I calculate the number of topics with the elbow method, and there I get a memory error (this will be the third question). The number of topics suggested by the elbow method is 8 (I have filtered out some more tags to work around the memory issue, but I will need to apply this to bigger datasets in the future).

  1. Should I use tf-idf as a preprocessing step for LDA? Or does it not make sense if I have already filtered out the "useless" tags? I think I don't fully understand what exactly is going on in LDA.

    from gensim import corpora, models

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lda = models.LdaModel(corpus_tfidf, id2word=dictionary, alpha=0.1, num_topics=8)
    corpus_lda = lda[corpus_tfidf]
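For what it's worth, LDA is a generative model over word *counts*, so it is normally trained on the plain bag-of-words corpus rather than on tf-idf weights (tf-idf is more commonly paired with LSI). A minimal pure-Python sketch of the bag-of-words representation that `doc2bow` produces, using a toy tag corpus (hypothetical data):

```python
from collections import Counter

# Toy tag documents, standing in for the filtered tags (hypothetical data).
texts = [["sunset", "beach", "sunset"],
         ["beach", "surf"],
         ["surf", "board", "surf"]]

# Build a word->id mapping, as gensim's corpora.Dictionary does internally.
vocab = sorted({w for doc in texts for w in doc})
word2id = {w: i for i, w in enumerate(vocab)}

# Bag-of-words corpus: each document becomes (term_id, raw_count) pairs.
# These raw counts are what LDA consumes; no tf-idf weighting is applied.
corpus = [sorted(Counter(word2id[w] for w in doc).items()) for doc in texts]
print(corpus[0])  # → [(0, 1), (2, 2)]  ("beach" once, "sunset" twice)
```

So one reasonable setup is `LdaModel(corpus, ...)` on the raw counts, keeping tf-idf only for the LSI comparison.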
    
  2. Does it make sense to validate the topic quality with LSI? As I understand it, LSI is a method for dimensionality reduction, so I use it to apply K-means and to see whether the 8 topic clusters actually look like clusters. But to be honest, I don't really understand what exactly I am visualising.

    # Project the 8-dimensional topic vectors down to 2 dimensions with LSI
    lsi = models.LsiModel(corpus_lda, id2word=dictionary, num_topics=2)
    lsi_coord = "file.csv"
    fcoords = codecs.open(lsi_coord, 'w', 'utf-8')
    for vector in lsi[corpus_lda]:
        if len(vector) != 2:
            continue
        fcoords.writelines("%6.12f\t%6.12f\n" % (vector[0][1], vector[1][1]))
    fcoords.close()

    num_topics = 8
    X = np.loadtxt(lsi_coord, delimiter="\t")
    my_kmeans = KMeans(num_topics).fit(X)
    k_means_labels = my_kmeans.labels_
    k_means_cluster_centers = my_kmeans.cluster_centers_

    colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'greenyellow']
    for k, col in zip(range(num_topics), colors):
        my_members = k_means_labels == k
        plt.scatter(X[my_members, 0], X[my_members, 1], s=30, c=col, zorder=10)
        cluster_center = k_means_cluster_centers[k]
        plt.scatter(cluster_center[0], cluster_center[1], marker='x', s=30,
                    linewidths=3, color='r', zorder=10)
    plt.title('K-means clustering')
    plt.show()
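As to what the plot shows: the LSI step is essentially a truncated SVD, so the scatter plot is the documents projected onto the two leading singular directions of the document-topic matrix. A minimal numpy sketch of that projection on a toy matrix (hypothetical weights, not the gensim internals):

```python
import numpy as np

# Toy document-topic matrix: 6 documents x 4 topics (hypothetical weights).
M = np.array([[0.80, 0.10, 0.05, 0.05],
              [0.70, 0.20, 0.05, 0.05],
              [0.10, 0.80, 0.05, 0.05],
              [0.10, 0.70, 0.10, 0.10],
              [0.05, 0.05, 0.80, 0.10],
              [0.05, 0.05, 0.70, 0.20]])

# Truncated SVD: keep the two largest singular values,
# analogous to LsiModel(..., num_topics=2).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
coords = U[:, :2] * s[:2]  # one 2-D coordinate pair per document
```

Documents with similar topic mixtures land close together in this plane, which is what the K-means scatter plot is showing; it is a sanity check on cluster separation, not a measure of topic quality.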
    
  3. Memory issues. I am trying to create a matrix that has a value for every unique term; if a term is not in the document, it gets a zero. So it is effectively a sparse matrix, because I have around 1300 unique terms and every document contains only about 5 of them. The memory error arises when converting to np.array. I guess I have to optimize the matrix somehow.

    # creating term-by-document matrix
    Y = []
    for z in corpus_lda:
        temp_dict = dict(z)  # {term_id: weight} for this document
        Y1 = [temp_dict.get(i, 0) for i in range(len(dictionary.keys()))]
        Y.append(Y1)
    Y = np.array(Y)
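Since almost all entries are zero, one way around the memory error is to never materialise the dense array at all and build a `scipy.sparse` matrix from the (id, weight) pairs directly. A minimal sketch with hypothetical data in the same shape as `corpus_lda`:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical output of lda[corpus]: per-document (id, weight) pairs.
corpus_lda = [[(0, 0.7), (3, 0.3)],
              [(1, 0.9)],
              [(2, 0.5), (3, 0.5)]]
num_terms = 8  # would be ~1300 for the full term-by-document matrix

rows, cols, vals = [], [], []
for doc_id, doc in enumerate(corpus_lda):
    for term_id, weight in doc:
        rows.append(doc_id)
        cols.append(term_id)
        vals.append(weight)

# CSR stores only the non-zero entries, so ~5 terms per document cost
# memory proportional to 5 per row instead of 1300.
Y = csr_matrix((vals, (rows, cols)), shape=(len(corpus_lda), num_terms))
```

gensim also ships a converter for exactly this (`gensim.matutils.corpus2csc`, which yields a terms-by-documents CSC matrix you can transpose), and scikit-learn's `KMeans.fit` accepts sparse input, so the dense `np.array(Y)` step can be dropped entirely.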
    

I took the following code from here: Calculating the percentage of variance measure for k-means?

    K = range(1, 30)  # number of clusters to try
    KM = [kmeans(Y, k) for k in K]
    centroids = [cent for (cent, var) in KM]

    from scipy.spatial.distance import cdist
    D_k = [cdist(Y, cent, 'euclidean') for cent in centroids]
    cIdx = [np.argmin(D, axis=1) for D in D_k]
    dist = [np.min(D, axis=1) for D in D_k]
    avgWithinSS = [sum(d) / Y.shape[0] for d in dist]
    kIdx = 8  # index of the chosen elbow point in K

    # elbow curve
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(K, avgWithinSS, 'b*-')
    ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
    plt.grid(True)
    plt.xlabel('Number of clusters')
    plt.ylabel('Average within-cluster sum of squares')
    plt.title('Elbow for KMeans clustering')
    plt.show()
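For reference, since scikit-learn's `KMeans` is already used above, the whole `cdist`/`argmin` pipeline can be collapsed into the estimator's `inertia_` attribute, which is the within-cluster sum of squares. A sketch on toy data standing in for `Y` (hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with three obvious blobs, standing in for the topic matrix Y.
rng = np.random.RandomState(0)
Y = np.vstack([rng.randn(20, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

# inertia_ is the within-cluster sum of squares for the fitted model.
K = range(1, 10)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(Y).inertia_ for k in K]

# Divide by the number of samples to match avgWithinSS above.
avg_within_ss = [w / len(Y) for w in wss]
```

This also works directly on a sparse `Y`, so it sidesteps the dense matrix from question 3 as well.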

Any ideas for any of the questions are highly appreciated!
