I am using topic_.set_value(each_topic, word, prob) to change the values of cells in a pandas dataframe. I initialized a numpy array of zeros with the required shape and converted it to a pandas dataframe. I am now replacing these zeros by iterating over all the rows and columns using the code above. The problem is that there are around 50,000 cells, and every time I set a value pandas prints the whole dataframe to the console. I want to suppress this behavior. Any ideas?

EDIT

I have two dataframes: topic_, which is the target dataframe, and tw, which is the source dataframe. topic_ is a topic-by-word matrix in which each cell stores the probability of a word occurring in a particular topic. I initialized the topic_ dataframe to zeros using numpy.zeros. A sample of the tw dataframe:

print(tw)
    topic_id                                     word_prob_pair
0          0  [(customer, 0.061703717964), (team, 0.01724444...
1          1  [(team, 0.0260560163563), (customer, 0.0247838...
2          2  [(customer, 0.0171786268847), (footfall, 0.012...
3          3  [(team, 0.0290787264225), (product, 0.01570401...
4          4  [(team, 0.0197917953222), (data, 0.01343226630...
5          5  [(customer, 0.0263740639141), (team, 0.0251677...
6          6  [(customer, 0.0289764173735), (team, 0.0249938...
7          7  [(client, 0.0265082412402), (want, 0.016477447...
8          8  [(customer, 0.0524006965405), (team, 0.0322975...
9          9  [(generic, 0.0373422774996), (product, 0.01834...
10        10  [(customer, 0.0305256248248), (team, 0.0241559...
11        11  [(customer, 0.0198707090364), (ad, 0.018516805...
12        12  [(team, 0.0159852971954), (customer, 0.0124540...
13        13  [(team, 0.033444510469), (store, 0.01961003290...
14        14  [(team, 0.0344793243818), (customer, 0.0210975...
15        15  [(team, 0.026416114692), (customer, 0.02041691...
16        16  [(campaign, 0.0486186973667), (team, 0.0236024...
17        17  [(customer, 0.0208270072145), (branch, 0.01757...
18        18  [(team, 0.0280889397541), (customer, 0.0127932...
19        19  [(team, 0.0297011415217), (customer, 0.0216007...

My topic_ dataframe has shape num_topics (which is 20) by number_of_unique_words (the number of distinct words in the tw dataframe).

Following is the code I am using to replace each value in the topic_ dataframe:

for each_topic in range(num_topics):
    a = tw['word_prob_pair'].iloc[each_topic]
    for word, prob in a:
        topic_.set_value(each_topic, word, prob)
Clock Slave
  • I don't understand: if you use no `print`, only `topic_.set_value(each_topic, word, prob)`, why is it printing? Btw, why use this method? It is very slow, and if you check the [docs](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) there are many better methods. What is your source of data? `Lists`, `numpy array`? Can you explain more? – jezrael Feb 20 '17 at 06:32
  • @jezrael I was following the answer given in http://stackoverflow.com/questions/13842088/set-value-for-particular-cell-in-pandas-dataframe. According to that answer, set_value is the fastest. I am not using `print`. The data I am using to replace the zeros in the `topic_` dataframe comes from another dataframe. A row of the source dataframe looks like: `[(target_df_col_1, value_1), (target_df_col_2, value_2), ..., (target_df_col_n, value_n)]`. I am iterating over each row of the source dataframe and then over each (column, value) pair to place it in the target dataframe. – Clock Slave Feb 20 '17 at 06:51
  • Hmmm, it seems there have to be better methods. Can you add a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) with an input data sample and the desired output? – jezrael Feb 20 '17 at 07:13
  • And I think the slow part is the iteration (best avoided in pandas), although `set_value` is the fastest method. – jezrael Feb 20 '17 at 07:33
  • @jezrael Added the code. Please check. Let me know if you need further description – Clock Slave Feb 20 '17 at 08:00
  • Please check my solution. – jezrael Feb 20 '17 at 08:24

2 Answers

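Just assign the return value to a variable. set_value returns the dataframe, and the interactive interpreter echoes the result of any bare expression: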

>>> df.set_value(index=1,col=0,value=1)
          0         1
0  0.621660 -0.400869
1  1.000000  1.585177
2  0.962754  1.725027
3  0.773112 -1.251182
4 -1.688159  2.372140
5 -0.203582  0.884673
6 -0.618678 -0.850109
>>> a=df.set_value(index=1,col=0,value=1)
>>>

To initialize the df, it's better to use this:

pd.DataFrame(np.zeros_like(pd_n), index=pd_n.index, columns=pd_n.columns)
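
For the data in the question, initialization and filling can probably be combined, so no zero frame and no per-cell loop are needed. A minimal sketch, assuming tw has the structure shown in the question; the two rows and their probabilities below are made-up stand-ins, not the asker's data:

```python
import pandas as pd

# Toy stand-in for the question's tw dataframe (illustrative values only).
tw = pd.DataFrame({
    "topic_id": [0, 1],
    "word_prob_pair": [
        [("customer", 0.06), ("team", 0.02)],
        [("team", 0.03), ("product", 0.01)],
    ],
})

# Turn each row's (word, prob) pairs into a dict. pandas aligns the dicts
# by key, so every distinct word becomes a column and a word that is
# missing from a topic becomes NaN, which is then replaced by 0.
topic_ = pd.DataFrame(
    [dict(pairs) for pairs in tw["word_prob_pair"]],
    index=tw["topic_id"].to_list(),
).fillna(0.0)
```

Nothing is echoed here because the result is bound to a name, and the whole topic-by-word matrix is built in one call.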
aslavkin

If you do not wish to create a variable ('a' in the suggestion above), use Python's conventional throwaway name '_'. Your statement then becomes:

_ = df.set_value(index=1,col=0,value=1)
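
Note that set_value was deprecated in pandas 0.21 and removed in 1.0, so on a current install the loop from the question would need a different setter. A hedged sketch using the `.at` accessor instead: the assignment is a statement rather than an expression, so the interpreter has nothing to echo. The small tw frame and its values are hypothetical stand-ins for the asker's data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's tw dataframe (illustrative values only).
tw = pd.DataFrame({
    "topic_id": [0, 1],
    "word_prob_pair": [
        [("customer", 0.06), ("team", 0.02)],
        [("team", 0.03), ("product", 0.01)],
    ],
})

# Zero-initialized topic-by-word frame, one column per distinct word.
words = sorted({word for pairs in tw["word_prob_pair"] for word, _ in pairs})
topic_ = pd.DataFrame(np.zeros((len(tw), len(words))), columns=words)

for each_topic in range(len(tw)):
    for word, prob in tw["word_prob_pair"].iloc[each_topic]:
        # Scalar label-based assignment; returns nothing, so nothing prints.
        topic_.at[each_topic, word] = prob
```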
dmdip