The basic issue is that randomly chosen points are not necessarily linearly separable.
However, there is a worse problem in the algorithm:
Even if you go by a good reference like Vapnik's "Statistical Learning Theory", you will not be told the biggest problem in the OP's algorithm. The issue is NOT the learning rate parameter, either. It is trivial to prove that the learning rate has no effect on whether or not the algorithm converges, because it is simply a positive scalar: starting from zero weights, training with rate η produces exactly η times the weight vector produced with rate 1, so every prediction (and hence every update) is identical.
Imagine a set of four points to classify: the corners of a square, each a vector of length sqrt(2). The "in class" point is (-1,1). The "out of class" points are (-1,-1), (1,1), and (1,-1). Regardless of any learning rate, the OP's original algorithm will never converge on this set: the out-of-class points (-1,-1) and (1,1) are negatives of each other, so no weight vector through the origin can put both on the negative side.
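To make the failure concrete, here is a minimal perceptron sketch (my own code, not the OP's, using the standard mistake condition y·(w·x) ≤ 0) run on those four points, with and without the bias term:

```python
# Minimal perceptron sketch. Returns 0 if a full clean pass occurs
# (i.e., the algorithm converged), otherwise the mistake count of the
# last epoch. The learning rate is arbitrary, per the argument above.

def train(points, add_bias, epochs=1000, lr=0.1):
    w = [0.0] * (3 if add_bias else 2)
    mistakes = 0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in points:
            x = (x1, x2, 1.0) if add_bias else (x1, x2)  # bias = weight on a constant 1 input
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                mistakes += 1
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        if mistakes == 0:  # converged: a full pass with no updates
            return 0
    return mistakes

# The four square corners: (-1, 1) is "in class", the rest are not.
data = [((-1, 1), +1), ((-1, -1), -1), ((1, 1), -1), ((1, -1), -1)]

print(train(data, add_bias=False))  # > 0: never converges, no separator through the origin
print(train(data, add_bias=True))   # 0: with the bias term the set is separable
```

Augmenting each input with a constant 1 and learning its weight is the usual way to implement the bias; the same code then covers both cases.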
The reason the original poster's algorithm fails to converge is the missing bias term (effectively the coefficient of the constant, 0th-dimensional input), which MUST augment the other dimensional terms. Without the bias term, the perceptron is not completely defined: its decision hyperplane is forced through the origin. This is trivial to verify by hand-modeling a perceptron in 1D or 2D space; in 1D, a bias-free perceptron can only place its threshold at the origin, so the points x = 1 (out of class) and x = 2 (in class) are already unlearnable.
The bias term is often hand-waved in the literature as a way to "shift" the hyperplane, but that framing seems to come mostly from the fact that perceptrons tend to be taught and learned in 2D space. "Shifting" doesn't adequately explain why the bias term is needed in high-dimensional spaces (what does it mean to "shift" a 100-dimensional hyperplane?).
One may notice that the literature proving the mean convergence time of the perceptron excludes the bias term. This is because the perceptron equation can be simplified if you assume the perceptron will converge (see Vapnik, 1998, Wiley, p. 377). That is a big (but necessary) assumption for the proof; it does not license an incomplete implementation.
Albert B. J. Novikoff's 1962/1963 proofs of perceptron convergence do include this zero-dimensional (bias) term.