I probably understand why people down-vote this question: because there are hardly any good answers from years of related posts, and the problem is painfully complex. More to the point, though:
Unfortunately image information dimensionality is higher than 2D. The photo you link is just a projection from high-dimensional space to a plane, and not necessarily representative of how the actual space looks like. This specific projection is mostly about colors, isn't it?
The same applies to trying to squeeze clusters into folders - in most cases it is impossible.
Yet you are correctly pointing to specific comparison dimensions in your question:
how can I segregate (on 2D plane, into groups, folders, whatever) images based on their colors and shape properties?
The solution is to focus on the similarity dimension/metric of your interest. E.g. specifically "does this image contain a circle?", and optimize for this. But if you want a "square", you are already in another dimension. If optimizing for color, you can look at "overall redness". The more metrics you add, the higher is your clustering dimensionality.
Our perception is like this. We aim at specific summary metric, e.g. a scalar value, which is a sum of weighted metrics in different dimensions (ranking problem). For example, if you want photos with "eyes", you do not care about color variations. And vice versa, if you care more about colors, shapes are less important.
From my experience, clustering is easier when image clusters are relatively uniform. E.g. pictures in each potential cluster are very similar by some metric. For example one group is "bridges", another "faces". If you have very diverse images of any possible subject (even noise), the solution is intractable, unless you specify what exactly you want to group by.