
This question was edited after the answers were posted, to show the final solution I used.

I have unstructured 2D datasets coming from different sources, for example: Example data 1: 3D measurements; Example data 2: 2D mesh nodes. These datasets are three numpy.ndarray (X and Y coordinates and Z values).

My final aim is to interpolate these data on a grid for conversion to an image/matrix. So, I need to find the "best grid" to interpolate these data onto, and for this I need to find the best X and Y step between pixels of that grid.

Determine the step based on the Euclidean distance between points:

Use the mean of the Euclidean distances between each point and its nearest neighbour.

  • Use KDTree/cKDTree from scipy.spatial to build a tree of the X,Y data.
  • Use the query method with k=2 to get the distances (with k=1, the distances are all zero because the query for each point finds the point itself).


    # Generate KD Tree
    xy = np.c_[x, y]  # X,Y data stacked into an (N, 2) array for the KDTree
    tree = scipy.spatial.cKDTree(xy)  # Create KDTree for the X,Y coordinates

    # Calculate step
    distances, points = tree.query(xy, k=2)  # Query the two nearest neighbours of each point
    distances = distances[:, 1:]  # Drop the k=1 column of zero self-distances
    step = np.mean(distances)  # Result

Performance tweaking:

  • Use scipy.spatial.cKDTree and not scipy.spatial.KDTree, because it is much faster.
  • Use balanced_tree=False with scipy.spatial.cKDTree: a big speed-up in my case, though it may not help for all data.
  • Use n_jobs=-1 with cKDTree.query to enable multithreading (the parameter is named workers in recent SciPy versions).
  • Use p=1 with cKDTree.query to use the Manhattan distance in place of the Euclidean distance (p=2): faster, but possibly less accurate.
  • Query the distances for only a random subsample of the points: a big speed-up with large datasets, but less accurate and less repeatable.
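For illustration, the tweaks above might be combined like this (the point count, subsample size, and seed are arbitrary choices; recent SciPy names the multithreading parameter `workers` rather than `n_jobs`):

```python
import numpy as np
import scipy.spatial

# Hypothetical example data: 10,000 random points in the unit square
rng = np.random.default_rng(0)
xy = rng.random((10_000, 2))

# balanced_tree=False: faster build in some cases (may not hold for all data)
tree = scipy.spatial.cKDTree(xy, balanced_tree=False)

# Query only a random subsample of the points
n_sub = 1_000
sample = xy[rng.choice(len(xy), size=n_sub, replace=False)]

# p=1: Manhattan distance; workers=-1: multithreaded query
# (older SciPy versions call this parameter n_jobs)
distances, _ = tree.query(sample, k=2, p=1, workers=-1)
step = np.mean(distances[:, 1])  # drop the zero self-distances
```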

Interpolate points on grid:

Interpolate the dataset points on a grid using the calculated step.



    # Generate grid
    def interval(axe):
        '''Return a numpy.linspace interval for the specified axis'''
        cent = axe.min() + np.ptp(axe) / 2  # Interval center (ndarray.ptp was removed in NumPy 2.0)
        nbs = int(np.ceil(np.ptp(axe) / step))  # Number of steps in the interval (linspace needs an int)
        hwid = nbs * step / 2  # Half interval width
        return np.linspace(cent - hwid, cent + hwid, nbs)  # linspace

    xg, yg = np.meshgrid(interval(x), interval(y))  # Generate grid

    # Interpolate X,Y,Z data on grid
    zg = scipy.interpolate.griddata((x, y), z, (xg, yg))

Set NaN for pixels too far from the initial points:

Set pixels of the grid to NaN when they are too far (distance > step) from the points of the initial X,Y,Z data. The previously generated KDTree is reused.



    # Calculate pixel to X,Y,Z data distances
    dist, _ = tree.query(np.c_[xg.ravel(), yg.ravel()])
    dist = dist.reshape(xg.shape)

    # Set NaN value for too far pixels
    zg[dist > step] = np.nan
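Putting the steps above together, a self-contained sketch on synthetic data (the point count and the test function are arbitrary choices) could look like:

```python
import numpy as np
import scipy.spatial
import scipy.interpolate

# Synthetic scattered data (arbitrary test function)
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 2_000)
y = rng.uniform(0, 1, 2_000)
z = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)

# Step = mean Euclidean distance to each point's nearest neighbour
xy = np.c_[x, y]
tree = scipy.spatial.cKDTree(xy)
distances, _ = tree.query(xy, k=2)
step = np.mean(distances[:, 1])

# Grid centred on the data, spanning it with the calculated step
def interval(axe):
    '''Return a numpy.linspace interval for the specified axis'''
    cent = axe.min() + np.ptp(axe) / 2      # interval centre
    nbs = int(np.ceil(np.ptp(axe) / step))  # number of steps
    hwid = nbs * step / 2                   # half interval width
    return np.linspace(cent - hwid, cent + hwid, nbs)

xg, yg = np.meshgrid(interval(x), interval(y))

# Interpolate, then mask pixels farther than one step from any data point
zg = scipy.interpolate.griddata((x, y), z, (xg, yg))
dist, _ = tree.query(np.c_[xg.ravel(), yg.ravel()])
zg[dist.reshape(xg.shape) > step] = np.nan
```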

  • What was the problem with scipy's KDTree? – M4rtini Jan 15 '16 at 13:52
  • I tried to use it with the `query` method, but for each point the result is the point itself. The other methods don't seem useful in my case; they seem designed to work with two different coordinate sets. –  Jan 15 '16 at 14:15
  • Use `query` with k=2. The second point should then be the nearest neighbour. – M4rtini Jan 15 '16 at 14:22
  • Indeed, this works with k=2. Thanks for the tip. I'll need to read the documentation more carefully next time. –  Jan 16 '16 at 17:05
  • You added some very interesting pieces of code! See you around ;) – Antonio Ragagnin Jan 18 '16 at 20:54

2 Answers


The problem you want to solve is called the "all-nearest-neighbors problem". See this article for example: http://link.springer.com/article/10.1007/BF02187718

I believe solutions to this are O(N log N), the same order as KDTree.query, but in practice much, much faster than a bunch of separate queries. I'm sorry, I don't know of a Python implementation of this.
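That said, a single vectorized cKDTree.query over all points computes every point's nearest neighbour in one call, which can serve as a practical stand-in (a sketch on arbitrary synthetic points, not the specialized algorithm from the article):

```python
import numpy as np
import scipy.spatial

rng = np.random.default_rng(1)
pts = rng.random((5_000, 2))  # arbitrary synthetic points

tree = scipy.spatial.cKDTree(pts)
# k=2: column 0 is each point itself (distance 0), column 1 its nearest neighbour
dists, idx = tree.query(pts, k=2)
nn_dist = dists[:, 1]   # distance from each point to its nearest other point
nn_index = idx[:, 1]    # index of that neighbour
```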

Alex I
  • Thanks for giving me the real name of this problem; I'll be able to search for more information on it. –  Jan 16 '16 at 17:13

I suggest you go with KDTree.query.

You are searching for a characteristic distance to scale your binning: I suggest you take only a random subset of your points, and use the Manhattan distance, because KDTree.query is very slow (even though it has n·log(n) complexity).

Here is my code:

    import numpy
    import scipy.spatial

    # Create tree
    tree = scipy.spatial.KDTree(numpy.array(points))  # better give it a copy?
    # Create random subsample of points
    n_repr = 1000
    shuffled_points = numpy.array(points)
    numpy.random.shuffle(shuffled_points)
    shuffled_points = shuffled_points[:n_repr]
    # Query the tree (k=2 because the sampled points are in the tree themselves)
    (dists, neighbours) = tree.query(shuffled_points, k=2, p=1)
    # Get _estimate_ of average distance (column 0 holds the zero self-distances):
    avg_dists = numpy.average(dists[:, 1])
    print('average distance Manhattan with nearest neighbour is:', avg_dists)

I suggest you use the Manhattan distance ( https://en.wikipedia.org/wiki/Taxicab_geometry ) because it is faster to compute than the Euclidean distance, and since you only need an estimate of the average distance it should be sufficient.
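As a rough sanity check of that trade-off: in 2D the Manhattan distance is within a factor of √2 of the Euclidean distance, so the two estimates stay close (a sketch with arbitrary synthetic points):

```python
import numpy as np
import scipy.spatial

rng = np.random.default_rng(3)
pts = rng.random((2_000, 2))  # arbitrary synthetic points
tree = scipy.spatial.cKDTree(pts)

d_man, _ = tree.query(pts, k=2, p=1)  # Manhattan metric
d_euc, _ = tree.query(pts, k=2, p=2)  # Euclidean metric

# Ratio of the two mean nearest-neighbour distances;
# in 2D it is bounded between 1 and sqrt(2)
ratio = np.mean(d_man[:, 1]) / np.mean(d_euc[:, 1])
```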

Antonio Ragagnin