Updated to more scalable solution:
The previous answer (still included below) is not suited for large data sets. The reason is that we need to compute the distance for each pair of shops
and bus
. Therefore both the memory and computation scale as O(N*M)
for N
shops and M
buses. A more scalable solution uses a data structure such as a KD-Tree to perform nearest neighbor search for each shop. The advantage here is that the computational complexity becomes O(M*logM)
for building the KD-Tree for the bus stops and O(N*logM)
for searching the nearest neighbor for each shop.
To do this, we can use nn2
from the RANN
package. The complication here is that nn2
deals only with Euclidean distances and does not know anything about lat/long. Therefore, we need to convert the lat/long coordinates to some map projection (i.e. UTM) in order to use it correctly (i.e., in order to compute the Euclidean distance between shops and bus stops correctly).
Note: The following borrows heavily from Josh O'Brien's solutions for determining the UTM zone from a longitude and for converting lat/long to UTM, so he should take a bow.
## First define a function from Josh OBrien's answer to convert
## a longitude to its UTM zone
long2UTM <- function(long) {
(floor((long + 180)/6) %% 60) + 1
}
## Assuming that all points are within a zone (within 6 degrees in longitude),
## we use the first shop's longitude to get the zone.
z <- long2UTM(shops[1,"long"])
library(sp)
library(rgdal)
## convert the bus lat/long coordinates to UTM for the computed zone
## using the other Josh O'Brien linked answer
bus2 <- bus
coordinates(bus2) <- c("long", "lat")
proj4string(bus2) <- CRS("+proj=longlat +datum=WGS84")
bus.xy <- spTransform(bus2, CRS(paste0("+proj=utm +zone=",z," ellps=WGS84")))
## convert the shops lat/long coordinates to UTM for the computed zone
shops2 <- shops
coordinates(shops2) <- c("long", "lat")
proj4string(shops2) <- CRS("+proj=longlat +datum=WGS84")
shops.xy <- spTransform(shops2, CRS(paste0("+proj=utm +zone=",z," ellps=WGS84")))
library(RANN)
## find the nearest neighbor in bus.xy@coords for each shops.xy@coords
res <- nn2(bus.xy@coords, shops.xy@coords, 1)
## res$nn.dist is a vector of the distance to the nearest bus.xy@coords for each shops.xy@coords
## res$nn.idx is a vector of indices to bus.xy of the nearest bus.xy@coords for each shops.xy@coords
shops$Bus_Stop <- res$nn.dists <= 500
shops$Bus_ID <- ifelse(res$nn.dists <= 500, bus[res$nn.idx,"Bus_Stop_ID"], NA)
Although more complicated, this approach is much better suited for realistic problems where you may have large numbers of shops and bus stops. Using the same supplied data:
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
You can do this using the package geosphere
. Here, I'm assuming that your first data frame is named bus
, and your second data frame is named shops
:
library(geosphere)
g <- expand.grid(1:nrow(shops), 1:nrow(bus))
d <- matrix(distGeo(shops[g[,1],c("long","lat")], bus[g[,2],c("long","lat")]),
nrow=nrow(shops))
shops$Bus_Stop <- apply(d, 1, function(x) any(x <= 500))
shops$Bus_ID <- bus[apply(d, 1, function(x) {
c <-which(x <= 500)
if(length(c)==0) NA else c[1]
}), "Bus_Stop_ID"]
print(shops)
## Shop_ID lat long Bus_Stop Bus_ID
##1 1 -34.03935 18.61796 TRUE A
##2 2 -33.92782 18.41052 FALSE <NA>
Notes:
- We first use
expand.grid
to enumerate all pair combinations of shops
and bus
stops. These are ordered by shops
first.
- We then compute the distance matrix
d
using geosphere::distGeo
. Note here that the input expects (lon, lat) coordinates. distGeo
returns distances in meters. The resulting d
matrix is now(shops)
by now(bus)
so that each row gives the distance from a shop to each bus stop.
- We then see if there is a bus stop within 500 meters of each shop by applying the function
any(x <= 500)
for each row x
in d
using apply
with MARGIN=1
.
- Similarly, we can extract the column of
d
(corresponding to the row in bus
) for the first shop that is within 500 meters using which
instead of any
in our applied function. Then use this result to select the Bus_Stop_ID
from bus
.
By the way, we don't have to apply
the condition x <= 500
twice. The following will also work:
shops$Bus_ID <- bus[apply(d, 1, function(x) {
c <-which(x <= 500)
if(length(c)==0) NA else c[1]
}), "Bus_Stop_ID"]
shops$Bus_Stop <- !is.na(shops$Bus_ID)
and is more efficient.
Data:
bus <- structure(list(Bus_Stop_ID = structure(1:2, .Label = c("A", "B"
), class = "factor"), lat = c(-34.04199, -33.92312), long = c(18.61747,
18.44649)), .Names = c("Bus_Stop_ID", "lat", "long"), class = "data.frame", row.names = c(NA,
-2L))
shops <- structure(list(Shop_ID = 1:2, lat = c(-34.03935, -33.92782),
long = c(18.617964, 18.41052), Bus_ID = structure(c(1L, NA
), .Label = c("A", "B"), class = "factor"), Bus_Stop = c(TRUE,
FALSE)), .Names = c("Shop_ID", "lat", "long", "Bus_ID", "Bus_Stop"
), row.names = c(NA, -2L), class = "data.frame")