2

I have a list of co-ordinates of certain bus stops in this format

 Bus_Stop_ID     lat          long
 A               -34.04199    18.61747
 B               -33.92312    18.44649

I then have a list of certain shops

 Shop_ID     lat          long
 1            -34.039350  18.617964  
 2            -33.927820  18.410520 

I would like to check whether the shops fall within a 500 metre radius from the bus stop. Ultimately, the final dataset would look something like this where the Bus_Stop column indicates T/F and Bus_Stop_ID shows the relevant BUS ID(s) for that shop if Bus_Stop == T -

 Shop_ID     lat          long       Bus_Stop Bus_ID  
 1            -34.039350  18.617964  TRUE     A
 2            -33.927820  18.410520  FALSE    #NA

Does anyone have an idea about how I can go about this using R? I've seen the package geosphere but have struggled to understand it given my relative inexperience in the spatial domain. Any ideas or packages you could recommend? Thank you

Allan Davids
  • 115
  • 7

2 Answers2

4

Updated to more scalable solution:

The previous answer (still included below) is not suited for large data sets. The reason is that we need to compute the distance for each pair of shops and bus. Therefore both the memory and computation scale as O(N*M) for N shops and M buses. A more scalable solution uses a data structure such as a KD-Tree to perform nearest neighbor search for each shop. The advantage here is that the computational complexity becomes O(M*logM) for building the KD-Tree for the bus stops and O(N*logM) for searching the nearest neighbor for each shop.

To do this, we can use nn2 from the RANN package. The complication here is that nn2 deals only with Euclidean distances and does not know anything about lat/long. Therefore, we need to convert the lat/long coordinates to some map projection (i.e. UTM) in order to use it correctly (i.e., in order to compute the Euclidean distance between shops and bus stops correctly).

Note: The following borrows heavily from Josh O'Brien's solutions for determining the UTM zone from a longitude and for converting lat/long to UTM, so he should take a bow.

## First define a function from Josh OBrien's answer to convert
## a longitude to its UTM zone
long2UTM <- function(long) {
  (floor((long + 180)/6) %% 60) + 1
}

## Assuming that all points are within a zone (within 6 degrees in longitude),
## we use the first shop's longitude to get the zone.
z <- long2UTM(shops[1,"long"])

library(sp)
library(rgdal)

## convert the bus lat/long coordinates to UTM for the computed zone
## using the other Josh O'Brien linked answer
bus2 <- bus
coordinates(bus2) <- c("long", "lat")
proj4string(bus2) <- CRS("+proj=longlat +datum=WGS84")

bus.xy <- spTransform(bus2, CRS(paste0("+proj=utm +zone=",z," ellps=WGS84")))

## convert the shops lat/long coordinates to UTM for the computed zone
shops2 <- shops
coordinates(shops2) <- c("long", "lat")
proj4string(shops2) <- CRS("+proj=longlat +datum=WGS84")

shops.xy <- spTransform(shops2, CRS(paste0("+proj=utm +zone=",z," ellps=WGS84")))

library(RANN)

## find the nearest neighbor in bus.xy@coords for each shops.xy@coords
res <- nn2(bus.xy@coords, shops.xy@coords, 1)
## res$nn.dist is a vector of the distance to the nearest bus.xy@coords for each shops.xy@coords
## res$nn.idx is a vector of indices to bus.xy of the nearest bus.xy@coords for each shops.xy@coords
shops$Bus_Stop <- res$nn.dists <= 500
shops$Bus_ID <- ifelse(res$nn.dists <= 500, bus[res$nn.idx,"Bus_Stop_ID"], NA)

Although more complicated, this approach is much better suited for realistic problems where you may have large numbers of shops and bus stops. Using the same supplied data:

print(shops)
##  Shop_ID       lat     long Bus_Stop Bus_ID
##1       1 -34.03935 18.61796     TRUE      A
##2       2 -33.92782 18.41052    FALSE   <NA>

You can do this using the package geosphere. Here, I'm assuming that your first data frame is named bus, and your second data frame is named shops:

library(geosphere)
g <- expand.grid(1:nrow(shops), 1:nrow(bus))
d <- matrix(distGeo(shops[g[,1],c("long","lat")], bus[g[,2],c("long","lat")]),
            nrow=nrow(shops))
shops$Bus_Stop <- apply(d, 1, function(x) any(x <= 500))
shops$Bus_ID <- bus[apply(d, 1, function(x) {
                                  c <-which(x <= 500)
                                  if(length(c)==0) NA else c[1]
                                }), "Bus_Stop_ID"]
print(shops)
##  Shop_ID       lat     long Bus_Stop Bus_ID
##1       1 -34.03935 18.61796     TRUE      A
##2       2 -33.92782 18.41052    FALSE   <NA>

Notes:

  1. We first use expand.grid to enumerate all pair combinations of shops and bus stops. These are ordered by shops first.
  2. We then compute the distance matrix d using geosphere::distGeo. Note here that the input expects (lon, lat) coordinates. distGeo returns distances in meters. The resulting d matrix is now(shops) by now(bus) so that each row gives the distance from a shop to each bus stop.
  3. We then see if there is a bus stop within 500 meters of each shop by applying the function any(x <= 500) for each row x in d using apply with MARGIN=1.
  4. Similarly, we can extract the column of d (corresponding to the row in bus) for the first shop that is within 500 meters using which instead of any in our applied function. Then use this result to select the Bus_Stop_ID from bus.

By the way, we don't have to apply the condition x <= 500 twice. The following will also work:

shops$Bus_ID <- bus[apply(d, 1, function(x) {
                                  c <-which(x <= 500)
                                  if(length(c)==0) NA else c[1]
                                }), "Bus_Stop_ID"]
shops$Bus_Stop <- !is.na(shops$Bus_ID)

and is more efficient.

Data:

bus <- structure(list(Bus_Stop_ID = structure(1:2, .Label = c("A", "B"
), class = "factor"), lat = c(-34.04199, -33.92312), long = c(18.61747, 
18.44649)), .Names = c("Bus_Stop_ID", "lat", "long"), class = "data.frame",  row.names = c(NA, 
-2L))

shops <- structure(list(Shop_ID = 1:2, lat = c(-34.03935, -33.92782), 
long = c(18.617964, 18.41052), Bus_ID = structure(c(1L, NA
), .Label = c("A", "B"), class = "factor"), Bus_Stop = c(TRUE, 
FALSE)), .Names = c("Shop_ID", "lat", "long", "Bus_ID", "Bus_Stop"
), row.names = c(NA, -2L), class = "data.frame")
Community
  • 1
  • 1
aichao
  • 6,680
  • 3
  • 11
  • 16
1

My first approach would be to just use Euclidean distance and check whether the resulting value is greater or equal 0.

You could then use an IF clause and check T/F conditions.

I hope this helps.

PS: In my imagination, a distance of 500m would be a rather flat representation of the Earth's surface, so I don't think it's needed to use some geoid packages.

nocompiler
  • 11
  • 1
  • 2
    You're right about the earth being essentially flat over such small areas. Even in this situation, though, a nice thing about **geosphere** is that it will take lat/long inputs and return distances in meters. – Josh O'Brien Sep 12 '16 at 16:22
  • 2
    @JoshO'Brien is correct. In addition, a degree in longitude is not the same distance as degree of latitude except at the equator because the Earth is an ellipsoid instead of a rectangular box. Therefore, your distance in meters need to account for that. – aichao Sep 12 '16 at 16:46