1

I have a data frame as shown below. How can I replace the NA? Looks like for the same device, source and timestamp the lat and long are same as well. So, the NA values should be: aaa, US, 33.23, -117.31 for the 2nd row. Am I thinking the right way? If not, then what is the best way to replace those missing values and in a correct way?

city  country   lat    long        timestamp           source    device 
aaa     US    33.23  -117.31    2015-04-12 11:51:16     ad-f        w                  
<NA>   <NA>    NA      NA       2015-04-12 11:51:16     ad-f        w              
bbb     US    33.78  -117.96    2015-04-12 11:51:16     se-f        m                         
bbb     US    33.78  -117.96    2015-04-12 11:51:16     se-f        m                             
aaa     US    33.23  -117.31    2015-04-12 11:51:16     ad-f        w     
<NA>   <NA>    NA      NA       2015-04-04 17:38:26     se-f        m              
<NA>   <NA>    NA      NA       2015-04-04 17:38:26     ad-g        m              
ddd     US    33.99  -118.09    2015-04-04 17:38:26     ad-g        m              
eee     US    40.67  -73.94     2015-04-04 17:38:26     se-f        m              
<NA>   <NA>    NA      NA       2015-03-02 00:04:12     ad-g        w          

Thanks in advance.

I. Ara
  • 37
  • 1
  • 4

2 Answers2

3

Try fill, which is part of the tidyr package. (Like dplyr, it's part of the tidyverse suite of packages).

df <- df %>%
  group_by(device, source, timestamp) %>%
  fill(city, country, lat, long)
Melissa Key
  • 3,846
  • 7
  • 21
  • 1
    Good answer, but I guess one caveat that OP should know is that ```fill``` is conditional on *previous entry*, which could not always be the case if the data is not well arranged. I would do ```arrange``` first if ```fill``` is to be used. – Kim May 02 '18 at 00:27
  • 1
    added. good point @Kim! – Melissa Key May 02 '18 at 00:30
  • It's not working for my problem. I've to find out the correct lat, long etc. fill is based on previous entry. lat, long have to be picked from the group of timestamp, it should not take lat, long from previous/post slot of timestamp. – I. Ara May 02 '18 at 04:21
  • I edited my code - (and I should have done it this way from the beginning). Instead of arranging the data, we need to group by those items so that only entries with the appropriate device, source, and timestamp are considered when filling in the NA's. – Melissa Key May 02 '18 at 04:27
  • (you might consider an `arrange` statement too which moves the NAs to the end, if there are cases where the first entry is NA, but the data is available later - but I can't see your data to figure out all the cases you need to consider) – Melissa Key May 02 '18 at 04:29
  • I did it but still it's not working. First of all it takes so much time to execute. I got big data size, so it took more than 15 minutes. When I checked the data, half of them still NA. For lat and long it's truncating one digit after decimal point for the replacement of NA as well as the existing data. Anyway, I'm going to add more data if you have any suggestion. – I. Ara May 02 '18 at 20:21
1

From zoo

zoo::na.locf(df)
  city country   lat    long          timestamp source device
1  aaa      US 33.23 -117.31 2015-04-1211:51:16     ad      w
2  aaa      US 33.23 -117.31 2015-04-1211:51:16     ad      w
3  bbb      US 33.78 -117.96 2015-04-1211:51:16     se      m
4  bbb      US 33.78 -117.96 2015-04-1211:51:16     se      m
5  aaa      US 33.23 -117.31 2015-04-1211:51:16     ad      w
BENY
  • 258,262
  • 17
  • 121
  • 165