Reshape data to find number of missing record/value in each category based on reference variable in R

Question

I have census survey data at the household level. The data structure is like this: first household size is recorded and based on that date of birth, education level, work status, and other individual demographic information is been collected for each member.

Here is a dummy sample for household up to 4 members and only date of birth and working status (real data has household up to 10 members and 8 demos)

df <- tibble::tribble(
  ~id, ~House_member, ~dob_1, ~dob_2, ~dob_3, ~dob_4, ~work_1, ~work_2, ~work_3, ~work_4,
   1L,            4L,  1983L,  1980L,  2009L,     NA,      2L,        2L,      NA,      NA,
   2L,            1L,  1940L,     NA,     NA,     NA,      9L,        NA,      NA,      NA,
   3L,            2L,  1951L,  1951L,     NA,     NA,      9L,        9L,      NA,      NA,
   4L,            4L,  1965L,  1973L,  2002L,     NA,      2L,        2L,      8L,      2L,
   5L,            3L,  1965L,  1948L,  2006L,     NA,      2L,        9L,      NA,      NA,
   6L,            1L,  1951L,     NA,     NA,     NA,      9L,        NA,      NA,      NA,
   7L,            1L,  1955L,     NA,     NA,     NA,     10L,        NA,      NA,      NA,
   8L,            4L,  1982L,  1978L,  2008L,     NA,      2L,        2L,      NA,      NA,
   9L,            2L,  1990L,  1997L,     NA,     NA,      2L,        8L,      NA,      NA,
  10L,            2L,  1953L,  1957L,     NA,     NA,      2L,        2L,      NA,      NA
  )


df 
# A tibble: 10 x 10
      id House_member dob_1 dob_2 dob_3 dob_4 work_1 work_2 work_3 work_4
   <int>        <int> <int> <int> <int> <lgl>  <int>    <int>  <int> <lgl> 
 1     1            4  1983  1980  2009 NA         2        2     NA NA    
 2     2            1  1940    NA    NA NA         9       NA     NA NA    
 3     3            2  1951  1951    NA NA         9        9     NA NA    
 4     4            4  1965  1973  2002 NA         2        2      8 2    
 5     5            3  1965  1948  2006 NA         2        9     NA NA    
 6     6            1  1951    NA    NA NA         9       NA     NA NA    
 7     7            1  1955    NA    NA NA        10       NA     NA NA    
 8     8            4  1982  1978  2008 NA         2        2     NA NA    
 9     9            2  1990  1997    NA NA         2        8     NA NA    
10    10            2  1953  1957    NA NA         2        2     NA NA

I am looking for a way to find how many households for each category (date of birth, work status) has some missing member information and summarise/report it. I am not sure what is the best way to extract this kind of insight from this data structure

For example looking at sample data, id == 1 is 4 members household but dob_4 is missing (work_4 is missing for this HH) same issue for id = 4 and dob_4 etc.

I found this old post but this is exactly what I am looking for.

What is the reason for the `work_1_2` column? i.e. why is it not `work_2`??? — dario, Feb 12 '21 at 13:09
Also note that your data actually has data for `work_4` for `id = 4` but you have not shown it below. — Ronak Shah, Feb 12 '21 at 13:13

Ronak Shah · Answer 1 · 2021-02-12T13:55:41.167

Reshape the data in longer format. Drop the NA values by using values_drop_na = TRUE. Now we can find out the missing information by subtracting House_member value with number of rows in each id.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = matches('dob|work'), 
              names_to = c('.value', 'member_num'), 
              names_sep = '_',
              values_drop_na = TRUE) 
  group_by(id) %>%
  summarise(missing_info = first(House_member) - n())

#      id missing_info
# * <int>        <int>
# 1     1            1
# 2     2            0
# 3     3            0
# 4     4            0
# 5     5            0
# 6     6            0
# 7     7            0
# 8     8            1
# 9     9            0
#10    10            0

To get missing values for dob and work separately we can do :

df %>%
  pivot_longer(cols = matches('dob|work'), 
               names_to = c('.value', 'member_num'), 
               names_sep = '_', 
               values_drop_na = TRUE) %>%
  group_by(id) %>%
  summarise(across(c(dob, work), ~first(House_member) - n() + sum(is.na(.)), 
            .names = 'missing_{col}'))

#     id  missing_date missing_work
# * <int>        <int>        <int>
# 1     1            1            2
# 2     2            0            0
# 3     3            0            0
# 4     4            1            0
# 5     5            0            1
# 6     6            0            0
# 7     7            0            0
# 8     8            1            2
# 9     9            0            0
#10    10            0            0

Thanks @ronak with your approach at the end we only have missing_info per id. We have no clue if date of birth is missing or work etc unless we use it one by one ex: matches('dob') for each category.correct? — DanG, Feb 12 '21 at 13:19
Added an update to find out missing date or work for each `id` @DanielG . — Ronak Shah, Feb 12 '21 at 13:58

score 1 · Accepted Answer · answered Feb 12 '21 at 13:31

This extended variant of the dplyr + tidyr solution gives the id and person_id of the missing values:

df <- tibble::tribble(
  ~id, ~House_member, ~dob_1, ~dob_2, ~dob_3, ~dob_4, ~work_1, ~work_2, ~work_3, ~work_4,
  1L,            4L,  1983L,  1980L,  2009L,     NA,      2L,        2L,      NA,      NA,
  2L,            1L,  1940L,     NA,     NA,     NA,      9L,        NA,      NA,      NA,
  3L,            2L,  1951L,  1951L,     NA,     NA,      9L,        9L,      NA,      NA,
  4L,            4L,  1965L,  1973L,  2002L,     NA,      2L,        2L,      8L,      2L,
  5L,            3L,  1965L,  1948L,  2006L,     NA,      2L,        9L,      NA,      NA,
  6L,            1L,  1951L,     NA,     NA,     NA,      9L,        NA,      NA,      NA,
  7L,            1L,  1955L,     NA,     NA,     NA,     10L,        NA,      NA,      NA,
  8L,            4L,  1982L,  1978L,  2008L,     NA,      2L,        2L,      NA,      NA,
  9L,            2L,  1990L,  1997L,     NA,     NA,      2L,        8L,      NA,      NA,
  10L,            2L,  1953L,  1957L,     NA,     NA,      2L,        2L,      NA,      NA
)



library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(cols = matches("dob|work"),
               names_to = c("name", "person"),
               names_sep = "_") %>% 
  group_by(id, name) %>% 
  filter(person <= max(House_member)) %>% 
  filter(is.na(value)) %>% 
  select(id, name, person) %>% 
  arrange(id, name)

Returns:

# A tibble: 8 x 3
# Groups:   id, name [6]
  id name  person
  <int> <chr> <chr> 
1     1 dob   4     
2     1 work  3     
3     1 work  4     
4     4 dob   4     
5     5 work  3     
6     8 dob   4     
7     8 work  3     
8     8 work  4

Reshape data to find number of missing record/value in each category based on reference variable in R

2 Answers2