
I am trying to read a text file (https://www.bls.gov/bdm/us_age_naics_00_table5.txt) into R, but I am not sure how to parse it. As you can see, the column names (years) are not all on the same row, and the spacing between values is not consistent from column to column. I am familiar with read.csv() and read.delim(), but I'm not sure how to read a file with a layout like this one.

Killbill

1 Answer


Here is a manual parse:

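library(readr)
string = read_lines(file = "https://www.bls.gov/bdm/us_age_naics_00_table5.txt")
string = string[nchar(string) != 0]  # drop empty lines
string = string[-c(1,2)]  # don't contain information
string = string[string != " "]  # drop lines that hold only a single space
string = string[-151]     # footnote
sMatrix = matrix(string, nrow = 30)  # filled column-wise: each column is one 30-line block of the table
# collapse each block into one newline-separated string so readr reads it as literal data, not as file paths
dfList = lapply(1:ncol(sMatrix), function(x) readr::read_table(paste(sMatrix[, x], collapse = "\n")))
df = do.call(cbind, dfList)
df = df[, !duplicated(colnames(df))] # removes columns with duplicate names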
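
The same block-splitting idea can be tried offline on a toy input. The sketch below is only an illustration: the six lines and the block size of 3 are invented, not taken from the BLS file, and it swaps in base R's read.table(text = ...) for readr::read_table so that it runs without extra packages.

```r
# Made-up miniature of a table whose year columns are split into stacked blocks.
# The layout and values are invented for illustration; they are not BLS data.
lines <- c(
  "Age    1994  1995",
  "1yr      10    20",
  "2yr      30    40",
  "Age    1996  1997",
  "1yr      50    60",
  "2yr      70    80"
)

block_size <- 3  # assumed: one header line plus two data rows per block

# Fill column-wise so that each matrix column holds one complete block.
blocks <- matrix(lines, nrow = block_size)

# Parse each block from text; passing the lines as `file` would be read as a path.
parsed <- lapply(seq_len(ncol(blocks)), function(j) {
  read.table(text = blocks[, j], header = TRUE, check.names = FALSE,
             stringsAsFactors = FALSE)
})

# Bind the blocks side by side, then drop the repeated label column.
wide <- do.call(cbind, parsed)
wide <- wide[, !duplicated(colnames(wide))]
print(wide)
```

This only works when every block has the same number of lines, which is also what the hard-coded nrow = 30 above relies on.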

If you then want to recode "_" as NA, and format the numbers:

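df[df == "_"] = NA
df = as.data.frame(sapply(df, function(x) gsub(",", "", x)))  # strip thousands separators
# TRUE for columns that convert to numeric without producing any NAs (column 1, for example, does not)
i <- sapply(df, function(x) !any(is.na(suppressWarnings(as.numeric(na.omit(x))))))
df[i] = lapply(df[i], as.numeric)  # df[i] stays a data frame even if only one column qualifies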
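
To see what the cleaning step does without downloading anything, here is a minimal sketch on a small data frame. The column names and values are invented for illustration, and the variable names raw and is_num are hypothetical; it is one possible way to write the same recoding, not the only one.

```r
# Invented sample values; "_" marks a missing cell, as in the recoding step above.
raw <- data.frame(
  label = c("Total", "Less than 1 year", "1 year"),
  y1994 = c("1,234", "_", "56"),
  y1995 = c("7,890", "12", "_"),
  stringsAsFactors = FALSE
)

# Recode the placeholder as a real missing value.
raw[raw == "_"] <- NA

# Remove thousands separators column by column, keeping a data frame throughout.
raw[] <- lapply(raw, function(x) gsub(",", "", x, fixed = TRUE))

# A column counts as numeric if every non-missing entry survives as.numeric().
is_num <- vapply(raw, function(x) {
  !anyNA(suppressWarnings(as.numeric(x[!is.na(x)])))
}, logical(1))

# Convert only the qualifying columns; the label column stays character.
raw[is_num] <- lapply(raw[is_num], as.numeric)
str(raw)
```

Checking convertibility before converting keeps text columns such as the row labels from being turned into all-NA numeric columns.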
VitaminB16
  • I get an error at line 7 of your code: `Error in file(file, "rt") : invalid 'description' argument`. Traceback: 5. file(file, "rt"); 4. read.table(paste(sMatrix[, x])); 3. FUN(X[[i]], ...); 2. lapply(X = X, FUN = FUN, ...); 1. sapply(1:ncol(sMatrix), function(x) read.table(paste(sMatrix[, x]))) – Anthony Colavito May 28 '21 at 14:47
  • Sorry. You'll need to install the `readr` package. I've updated the code. – VitaminB16 May 28 '21 at 18:45