I have a vector of IDs that I need to split into subfields. The lengths of the subfields are constant, which I hope will make things straightforward. Currently the ID field looks like this:

ID
0100001000
0100002000
0100003000
0100004000
0100005000
0100006000
0100007000
0100008000
0100009000
0100010000

and I need to split it into sub ID fields like so:

06  00546   000
12  00387   000
21  02437   000
01  06419   000
17  03892   000
17  00010   000
13  02199   000
17  00706   000
05  03358   000
05  03892   000

These values are just examples of the format, not the content, i.e. the example above only shows that I need to take a string of the form xxxxxxxxxx and turn it into xx xxxxx xxx. Please ignore the values.

I'm looking for a solution I can implement in R. I have the feeling I need to use regular expressions for this, but I need a nudge in the right direction.

Connor M

2 Answers

One option is

library(tidyr)
extract(df1, 'ID', into=c('ID1', 'ID2', 'ID3'), '(.{2})(.{5})(.{3})')
#    ID1   ID2 ID3
# 1   01 00001 000
# 2   01 00002 000
# 3   01 00003 000
# 4   01 00004 000
# 5   01 00005 000
# 6   01 00006 000
# 7   01 00007 000
# 8   01 00008 000
# 9   01 00009 000
#10   01 00010 000
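
To make the pattern above easier to follow: each parenthesised group captures a fixed number of characters (2, then 5, then 3). The same three groups can be tried in base R with `sub` and backreferences, which yields the `xx xxxxx xxx` layout as a single string. This is only a minimal sketch; the vector name `ids` is made up for illustration and is not part of the original data.

```r
# Hypothetical sample IDs, 10 characters each
ids <- c("0100001000", "0100002000", "0100010000")

# Three capture groups of fixed width; \\1, \\2 and \\3 refer back to them
spaced <- sub("^(.{2})(.{5})(.{3})$", "\\1 \\2 \\3", ids)
print(spaced)
```

Strings that are not exactly 10 characters long do not match the anchored pattern and are returned unchanged, so it may be worth checking `nchar(ids)` first.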

Or read the file using read.fwf with specified widths.

read.fwf('file.txt', widths=c(2,5,3), skip=1, #skip to remove the ID row
             header=FALSE,colClasses=rep('character',3))
#   V1    V2  V3
#1  01 00001 000
#2  01 00002 000
#3  01 00003 000
#4  01 00004 000
#5  01 00005 000
#6  01 00006 000
#7  01 00007 000
#8  01 00008 000
#9  01 00009 000
#10 01 00010 000
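
If the IDs are already in memory rather than in a file, the fixed widths can also be used directly with base R's `substr`, without any regular expression. A minimal sketch, assuming the IDs are stored as character strings; `ids`, `out` and the column names are placeholders chosen for this example.

```r
# Hypothetical sample IDs; they must be character, otherwise the
# leading zeros would already be lost
ids <- c("0100001000", "0100002000", "0100010000")

# Widths 2, 5 and 3 correspond to character positions 1-2, 3-7 and 8-10
out <- data.frame(
  ID1 = substr(ids, 1, 2),
  ID2 = substr(ids, 3, 7),
  ID3 = substr(ids, 8, 10),
  stringsAsFactors = FALSE
)
print(out)
```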
akrun

You could also do it like this.

> df <- data.frame(ID=c("0100001000", "0100002000", "0100003000"), stringsAsFactors=FALSE)
> df
          ID
1 0100001000
2 0100002000
3 0100003000
> as.data.frame(do.call(rbind, regmatches(df$ID, gregexpr("^\\d{2}|(?<=^\\d{2})\\d{5}|\\d{3}$", df$ID, perl=TRUE))))
  V1    V2  V3
1 01 00001 000
2 01 00002 000
3 01 00003 000

OR

> library(stringi)
> as.data.frame(do.call(rbind, stri_split(as.character(df$ID), regex="(?<=^\\d{2})|(?=\\d{3}$)")))
  V1    V2  V3
1 01 00001 000
2 01 00002 000
3 01 00003 000
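
Another base R possibility, offered only as a sketch: `strcapture` from the `utils` package (available since R 3.4.0) takes a pattern with capture groups and a prototype data frame, and returns one column per group. The column names `ID1`, `ID2`, `ID3` and the object names `proto` and `out` are assumptions made for this example.

```r
# Hypothetical sample IDs stored as character strings
ids <- c("0100001000", "0100002000", "0100003000")

# The prototype gives the names and types of the resulting columns
proto <- data.frame(ID1 = character(), ID2 = character(), ID3 = character(),
                    stringsAsFactors = FALSE)

# One capture group per subfield: 2, 5 and 3 digits
out <- utils::strcapture("^([0-9]{2})([0-9]{5})([0-9]{3})$", ids, proto)
print(out)
```

IDs that do not match the pattern produce a row of `NA` values, which makes malformed entries easy to spot.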
Avinash Raj