Splitting Long integer (ID field) into sub fields using R

Question

I have a vector of IDs which I need to split into subfields. The lengths of the subfields are constant, which I hope will make things straightforward. Currently the ID field looks like this:

and I need to split it into sub ID fields like so:

06  00546   000
12  00387   000
21  02437   000
01  06419   000
17  03892   000
17  00010   000
13  02199   000
17  00706   000
05  03358   000
05  03892   000

These values are just examples of the format, not the content, i.e. the example above just shows that I need to take a string of xxxxxxxxxx and turn it into xx xxxxx xxx. Please ignore the values.

I'm looking for a solution I can implement in R, and I have the feeling I need to be using regular expressions for this, but I need a nudge in the right direction.

I'm wondering how `0100001000` was split into `06 00546 000` — Avinash Raj, Mar 25 '15 at 14:02
Does this help? http://stackoverflow.com/questions/2247045/chopping-a-string-into-a-vector-of-fixed-width-character-elements — Sam Firke, Mar 25 '15 at 14:08
@AvinashRaj It wasn't; this is just an example of the format, not the content — Connor M, Mar 25 '15 at 14:32

akrun · Accepted Answer · 2015-03-25T14:46:39.647

One option is

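library(tidyr)
# df1 is assumed to be a data frame with a character column 'ID'
# each (...) group captures one fixed-width subfield: 2, 5 and 3 characters
extract(df1, 'ID', into=c('ID1', 'ID2', 'ID3'), regex='(.{2})(.{5})(.{3})')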
#    ID1   ID2 ID3
# 1   01 00001 000
# 2   01 00002 000
# 3   01 00003 000
# 4   01 00004 000
# 5   01 00005 000
# 6   01 00006 000
# 7   01 00007 000
# 8   01 00008 000
# 9   01 00009 000
#10   01 00010 000
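
For readers who want to see what the capture groups do without installing tidyr, here is a minimal base R sketch of the same pattern. The sample `df1` is hypothetical (the original data frame is not shown in the thread), and the anchors `^` and `$` are an addition so that only IDs of exactly ten characters match.

```r
# Hypothetical sample data; the real df1 is not shown in the thread
df1 <- data.frame(ID = c("0100001000", "0100002000", "0100003000"),
                  stringsAsFactors = FALSE)

# Three capture groups of fixed width: 2, 5 and 3 characters
pattern <- "^(.{2})(.{5})(.{3})$"

# regexec() reports the full match plus one entry per capture group
m <- regmatches(df1$ID, regexec(pattern, df1$ID))

# Drop the full match (first element) and stack the groups row by row
parts <- do.call(rbind, lapply(m, function(x) x[-1]))

res <- data.frame(ID1 = parts[, 1], ID2 = parts[, 2], ID3 = parts[, 3],
                  stringsAsFactors = FALSE)
res
```

The columns stay character, so leading zeros such as `00001` are preserved.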

Or read the file using read.fwf with specified widths.

read.fwf('file.txt', widths=c(2,5,3), skip=1, #skip to remove the ID row
             header=FALSE,colClasses=rep('character',3))
#   V1    V2  V3
#1  01 00001 000
#2  01 00002 000
#3  01 00003 000
#4  01 00004 000
#5  01 00005 000
#6  01 00006 000
#7  01 00007 000
#8  01 00008 000
#9  01 00009 000
#10 01 00010 000
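
The `read.fwf` call can be tried without an existing `file.txt` by writing a few lines to a temporary file first. This is only a sketch: the file contents below are made up to match the layout implied above (a header line `ID` followed by ten-character IDs).

```r
# Hypothetical file contents: a header line followed by 10-character IDs
tmp <- tempfile(fileext = ".txt")
writeLines(c("ID", "0100001000", "0100002000", "0100003000"), tmp)

# widths gives the number of characters in each subfield;
# skip = 1 drops the header line; colClasses keeps the fields as text
fw <- read.fwf(tmp, widths = c(2, 5, 3), skip = 1,
               header = FALSE, colClasses = rep("character", 3))
fw

unlink(tmp)  # remove the temporary file
```

Setting `colClasses` to character matters here: if the columns were parsed as integers, the leading zeros in each subfield would be dropped.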

@AvinashRaj Thanks, I was about to make that correction. – akrun Mar 25 '15 at 14:46

Avinash Raj · Answer 2 · 2015-03-25T15:11:03.703

You could also do it like this.

> df <- data.frame(ID=c("0100001000", "0100002000", "0100003000"))
> df
          ID
1 0100001000
2 0100002000
3 0100003000
> as.data.frame(do.call(rbind, regmatches(df$ID, gregexpr("^\\d{2}|(?<=^\\d{2})\\d{5}|\\d{3}$", df$ID, perl=TRUE))))
  V1    V2  V3
1 01 00001 000
2 01 00002 000
3 01 00003 000

OR

> library(stringi)
> as.data.frame(do.call(rbind, stri_split(as.character(df$ID), regex="(?<=^\\d{2})|(?=\\d{3}$)")))
  V1    V2  V3
1 01 00001 000
2 01 00002 000
3 01 00003 000
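
Since the subfield widths are constant (2, 5 and 3 characters), the pieces can also be cut out by position instead of by regex. The following is a minimal sketch using base R's `substring`; the vector `ids` and the column names are illustrative, not taken from the question.

```r
# Illustrative IDs in the same 10-character format
ids <- c("0100001000", "0100002000", "0100003000")

# Character positions follow from the fixed widths 2, 5 and 3
res <- data.frame(ID1 = substring(ids, 1, 2),
                  ID2 = substring(ids, 3, 7),
                  ID3 = substring(ids, 8, 10),
                  stringsAsFactors = FALSE)
res
```

This assumes the IDs are stored as character strings. If they are stored as numbers, the leading zero is already lost and would have to be padded back before splitting.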

edited Mar 25 '15 at 15:11

answered Mar 25 '15 at 14:54

I can't do this through `strsplit`, but `stri_split` does the job. – Avinash Raj Mar 25 '15 at 15:10
it's ok, no prob.. :-) – Avinash Raj Mar 25 '15 at 15:59
