37

I've looked through a number of posts about ordering factors, but haven't quite found a match for my problem. Unfortunately, my knowledge of R is still pretty rudimentary.

I have a subset of an archaeological artifact catalog that I'm working with. I'm trying to cross-tabulate diagnostic historical artifact types and site testing locations. Easy enough with ddply or tapply.

My problem is that I want to sort the artifact types (a factor) by their mean diagnostic date (number/year), and I keep getting them alphabetically. I know I need to make it an ordered factor, but can't figure out how to order it by the year value in the other column.

IDENTIFY                                      MIDDATE
engine-turned fine red stoneware              1769
white salt-glazed stoneware, scratch blue     1760
wrought nail, 'L' head                        1760
yellow lead-glazed buff earthenware           1732
...

Which needs to be ordered:

IDENTIFY                                      MIDDATE
yellow lead-glazed buff earthenware           1732
white salt-glazed stoneware, scratch blue     1760
wrought nail, 'L' head                        1760
engine-turned fine red stoneware              1769
...

The factor (IDENTIFY) needs to be ordered by the Date (MIDDATE). I thought I had it with

Catalog$IDENTIFY<-factor(Catalog$IDENTIFY,levels=Catalog$MIDDATE,ordered=TRUE)

But get the warning:

In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) 
else paste0(labels,: duplicated levels will not be allowed 
in factors anymore

IDENTIFY has ~130 factor levels and many have the same value for MIDDATE, so I need to order IDENTIFY by MIDDATE and another column TYPENAME.

A little more detail:

I have a dataframe Catalog, which breaks down (i.e. str(Catalog)) as:

> str(Catalog)
'data.frame':   2211 obs. of  15 variables:
 $ TRENCH  : Factor w/ 7 levels "DRT 1","DRT 2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ U_TYPE  : Factor w/ 3 levels "EU","INC","STP": 1 1 1 1 1 1 1 1 1 1 ...
 $ U_LBL   : Factor w/ 165 levels "001","005","007",..: 72 72 72 72 72 72 ...
 $ L_STRAT : Factor w/ 217 levels "#2-7/25","[3]",..: 4 4 4 4 4 4 89 89 89 89 ...
 $ START   : num  0 0 0 0 0 0 39.4 39.4 39.4 39.4 ...
 $ END     : num  39.4 39.4 39.4 39.4 39.4 39.4 43.2 43.2 43.2 43.2 ...
 $ Qty     : int  1 1 3 5 1 1 6 8 1 1 ...
 $ MATNAME : Factor w/ 6 levels "Ceramics","Chipped Stone",..: 1 1 1 5 5 6 ...
 $ TYPENAME: Factor w/ 9 levels "Architectural Hardware",..: 9 9 9 1 1 3 9 ...
 $ CATNAME : Factor w/ 32 levels "Biface","Bottle Glass",..: 24 29 29 6 24 ...
 $ IDENTIFY: Factor w/ 112 levels "amethyst bottle glass",..: 17 91 96 71 103 ...
 $ BEGDATE : int  1820 1820 1830 1835 1700 NA 1670 1762 1800 1720 ...
 $ ENDDATE : int  1900 1970 1860 1875 1820 NA 1795 1820 1820 1780 ...
 $ OCC_LBL : Ord.factor w/ 5 levels "Late 19th Century"<..: 2 1 2 2 4 5 4 3 ...
 $ MIDDATE : num  1860 1895 1845 1855 1760 ...

I need to make IDENTIFY an ordered factor, and reorder by MIDDATE -> TYPENAME -> alpha by IDENTIFY.

What I'm not really getting is how to reorder by combined orders from multiple columns.

I would just do this in the db, but a lot of what I'm running are weighted means within all sorts of cross-tabulations (e.g. weighted mean depths below ground surface for artifact classes by location)...

...doable in Access, but messy, and unpredictable. Far easier and cleaner to manage in R, but I'd rather not to have to manually re-sort the resulting tables.

What I'm trying to produce is a number of things along these lines:

>xtab.Catalog<-tapply(Catalog$Qty,list(Catalog$IDENTIFY,Catalog$TRENCH),sum)

IDENTIFY                        DRT1    DRT2    DRT3    DRT4    DRT5    DRT6
Staffordshire stoneware         4       NA      NA      NA      NA      NA  
undecorated delftware           6       4       NA      NA      NA      NA  
unidentified wrought nail       15      9       3       1       3       NA  
white salt-glazed stoneware     6       1       1       NA      2       1   
white salt-glazed scratch blue  1       NA      NA      NA      NA      NA  
white stoneware, slip-dipped    NA      NA      NA      NA      NA      NA  
wrought nail, 'L' head          2       NA      NA      NA      NA      NA  
wrought nail, 'rose' head       62      21      4       NA      1       1   
wrought nail, 'T' head          2       NA      1       NA      NA      1   
yellow lead-glazed              12      NA      NA      NA      1       3   
...

...but I need them to sort in the logical (i.e. chronological/type) order instead of alphabetical.

double-beep
  • 3,889
  • 12
  • 24
  • 35
Scard
  • 733
  • 2
  • 7
  • 14
  • 1
    Have a look at the function `reorder()` in the stats package (i.e. base R). – Andrie May 25 '12 at 16:46
  • Tried `reorder(Catalog$IDENTIFY,Catalog$MIDDATE)` but got the same warning about duplicate levels and IDENTIFY all turned to `NA`'s... – Scard May 25 '12 at 18:40
  • 1
    It's a bit hard to help you without reproducible code. Can you try and make a reproducible example, and I shall have a look? – Andrie May 25 '12 at 18:59
  • Just tacked up more detail on what I'm doing... apologies for my being only semi-literate in R. Had been using SPSS for these types of things, but our office stopped updating it (stuck with a buggy v11). – Scard May 25 '12 at 19:03
  • 1
    A minimal reproducible example contains code anyone can run on their machine and get the same error / warning. See http://stackoverflow.com/q/5963269/602276 for more detail. Anyway, you are getting a warning, not an error, so evaluate your results to see whether it isn't in fact ok as it is. – Andrie May 25 '12 at 19:05
  • Along with the warning, the whole `IDENTIFY` column gets blanked as `NA` with the method I tried before, too. I'm guessing that's because MIDDATE values aren't entirely unique? That's why I need to figure out ordering across the three columns, though. It may be that I'm just not grasping how R handles "fields" that way... I'm more a db guy than a programmer. – Scard May 25 '12 at 19:24
  • The values in Catalog$MIDDATE are not appropriate for factor levels in IDENTIFY. You want to sort as.character(IDENTIFY), but use the given criteria for the comparison. In other languages, I would pass an appropriate compare function in the call to sort(), but I do not know how to do that in R. – Matthew Lundberg May 26 '12 at 00:24
  • Looking at ?reorder, it does take a function FUN, but sorts the data such that FUN(data) is non-decreasing. I am familiar with sorts using compare routines that return 0, +ve or -ve to indicate order only. However, here, you can create a function that returns values which can be sorted, resulting in the desired order. Much less nice than presenting a compare function. Is there a better alternative? – Matthew Lundberg May 26 '12 at 00:32
  • Relevant blog entry: https://www.r-bloggers.com/reorder-factor-levels/ – mitchus Jan 09 '19 at 11:01

3 Answers3

44

Here's a reproducible sample, with solution:

set.seed(0)
a = sample(1:20,replace=F)
b = sample(1:20,replace=F)
f = as.factor(letters[1:20])

> a
 [1] 18  6  7 10 15  4 13 14  8 20  1  2  9  5  3 16 12 19 11 17
> b
 [1] 16 18  4 12  3  5  6  1 15 10 19 17  9 11  2  8 20  7 13 14
> f
 [1] a b c d e f g h i j k l m n o p q r s t
Levels: a b c d e f g h i j k l m n o p q r s t

Now for the new factor:

fn = factor(f, levels=unique(f[order(a,b,f)]), ordered=TRUE)

> fn
 [1] a b c d e f g h i j k l m n o p q r s t
20 Levels: k < l < o < f < n < b < c < i < m < d < s < q < g < h < e < ... < j

Sorted on 'a', next 'b' and finally 'f' itself (although in this example, 'a' has no repeated values).

JelenaČuklina
  • 2,839
  • 2
  • 19
  • 32
Matthew Lundberg
  • 39,899
  • 6
  • 81
  • 105
  • Thank you. I'll give this a try and see how it goes. I suspect I'll need to make a separate ordered factor vector out of MIDDATE, but then the combinations of the fields should prevent the duplicates problem (the three should make for a unique position). – Scard May 31 '12 at 22:47
  • 2
    If someone has answered you question may I suggest using the check mark (color it green) to indicate so. This is polite and of great help to future searchers. – Tyler Rinker Dec 28 '12 at 22:22
26

I recommend the following dplyr-based approach (h/t daattali) that can be extended to as many columns as you like:

library(dplyr)
Catalog <- Catalog %>%
  arrange(MIDDATE, TYPENAME) %>%               # sort your dataframe
  mutate(IDENTIFY = factor(IDENTIFY, unique(IDENTIFY))) # reset your factor-column based on that order
JelenaČuklina
  • 2,839
  • 2
  • 19
  • 32
zach
  • 22,141
  • 16
  • 57
  • 86
3

The function fct_reorder2 is doing just that.

Please note the subtlety that fct_reorder is sorting by ascending order while fct_reordering2 is sorting by descending order.

Code from the documentation:

df0 <- tibble::tribble(
  ~color,     ~a, ~b,
  "blue",      1,  2,
  "green",     6,  2,
  "purple",    3,  3,
  "red",       2,  3,
  "yellow",    5,  1

)

df0$color <- factor(df0$color)
fct_reorder(df0$color, df0$a, min)
 #> [1] blue   green  purple red    yellow
 #> Levels: blue red purple yellow green
fct_reorder2(df0$color, df0$a, df0$b)
YCR
  • 3,114
  • 20
  • 29