This is a really good question and it comes up (among other places) in tensor category theory, where equality is replaced by canonical isomorphism and suddenly things cannot be handwaved away... I'll try to give an answer; unfortunately it is lacking in illustrations (due to math.stackexchange not having packages like xypic) and examples (due to my laziness). Sorry also for its length...

I will write $R$ for $\mathbb R$ because really there is nothing specific about $\mathbb R$ that you are using here (and, honestly, because I am too lazy to type in the \mathbb part all the time). Here are two ways of stating general associativity in $R$:

**First way:** Here is a mild variation on the way to state "general associativity" that you have suggested: We can define, for every positive integer $n$, a map $\operatorname{sum}_n : R^n \to R$; we define it by induction over $n$:

The map $\operatorname{sum}_1 : R^1 \to R$ is defined as the map sending each $\left(r\right) \in R^1$ to $R$.

If $\operatorname{sum}_n : R^n \to R$ is defined for some $n$, then $\operatorname{sum}_{n+1} : R^{n+1} \to R$ is defined by $\operatorname{sum}_{n+1} \left(a_1,a_2,...,a_{n+1}\right) = \operatorname{sum}_n \left(a_1,a_2,...,a_n\right) + a_{n+1}$.

Then, "general associativity" states that whenever $n$ and $k$ are integers such that $0 < k < n$, and whenever $\left(r_1,r_2,...,r_n\right)\in R^n$ is an $n$-tuple, we have

$\operatorname{sum}_n \left(r_1,r_2,...,r_n\right) = \operatorname{sum}_k \left(r_1,r_2,...,r_k\right) + \operatorname{sum}_{n-k} \left(r_{k+1},r_{k+2},...,r_n\right)$.

So you don't like this formulation, because rather than explicitly formalizing the concept of a bracketing, it only formalizes its "first step" (i.e., it allows "moving the outermost brackets"). Before I continue, let me say that **for most applications this is perfectly enough**, as you can use induction instead of referring to some obscure notion of "bracketings" and your proofs will likely only gain in clarity. But sometimes one really has to go the final mile. This is more complicated:

**Second way:** The best way to formalize the concept of a bracketing (for a binary operation) is the notion of a *binary tree*. The Wikipedia page on this is quite confusing, and the other relevant pages on the Wikipedia (associahedron, Tamari lattice) are not very elementary, so let me try to provide a definition. The simplest way to define an (unlabelled) *binary tree with $n$ leaves* (for $n$ a positive integer) is to say that:

the empty set $\emptyset$ is a binary tree with $1$ leaf, called the *empty binary tree* (don't think of the leaves as any actual objects -- all we care about here is the number of leaves of a tree, which is just a number that we keep track of);

if $A$ and $B$ are two binary trees with $a$ and $b$ leaves, respectively, then the pair $\left(A,B\right)$ is a binary tree with $a+b$ leaves; this tree is said to have *left child* $A$ and *right child* $B$ (bad terminology if you ask me, since these children are used to create the tree...).

This is an inductive definition, meaning that we consider all objects which are forced to be binary trees by these axioms to be binary trees, and no others.

So each of $\emptyset$, $\left(\emptyset, \emptyset\right)$, $\left(\left(\emptyset, \emptyset\right), \emptyset\right)$, $\left(\emptyset,\left(\left(\emptyset,\left(\emptyset,\emptyset\right)\right),\emptyset\right)\right)$ is a binary tree. The triple $\left(\emptyset,\emptyset,\emptyset\right)$ is not a binary tree, because we have no axiom that could make a triple a binary tree.

(Caveat lector: Binary trees are not trees in the sense of graph theory, and actually not trees in the sense of computer science either. Normally, "trees" are not allowed to be empty, and there is no distinction between left and right children. Here the empty tree exists and so does the left-right distinction: (for example) the binary trees $\left(\left(\emptyset, \emptyset\right), \emptyset\right)$ and $\left(\emptyset,\left(\emptyset, \emptyset\right)\right)$ are not the same.)

Of course, binary trees are so called because they can be drawn as trees. To draw the empty binary tree, draw a single dot, which is called the "root" of the tree. To draw a tree of the form $\left(A,B\right)$, you need to draw a dot, which stands for the "root" of this tree, and then one edge from this "root" that goes southwest and another edge from this "root" that goes southeast. Now, draw the tree $A$ (this is a recursive algorithm) in such a way that its "root" is placed at the end of the southwest edge. Also, draw the tree $B$ in such a way that its "root" is placed at the end of the southeast edge.

Sadly I cannot draw on here, but if you look at this image and pretend that $\alpha$, $\beta$, $\gamma$ and the circles are dots, then the left binary tree is $\left(\emptyset, \left(\emptyset, \emptyset\right)\right)$, and the right binary tree is $\left(\left(\emptyset, \emptyset\right), \emptyset\right)$. Also, in this picture (think of the halfmoons/bananas as dots too) you see all five possible binary trees with $4$ leaves. (Yes, the "leaves" are the halfmoons/bananas; they are dots from which no edges go down.)

Now, to any nonnegative integer $n$ and any binary tree $T$ with $n$ leaves, we can assign a map $s_T : R^n \to R$; it is defined (recursively) as follows:

If $T$ is the empty binary tree (so that $n=1$), then $s_T : R^1 \to R$ is the map sending each $\left(r\right)$ to $r$.

If $T$ has the form $\left(A,B\right)$ for two binary trees $A$ and $B$ having respectively $a$ and $b$ leaves, then we already have maps $s_A : R^a \to R$ and $s_B : R^b \to R$ (this is an inductive definition), and we need to define a map $s_T : R^{a+b} \to R$. Define it by $s_T\left(r_1,r_2,...,r_{a+b}\right) = s_A\left(r_1,r_2,...,r_a\right) + s_B\left(r_{a+1},r_{a+2},...,r_{a+b}\right)$.

With the maps $s_T$ thus defined, "general associativity" says that for any fixed positive integer $n$, the maps $s_T : R^n \to R$ for all possible binary trees $T$ with $n$ leaves are the same.

Let us see what this means for $n = 1$. There is only one binary tree with $1$ leaf, namely the empty tree, and the corresponding map sends each $\left(r_1\right) \in R^1$ to $r_1$. So the statement we are making is trivial for $n = 1$.

Let us see what "general associativity" means for $n = 2$. There is only one binary tree with $2$ leaves, namely the tree $\left(\emptyset,\emptyset\right)$, and the corresponding map sends each $\left(r_1,r_2\right) \in R^2$ to $r_1 + r_2$. Again the statement is trivial.

The first nontrivial consequence is what we get for $n = 3$. There are two binary trees with $3$ leaves, namely $\left(\emptyset, \left(\emptyset, \emptyset\right)\right)$ and $\left(\left(\emptyset, \emptyset\right), \emptyset\right)$. The $s_T$ for the first of these trees is the map sending each $\left(r_1,r_2,r_3\right) \in R^3$ to $r_1 + \left(r_2+r_3\right)$. The $s_T$ for the second of these trees is the map sending each $\left(r_1,r_2,r_3\right) \in R^3$ to $\left(r_1+r_2\right) + r_3$. So "general associativity" for $n=3$ yields that $r_1 + \left(r_2+r_3\right) = \left(r_1+r_2\right) + r_3$ for all $r_1,r_2,r_3 \in R$. This is just the usual associativity law.

For $n = 4$, there are five binary trees with $4$ leaves, which I list here along with the corresponding maps $s_T$:

$\left(\emptyset, \left(\emptyset, \left(\emptyset,\emptyset\right)\right)\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $r_1+\left(r_2+\left(r_3+r_4\right)\right)$;

$\left(\emptyset, \left(\left(\emptyset,\emptyset\right), \emptyset\right)\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $r_1+\left(\left(r_2+r_3\right)+r_4\right)$;

$\left(\left(\emptyset, \emptyset\right), \left(\emptyset,\emptyset\right)\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $\left(r_1+r_2\right)+\left(r_3+r_4\right)$;

$\left(\left(\emptyset, \left(\emptyset,\emptyset\right)\right), \emptyset\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $\left(r_1+\left(r_2+r_3\right)\right)+r_4$;

$\left(\left(\left(\emptyset,\emptyset\right), \emptyset\right), \emptyset\right)$, with $s_T$ sending $\left(r_1,r_2,r_3,r_4\right) \in R^4$ to $\left(\left(r_1+r_2\right)+r_3\right)+r_4$.

So "general associativity" for $n=4$ says that

$r_1+\left(r_2+\left(r_3+r_4\right)\right) = r_1+\left(\left(r_2+r_3\right)+r_4\right) = r_1+\left(\left(r_2+r_3\right)+r_4\right) = \left(r_1+\left(r_2+r_3\right)\right)+r_4 = \left(\left(r_1+r_2\right)+r_3\right)+r_4$.

In general, the "bracketings" of $n$ symbols using a fixed binary operation correspond to the binary trees with $n$ leaves.

Now this got damn long and nowhere near readable. Can any LaTeX wizard step in and suggest a simple drawing package that does work on math.stackexchange? It doesn't take much to draw a tree, and it should clear up a lot...

**EDIT:** Here is a third way, which really is completely equivalent to the second way, but has the advantage of a less confusing definition. The idea is to use Dyck words instead of trees. (The Wikipedia page on Catalan numbers actually provides a good synopsis of what I am saying -- and some good pictures of binary trees. It calls "full binary tree" what I call "binary tree".)

**Third way:** What is a Dyck word? In combinatorics, "word" is just a fancy word (oops) for "tuple" (usually finite), and the "letters" of the word simply mean the entries of the tuple. Unless there is risk of confusion with actual products (or numbers), one abbreviates a tuple (or word) $\left(a_1,a_2,...,a_k\right)$ as $a_1a_2...a_k$. So the word $\left(0,1,1,1,0\right)$ is written as $01110$.

Now, let $n \in \mathbb{N}$. A *Dyck word of length $2n$* is defined to be a word whose letters are $0$'s and $1$'s, each appearing $n$ times in total (so the word has altogether $2n$ letters), such that for every $i$, there are at least as many $0$'s among the first $i$ letters as there are $1$'s among them. So, for example, $0110$ is not a Dyck word, because among the first $3$ letters there are fewer $0$'s than $1$'s (namely, one $0$ and two $1$'s). Also, $10001$ is not a Dyck word, because among the first $1$ letter there are fewer $0$'s than $1$'s (no $0$ at all and one $1$). Also, $0100$ is not Dyck because its total number of $0$'s does not equal its total number of $1$'s. But $001011$ is a Dyck word (of length $2\cdot 3=6$), as you can easily check: read the word from left to right, keeping track of how many $0$'s and how many $1$'s you have encountered; if the count for $1$'s overtakes the count for $0$'s, then your word is not Dyck; if both counts are equal at the end, then it is Dyck; otherwise it is not Dyck.

(Wikipedia uses the two letters "X" and "Y" instead of $0$ and $1$; other than this, there is no difference.)

Now, to any nonnegative integer $n$ and any Dyck word $w$ of length $2n$, we can assign a map $s_w : R^{n+1} \to R$; it is defined (recursively) as follows:

If $n = 0$ (so that the word $w$ is empty -- a $0$-tuple), then $s_w : R^1 \to R$ is the map sending each $\left(r\right)$ to $r$.

Otherwise, let $j$ be the **smallest** positive $i$ such that the number of $0$'s among the first $i$ letters of $w$ equals the number of $1$'s among the first $i$ letters of $w$. (Such an $i$ exists, because $i = 2n$ does the trick; thus, the smallest such $i$ also exists.) Notice that this $j$ must be even, because otherwise the number of $0$'s among the first $j$ letters of $w$ could not equal the number of $1$'s among these letters for parity reasons. Write $w$ as $w_1w_2...w_{2n} = \left(w_1,w_2,...,w_{2n}\right)$. (Then, it is easy to see that $w_1 = 0$ and $w_j = 1$.) Let $u$ be the word $w_2w_3...w_{j-1}$, and let $v$ be the word $w_{j+1}w_{j+2}...w_{2n}$. (It is easy to see that both $u$ and $v$ are Dyck words.) Then, we define a map $s_w : R^{n+1} \to R$ by $s_w\left(r_1,r_2,...,r_{n+1}\right) = s_u\left(r_1,r_2,...,r_{j/2}\right) + s_v\left(r_{j/2+1},r_{j/2+2},...,r_n\right)$.

With the maps $s_w$ thus defined, "general associativity" says that for any fixed positive integer $n$, the maps $s_w : R^{n+1} \to R$ for all possible Dyck words of length $2n$ are the same.

Let us see what this means for $n = 1$. We have only one Dyck word of length $2\cdot 1 = 2$, namely $01$. Let this word be $w$. Looking up the definition of $s_w$, we see that $u$ and $v$ both are the empty word. Hence, every $\left(r_1,r_2\right) \in R^2$ satisfies $s_w\left(r_1,r_2\right) = s_u\left(r_1\right) + s_v\left(r_2\right) = r_1+r_2$. Of course, "general associativity" says nothing insightful, because there is only one $w$.

Let us see what we get for $n = 2$. We have two Dyck words of length $2\cdot 2 = 4$, namely $0011$ and $0101$.

Let $w$ be the Dyck word $0011$. Then, in our above definition of $s_w$, we have $j = 4$, $u = 01$ and $v = \left(\text{empty word}\right)$. Hence, every $\left(r_1,r_2,r_3\right) \in R^3$ satisfies

$s_w\left(r_1,r_2,r_3\right) = \underbrace{s_u\left(r_1,r_2\right)}_{=r_1+r_2} + \underbrace{s_v\left(r_3\right)}_{=r_3} = \left(r_1+r_2\right) + r_3$.

Similarly, if $w$ is the Dyck word $0101$, then $s_w\left(r_1,r_2,r_3\right) = r_1 + \left(r_2+r_3\right)$. So "general associativity" for $n = 2$ claims $\left(r_1+r_2\right) + r_3 = r_1 + \left(r_2+r_3\right)$, which is exactly what you would expect (except that the $n$ here is shifted by $1$ compared to the second way above).

You asked for references. Unfortunately I also know no better than google for "bracketings and associativity", and it is very hard to find reader-friendly treatments of these formalization issues. There is Huang and Tamari's famous *Problems of associativity* paper, but this is not about formalizing associativity; it is about the combinatorics of the bracketings (or Dyck words or binary trees). A whole book dedicated to this kind of combinatorics (*Associahedra, Tamari Lattices and Related Structures*) has been published in 2012, and you can find all (or most?) of its chapters on the internet (for example, Ross Street's *Parenthetic Remarks*) if you are interested. Sadly, this is again on a level where questions like "how is a bracketing defined?" appear as too trivial. These foundational issues are usually ignored when they arise in elementary algebra because authors are too lazy to discuss them or do not want to expect their readers to put up with the large amount of formalism and pedantry involved in their discussion. But sometimes people care about them when they reappear in the theory of tensor categories (aka monoidal categories) because they are less easy to argue away in that situation (and the First way I showed above is often not enough for tensor categories); many nevertheless try to. MacLane, in his *Natural associativity and commutativity*, uses "iterates" of the associativity morphism in a category to encode our bracketings. It is not very explicit but his paper seems fairly well-written (but you have to know some category theory...).

I should also say that a few good algebra texts prove "general associativity" formalized according to the First way I sketched above, or in similar veins. See, for instance, Theorem 2 in §I.1 of Claude Chevalley, *Fundamental Concepts of Algebra*, 1956, which gives a slightly stronger version of the First way.