I was recently looking at comparing expression trees in ways that treated associative and commutative operations associatively and commutatively. For example,
A+1+B+2+3==4 AND NOT (C>3 OR B!=3+2)
One approach was to write a hash function for expression trees that treated its subtrees commutatively and associatively. But only when they're actually commutative and associative. (a AND b AND c) is, and so is (a + b + c), but (a * b + c * d), there might be trouble. I found such a hash. But this approach Equals() to work too, and Equals() is going to be more expensive than a simple tree comparison, because it's got to match up associative children (perhaps by hash value).
A better approach, at least if Hash() and Equals() can be expected to happen at least once per expression, is to normalize the expression up front then do the obvious implementations of Hash() and Equals(). (Those are, an order-dependent hash of this node and its children for Hash(), and a top-down simple tree comparison for Equals() except when the root nodes happen to be the same object. The result of Hash() can even be cached in each node if expression trees are immutable.)
What's the normalized form of (3+2+1)? Might it be (1+2+3)? No, it's 6. Constant folding is part of normalization. I originally thought I'd expose constant folding, but found there was no point, I just had to expose normalization. Another: (2<3 ? 5 : b+3) ... that's not constant, it's got a b in it. But it is constant, because 2<3 is true, so it always chooses 5 not b+3. You can't tell it is constant without first actually evaluating its condition. Same goes for (2>3 AND b>3
So how do you do normalization? The examples above would like it done bottom-up. But consider ((d+c)+b)+a, which you'd like to constant fold as ((a+b)+c)+d. Especially if you want immutable trees, which requires a full tree replacement (not just a spot edit) if you want to go from (b+c)+d to ((a+b)+c)+d. Bottom-up is going to be O(n*n) in the number of nodes in the tree or worse. These associative and commutative rewrites want to be top down. Sort of. Actually, they start at the top, recurse down to identify all the possibly commuting children, normalize the children, then build a tree out of the normalized children. They use the hashes of the children to order the children, and the hashes aren't right until the children have been normalized. Another optimization that favors top-down is NOT, which likes to be pushed down through AND and OR (at least in databases, which can do useful things with toplevel AND and OR chains but not with a toplevel NOT).
My current tack is to do normalization of expressions in two passes: one bottom up (which decides what expressions are constant, and does some bottom up rewrites that don't care exactly what their subexpressions are), then another pass kind of top down (doing NOT and associativity rewrites).
Back to the original example:
A+1+B+2+3==4 AND NOT (C>3 OR B!=3+2)That A+1+B+2+3, it would be nice to collect all the constants and constant-fold them. That's a matter of tweaking the comparison of subexpressions to treat constants all higher or all lower than everything else. So perhaps you'd get 6+A+B==4.
Should normalization go so far as to change 6+A+B==4 int A+B==-2? Or C==C into true? Or A>B AND B>C AND C>A into false? Or sin(x)*sin(x)+cos(x)*cos(x)==1 into true? I haven't gone there, so far.