It has nothing to do with being easier to work with (at least, not in this day and age). The biggest reason is that minimizing the sum of squared residuals gives the maximum likelihood estimator if you assume the error is iid normal.
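To make that concrete, here's a minimal sketch (assuming numpy and scipy; the data, seed, and parameter names are made up for illustration) showing that maximizing the Gaussian likelihood lands on the same line as ordinary least squares:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

    # Least squares: np.polyfit minimizes the sum of squared residuals.
    a_ls, b_ls = np.polyfit(x, y, 1)

    # Maximum likelihood under y = a*x + b + e, e ~ N(0, sigma^2) iid:
    # minimize the negative log-likelihood over (a, b, log sigma).
    def nll(params):
        a, b, log_sigma = params
        r = y - (a * x + b)
        return 0.5 * np.sum(r**2) / np.exp(2 * log_sigma) + x.size * log_sigma

    a_ml, b_ml, _ = minimize(nll, x0=[0.0, 0.0, 0.0]).x

    print(a_ls, b_ls)  # roughly 2.0 and 1.0
    print(a_ml, b_ml)  # same slope and intercept, up to optimizer tolerance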
If your model is different (y = Ax + b + e where the error e is not normal) then it could be that a different penalty function is more appropriate. In the real world, this is actually very often the case, because the error can be long-tailed. The power of 1 is sometimes used. Also common is the Huber loss function, which coincides with e^2 (residual squared) for small values of e but is linear for larger values. This has the effect of putting less weight on outliers: it is "robust".
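For illustration, a small sketch of a Huber fit next to a plain least-squares fit on data with one outlier (assuming numpy and scipy; the delta of 1.0, the data, and the seed are arbitrary):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 30)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
    y[5] += 30.0  # one long-tailed error

    def huber(r, delta=1.0):
        a = np.abs(r)
        # quadratic near zero, linear in the tails
        return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

    def fit(loss):
        objective = lambda p: np.sum(loss(y - (p[0] * x + p[1])))
        return minimize(objective, x0=[0.0, 0.0]).x

    print(fit(lambda r: r**2))  # slope/intercept pulled around by the outlier
    print(fit(huber))           # stays close to (2, 1)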
In principle, if you knew the distribution of the noise/error, you could calculate the correct penalty function to give the maximum likelihood estimate. More on this (with explicit formulas) in Boyd and Vandenberghe's "Convex Optimization" (freely available on their website), pp. 352-353.
Edit: I remembered another reason. Least squares fits are also popular because they are what is required for ANOVA, a very old and still-popular methodology for breaking down variance into components (this is what people refer to when they say things like "75% of the variance is due to <predictor>"). ANOVA is fundamentally based on the Pythagorean theorem, which lives in Euclidean geometry and requires squares. So as I understand it, ANOVA demands that you do a least-squares fit, even if it's not really appropriate for the situation.
The mean minimizes squared (L2) error, so I guess there's some connection there if you derive OLS by treating X, Y as random variables and trying to estimate Y conditioned on X in a linear form: "L[Y|X] = E[Y] + a*(X - E[X])"
If the dataset truly is linear then we'd like this linear estimator to be equivalent to the conditional expectation E[Y|X], so we therefore use the L2 norm and minimize E[(Y - L[Y|X])^2]. Note that we're forced to use the L2 norm since only then will the recovered L[Y|X] correspond to the conditional expectation/mean.
I believe this is similar to the BLUE argument another commenter mentioned. The random variable formulation makes it easy to see how the L2 norm falls out of trying to estimate E[Y|X] (which is certainly a "natural" target). The Gauss-Markov theorem then gives the more rigorous justification: it states the conditions under which our estimator is unbiased, i.e. E[Y|X=x] = E[Lhat|X=x] (where Lhat[Y|X] != L[Y|X], because we compute our variances/covariances/expectations from a sample rather than the true population), and that under those conditions Lhat is "best": Var[Lhat|X=x] <= Var[Lhat'|X=x] for any other unbiased linear estimator Lhat'.
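To make that concrete, a small sketch (assuming numpy; the simulated data are mine): the minimizer of E[(Y - L[Y|X])^2] over linear L has slope a = Cov(X,Y)/Var(X) and intercept E[Y] - a*E[X], and the sample version is exactly the OLS fit.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=200)
    y = 3.0 * x + 0.5 + rng.normal(size=200)

    a = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # Cov(X,Y) / Var(X)
    b = y.mean() - a * x.mean()                 # E[Y] - a*E[X]
    print(a, b)                  # roughly 3.0 and 0.5
    print(np.polyfit(x, y, 1))   # same coefficients from least squares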
Wikipedia has some notes on why least squares, and how you might get there from other assumptions: https://en.wikipedia.org/wiki/Least_squares#Statistical_test... .
Also, quadratics are just much easier to work with in a lot of ways than higher powers. Like you said, even powers have the advantage over odd powers of not needing any sort of absolute value, but quartics of any kind are much harder to work with than quadratics: a local optimum of a general quartic isn't necessarily a global optimum, you lose the solvability advantages of having linear derivatives, et cetera.
To put it very simplistically, from a mostly practical view:
- abs: not differentiable at 0; has multiple minima; the error surface has sharp ridges
- power 4: way too sensitive to noise
- power 3: var(x+y) != var(x) + var(y)
A power of 1 doesn’t guarantee a unique solution. A simple example has 3 points:
(0,0), (1,0), (1,1)
Any y = a × x with a between zero and one gives you a sum of errors of 1.
Powers less than 1 have the undesired property that they will prefer making one large error over multiple small ones. With the same 3-point example and a power of ½, you get:
- for y = 0, a cumulative error of 1
- for y = x/2, a cumulative error of 2 times √½. That’s √2 or about 1.4
- for y = x, a cumulative error of 1
(The underlying reason is that √(|x|+|y|) < √|x| + √|y|. Conversely, for powers p larger than 1, we have (|x|+|y|)^p > |x|^p + |y|^p.)
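A quick numerical check of the example above (assuming numpy; purely illustrative):

    import numpy as np

    # The three points (0,0), (1,0), (1,1), fit by lines y = a*x,
    # with |residual|^p as the penalty.
    pts = np.array([(0, 0), (1, 0), (1, 1)], dtype=float)

    def err(a, p):
        x, y = pts[:, 0], pts[:, 1]
        return np.sum(np.abs(y - a * x) ** p)

    # p = 1: every a in [0, 1] ties at a cumulative error of 1 (no unique minimum).
    print([err(a, 1) for a in (0.0, 0.25, 0.5, 0.75, 1.0)])   # all 1.0

    # p = 1/2: one large error (the endpoints) beats two small ones (the middle).
    print(err(0.0, 0.5), err(0.5, 0.5), err(1.0, 0.5))        # 1.0, ~1.414, 1.0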
Odd powers would require you to take absolute differences to avoid getting, for example, an error of -2 giving a contribution of (-2)³ = -8 to the cumulative error. Otherwise they would work fine.
The math for squares (https://en.wikipedia.org/wiki/Simple_linear_regression) is easy, even when done by hand, and has some desirable properties (https://en.wikipedia.org/wiki/Simple_linear_regression#Numer...)
For linear models, least squares leads to the BLUE estimator: Best Linear Unbiased Estimator. This acronym is doing a lot of work, with each of the words having a specific technical meaning.
Fitting the model is also "nice" mathematically. It's a convex optimization problem, and in fact fairly straightforward linear algebra. The estimated coefficients are linear in y, and this also makes it easy to give standard errors and such for the coefficients!
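A sketch of how short that linear algebra is (assuming numpy; the simulated data and seed are arbitrary): the coefficients are a fixed linear function of y, and the standard errors fall out of the same matrix.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])          # design matrix with intercept
    y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                      # (X^T X)^{-1} X^T y, linear in y

    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])     # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * XtX_inv))       # standard errors of the coefficients
    print(beta, se)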
Also, this is what you would do if you were doing Maximum Likelihood assuming Gaussian distributed noise in y, which is often a sensible assumption (but not a strict requirement for using least squares).
Also, in a geometric sense, it means you are finding the model that puts its predictions closest to y in terms of Euclidean distance. So if you draw a diagram of what is going on, least squares seems like a reasonable choice. The geometry also helps you understand things like "degrees of freedom".
So, many overlapping reasons.
"Best" meaning "least variance", where variance is calculated from the sum of squared residuals. There is a circularity in that definition.
Least squares is guaranteed to be convex [0]. At least for linear fit functions there is only one minimum and gradient descent is guaranteed to take you there (and you can solve it with a simple matrix inversion, which doesn't even require iteration).
Intuitively this is because a multidimensional parabola looks like a bowl, so it's easy to find the bottom. For higher powers the shape can be more complicated and have multiple minima.
But I guess these arguments are more about making the problem easy to solve. There could be applications where higher powers are worth the extra difficulty. You have to think about what you're trying to optimize.
[0] https://math.stackexchange.com/questions/483339/proof-of-con...
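To see both routes from the comment above on one toy problem, a minimal sketch (assuming numpy; the learning rate and iteration count are hand-tuned for this example):

    import numpy as np

    rng = np.random.default_rng(4)
    X = np.column_stack([np.ones(50), rng.normal(size=50)])
    y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=50)

    # Route 1: the closed form ("a simple matrix inversion").
    beta_exact = np.linalg.solve(X.T @ X, X.T @ y)

    # Route 2: gradient descent on the mean squared error. The objective is a
    # convex bowl, so this walks to the same unique minimum.
    beta = np.zeros(2)
    lr, n = 0.1, len(y)
    for _ in range(2000):
        beta -= lr * (2.0 / n) * X.T @ (X @ beta - y)

    print(beta_exact)
    print(beta)  # effectively identical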
L1 (absolute difference) is useful because minimizing it approximates minimizing L0 (the count of nonzeros, i.e. maximizing sparsity). The reason for the substitution is that L1 is convex with a (sub)gradient, so minimization is fast with conventional gradient-based methods, while minimizing L0 is a combinatorial problem and solving that is "hard". It is also common to add an L1 term to an L2 term to bias the solution to be sparse.
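A minimal sketch of that L1-plus-L2 case using iterative soft-thresholding (ISTA), assuming numpy; the penalty weight lam, the data, and the seed are made up:

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(100, 20))
    b_true = np.zeros(20)
    b_true[[2, 7, 11]] = [3.0, -2.0, 1.5]   # only three nonzero coefficients
    y = X @ b_true + rng.normal(scale=0.1, size=100)

    # Minimize 0.5*||X b - y||^2 + lam*||b||_1 by iterative soft-thresholding.
    lam = 5.0
    t = 1.0 / np.linalg.norm(X, 2) ** 2     # step size from the spectral norm
    b = np.zeros(20)
    for _ in range(2000):
        g = b - t * X.T @ (X @ b - y)       # gradient step on the squared term
        b = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)  # soft-threshold (the L1 part)

    print(np.nonzero(np.abs(b) > 1e-8)[0])  # the recovered sparse support, i.e. [2 7 11]

The soft-threshold step sets small coefficients exactly to zero, which is the sense in which the added L1 term biases the solution to be sparse.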
It all boils down to the fact that mean and variance give a good approximation of a probability distribution.
In the same way that averages of many samples converge to the mean, sums of many small independent effects converge to a normal distribution (the central limit theorem). So modeling the noise as a normal distribution is often good enough.
The second order approximation is just really really good, and higher orders are nigh impossible to work with.
One way to think of it is that each point in your data follows your model but with Gaussian iid noise shifting it away. The likelihood is then a product of Gaussians, each mean-shifted and scaled by the variance. Maximizing the log-likelihood then becomes minimizing the sum of (x - mu)^2 over the points, which is essentially least squares.
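Spelled out (with mu_i the model's prediction for point i and a common noise variance sigma^2, held fixed):

    \log L = \sum_i \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma^2}\right)\right]
           = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i-\mu_i)^2

Only the last term depends on the model parameters, so maximizing log L is the same as minimizing sum_i (x_i - mu_i)^2.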
Squares are preferred because that is the same as minimizing Euclidean Distance, which is defined as sqrt((x2-x1)^2).
sqrt((x2-x1)^2) == |x2 - x1|
I think you meant sqrt(x^2+y^2)
Because the Euclidean norm is defined as the square root of a sum of squares. You can drop the square root and calculate everything as a least squares optimization problem. This problem in turn can be solved by finding where the derivative of the quadratic function is zero. The derivative of a quadratic function is a linear function. This means it is possible to find a matrix decomposition, say QR decomposition, and solve the problem directly.
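A small sketch of the QR route (assuming numpy and scipy; the simulated data are arbitrary): with X = QR, the normal equations X^T X b = X^T y reduce to the triangular system R b = Q^T y.

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(6)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

    Q, R = np.linalg.qr(X)                       # reduced QR: Q is 50x3, R is 3x3
    b = solve_triangular(R, Q.T @ y)             # back-substitution, no explicit inverse
    print(b)
    print(np.linalg.lstsq(X, y, rcond=None)[0])  # same answer from the library routine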
If you want nonlinear optimization, your best bet is sequential quadratic programming. So even in that case you're still doing quadratic programming, with extra steps.
I haven't done it in a while, but you can do cubes (and more) too. Cubes would be the L3 norm, something about the distance between circles (spheres?) in 3d space? I need to read about norms again to tell you why or when to choose that, but I know the Googlable term is "vector norms"
I remember one is Manhattan distance, next is as-the-crow-flies straight line distance, next is if you were a crow on the earth that can also swim in a straight line underwater, and so on
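If you want to poke at them, numpy computes general vector p-norms directly (a tiny illustration with an arbitrary vector):

    import numpy as np

    # p = 1 is Manhattan distance, p = 2 is straight-line (Euclidean) distance,
    # and larger p weights the biggest component more (p -> infinity is max-abs).
    v = np.array([3.0, 4.0])
    for p in (1, 2, 3, np.inf):
        print(p, np.linalg.norm(v, ord=p))
    # 1 -> 7.0, 2 -> 5.0, 3 -> ~4.498, inf -> 4.0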