# Mathematics for Intelligent Systems

**Mathematics for Intelligent Systems.** These documents cover a number of fundamental mathematical ideas and tools required for in-depth exploration of robotics, machine learning, optimization, and other intelligent systems. It start with a discussion of linear algebra from a geometric and coordinate-free viewpoint, and its relationship to advanced calculus and, in particular, the geometry of smooth mappings between spaces. It then moves into probability and statistical analysis, covering graphical models, some common statistical bounds, and decision theoretic frameworks.

**Linear Algebra I: Matrix Representations of Linear Transforms.****Linear Algebra II: The Fundamental Structure of Linear Transforms.****Probability II: Maximum Likelihood and Structure in Probability and Optimization**

Select homework implementations. They run in Octave and should also run in Matlab. See the top-level readme.txt file for details.

**Linear Algebra I: Matrix Representations of Linear Transforms.** Vector spaces exist independent of our choice of coordinate representation. Your house, the arrangement of its furniture, the cat sitting on the windowsill, or the glass falling from your hand toward its impending explosion across the floor, all exist independent of whether you choose to describe it all relative to a coordinate system aligned with your room or one aligned align with napping neighbor John at rest on his bed. This document introduces the notion of an abstract vector space and how coordinates, and ultimately matrices of linear transformations, arise from the underlying linearity structures imposed on the system.

**Linear Algebra II: The Fundamental Structure of Linear Transforms.** This document explores the fundamental underlying structure of linear transforms, ultimately showing how all linear transformations can be represented by a correspondence between orthonormal vectors in the domain and co-domain and an associated stretching along those axes. The structural properties of the linear transform are straightforwardly encoded in this representation, and it suggests a natural pseudoinverse for any linear transform that, in coordinates, manifests as the well-known Moore-Penrose pseudoinverse. The discussion culminates in a calculation that reveals that this fundamental underlying structure is exposed in coordinates by the Singular Value Decomposition of the transform's matrix given a choice of orthonormal basis for both the domain and co-domain.

**Linear Algebra III: Computations in Coordinates.** This document completes the step from abstract vector spaces to coordinates and introduces some computational masterpieces of matrix algebra. It starts with an exploration of the Singular Value Decomposition (SVD), and shows how many fundamental properties of the matrix manifest directly in the structure exposed by the computation. It then progresses to orthogonalization and the QR-decomposition, emphasizing how uniqueness of the decomposition enables us to approach the problem from vastly different directions while still ending up at the same result. And finally, it reviews Eigenvectors and Eigenvalues, and introduces the Power Method through one of its most famous applications, Google's PageRank algorithm. That discussion culminates in the presentation and analysis of the QR-algorithm for estimating the full Eigenspectrum of a matrix.

**Multivariate Calculus I: Derivatives and local geometry.** This document takes a first step beyond the tools of linear algebra and addresses the much larger space of nonlinear functions and maps that play a big role in many areas of machine learning, robotics, and optimization. We introduce the basics of higher-dimensional derivatives, focusing primarily on what they tell us about the local geometry of nonlinear maps and functions. Our journey takes us from Taylor approximations, which reveal geometry through connections back to the familiar tools of linear algebra, to the rote mechanics of derivative calculations. We discuss calculational tricks and tools emphasizing basic principles and post hoc geometric interpretation over formulas. And we explore in-depth the specific problem of calculating the positional Jacobian of a revolute manipulator to demonstrate how returning to the basic definition of what the Jacobian means allows us to side-step the typical cumbersome machinery of differential calculus and enable the direct derivation of a geometrically intuitive and computational effective algorithm for computing what has proven to be a critical mathematical tool in robotics.

**Multivariate Calculus II: The geometry of smooth maps. **This document explores the geometry of smooth maps, building many of the ideas from an analysis of attractor algorithms across surface representations. Through geometric intuition and algebraic analysis of these algorithms, we see how geometry may be pulled back from the co-domain of a smooth nonlinear map to construct a consistent representation of that geometry in the map's domain, and we study how we might construct algorithms that build on geometric or physical properties of a system so that their behavior is agnostic to the particular choice of representation. These tools and ideas originate in differential geometry and a study of Riemannian manifolds; this document emphasizes intuition over rigorous construction, particularly as demonstrated by the behavior of first-order Taylor expansions. In all cases, the map's Jacobian remains key to understanding how geometry transforms from one space to another.

**Optimization I: Theory and Analytical Methods.** Optimization is one of the most critical components of intelligent systems. Google uses optimization to learn search rankings and registering street view representations; financial analysts and hedge funds optimize strategies to make lots and lots of money; the world's most sophisticated robots, especially big bulky humanoids, are built around optimizers for control and motion generation, and even physical optimality principles that define their underlying dynamic models. This document presents first- and second-order criteria for understanding what's required of an optimal solution, and discusses their application to analytically solving various forms of optimization problems that arise frequently in practice. The material here in such a short document necessarily can only skim the surface of the available and widely applicable literature on optimization. One garners a deep understanding only through multiple classes, many books, and years of practical experience. But the analytical tools presented here are the building blocks for much of the theory, and they're a good introduction to this broad and fascinating area of study.

**Optimization II: Numerical Algorithms and Newton's Method.** Newton's method take center stage in this document as a workhorse for generating numerical optimization algorithms for both unconstrained nonlinear problems and constrained variants. This preliminary (unfinished) document on that body of work summaries some of the main result, building toward Sequential Quadratic Programming for equality constrained nonlinear problems which can be viewed as an application of Newton's method to solving the problem's KKT conditions.

**Probability I: The multi-dimensional basics.** Gaussian distributions are perhaps the most important distribution you'll ever encounter, not because they represent everything well (they're usually, by themselves, poor approximations to complex systems), but because they're one of the only high-dimensional distributions we can handle analytically. And because of that, many approaches to more complicated problems revolve around reducing the problems to Gaussian approximations or sequences of Gaussian approximations to leverage the tractable algebra of Gaussian inference, which, to a large extent, itself reduces to linear algebra. That's not to say Gaussian manipulations are easy--- in many ways the algebra can be tedious--- but having a thorough understanding of their properties, their relation to quadratic functions and linear systems, and their manipulation in the context of probabilistic inference for such queries as the transition and observation updates of a Kalman filter is crucial for a strong foundation for further study within the uncertainty-laden domain of state-estimation, localization, mapping, and sensor processing in mobile robotics.

(Originally titled: Probabilistic inference, Gaussians, quadratics, and the Kalman filter)

**Probability II: Maximum Likelihood and Structure in Probability and Optimization.** This document explores formal representations of structure in probability distributions and how algorithms can exploit that structure for efficient large-scale inference. We briefly introduce directed and undirected Graphical Models as well as Factor Graphs, but focuses mostly heavily on undirected representations of Gibbs distributions. We explore, in particular, how Maximum A Posteriori (MAP) estimation reduces to structured optimization, and review Gradient Descent, Jacobi Iteration, Gauss-Seidel, and Conjugate Gradient methods for solving the underlying sparse linear systems that arise during the computation of Newton steps. This document ends with a quick introduction to Maximum Likelihood in statistics, focusing on how it manifests in estimating Gibbs distributions from data.

**Information Geometry and Natural Gradients. **This document reviews some of the basic concepts behind natural gradients. We start by introducing basic information theoretic concepts such as optimal codes, entropy, and the KL-divergence. We then demonstrate how the choice of metric on perturbations can significantly affect the performance of gradient descent algorithms. Within that discussion, we review the Method of Lagrange Multipliers for equality constrained optimization as well as intuitive interpretations of the action of positive definite matrices and their inverses and how that relates to their role in generalized gradient descent updates. Finally, we review how the second-order Taylor expansion of the KL-divergence between a distribution and a slightly perturbed version of that distribution leads to the notion of the Fisher Information as a natural metric on the manifold of probability distributions.

**Basic Statistical Bounds: From Markov to Hoeffding.** This document explores some statistical bounds that have found a number of applications in Intelligent Systems, especially in Machine Learning. We start off discussing the the basic notion of an empirical mean estimator as a random variable and derive Markov's inequality and Chebyshev's inequality as stepping stones toward proving the Weak Law of Large Numbers. We then build on these results to prove Hoeffding's inequality, which relates the probability that an empirical mean estimator deviates from the true mean to characteristics of the domain and the number of samples. The final sections explore how to interpret and manipulate Hoeffding's inequality to answer important practical questions such as how many samples we'd need to guaranteed a certain confidence that the mean estimator falls close to the true mean.

**Sequential Decisions: Solving Markov Decision Processes.** Sequential decisions are the building blocks of intelligence. Intelligent agents must reason about future consequences of every action and optimize their behavior accordingly to discover high performing policies. This document introduces Markov Decision Processes (MDPs), a tractable mathematical formulation of the sequential decision problem, and three algorithms for solving them: Value Iteration, Policy Iteration, and Linear Programming. It additionally presents a form of Soft Value Iteration that enables tractable manipulation of Gibbs distributions over trajectories in these settings. These tools are some of the most basic constructs constituting the heavily researched field of Decision Theory. Throughout, we see numerous connections to core areas of mathematics such as Linear Algebra and Optimization, from a linear system whose solution reveals the cumulative accrued reward of policies acting over time, to an instance of the Power Method derived as an algorithm for calculating stochastic policy distributional normalizers. This exposition is necessarily incomplete, but these ideas are the catalyst for many successful areas of Robotics and Machine Learning dedicated to the design and implementation of intelligent machines.