Many of us have heard about Lasso and its ability to bring sparsity to models, but not everyone understands the nitty-gritty of how it actually works. In a nutshell, Lasso is like a superhero for overfitting problems, tackling them through a technique called regularization. If you’re not familiar with regularization and how it fights overfitting, I’d recommend checking that out first. For now, let’s dive into the magic of how Lasso brings sparsity.
Lasso’s cost function finds the best model by balancing two goals: making accurate predictions (the mean squared error) and keeping the coefficients small (the sum of their absolute values). It does this by adding a penalty that encourages simpler, more focused models.
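Written out, this is just the ordinary least-squares loss plus an L1 penalty (the standard textbook form; some libraries average or halve the first term, but the idea is the same):

$$\min_{\beta} \; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \;+\; \lambda \sum_{j=1}^{p} |\beta_j|$$

Here ŷᵢ is the model’s prediction for the i-th sample, and λ ≥ 0 controls how harshly large coefficients are punished.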

The objective of Lasso is to find the values of β that minimize this combined loss function, striking a balance between fitting the data well (the MSE term) and keeping the magnitudes of the coefficients small (the |βj| terms). This prevents overfitting and encourages sparsity by pushing some coefficients to exactly zero, effectively performing variable selection.
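To see this variable selection in action, here’s a minimal sketch using scikit-learn’s Lasso on synthetic data where only a few features actually matter (the dataset and the alpha value are purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, but only 3 truly influence the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

# alpha is the regularization strength (the lambda in the cost function)
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

print("Coefficients:", np.round(lasso.coef_, 2))
# Most of the 10 coefficients come out exactly 0.0 -- Lasso has
# effectively selected the informative features for us.
print("Non-zero features:", np.flatnonzero(lasso.coef_))
```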
Now let’s understand how exactly it brings sparsity. For simplicity, let’s assume we have only two coefficients, β1 and β2, in a 2D plane, which gives us the constraint |β1| + |β2| ≤ t.

Think of it as trying to make these coefficients as small as possible, but with a twist: we add a constraint that caps the total absolute value of the coefficients. It’s like saying, ‘Hey, keep the sum of these coefficients below a certain threshold.’ That threshold is denoted ‘t,’ and its size matters: a larger ‘t’ allows for larger coefficients. And ‘t’ is inversely related to λ, so a stronger penalty means a tighter budget.
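To make the geometry concrete, here’s a small sketch that draws this L1 ‘diamond’ for two budgets t alongside the elliptical contours of an MSE-style loss. The quadratic loss and its unconstrained minimum at (2, 1) are made-up values, just for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over the (beta1, beta2) plane
b1, b2 = np.meshgrid(np.linspace(-3, 3, 400), np.linspace(-3, 3, 400))

# A made-up quadratic (MSE-like) loss with its unconstrained
# minimum at (2, 1) -- purely illustrative
loss = (b1 - 2)**2 + 2 * (b2 - 1)**2

fig, ax = plt.subplots(figsize=(6, 6))
ax.contour(b1, b2, loss, levels=10, cmap="Blues")  # elliptical loss contours

# The Lasso constraint |b1| + |b2| <= t is a diamond; draw two budgets
for t, color in [(1.0, "red"), (2.0, "orange")]:
    ax.contour(b1, b2, np.abs(b1) + np.abs(b2), levels=[t], colors=color)

ax.axhline(0, color="gray", lw=0.5)
ax.axvline(0, color="gray", lw=0.5)
ax.set_xlabel("beta1")
ax.set_ylabel("beta2")
ax.set_title("L1 diamond meets the loss ellipses at a corner")
plt.show()
```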
To understand it better, extend the same idea to 3D: |β1| + |β2| + |β3| ≤ t. Now we get the essence: as the number of dimensions increases, the constraint region gains more corners and edges, and at each of those points one or more coefficients is exactly zero. The corner on the β3 axis, for example, corresponds to β1 = β2 = 0, and so on.
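The practical consequence in higher dimensions: the tighter the budget (a larger λ, i.e. a smaller t), the more coefficients land on those corners and become exactly zero. A quick sketch, reusing the synthetic-data idea from above with illustrative alpha values:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Raising alpha (lambda) is the same as shrinking the budget t:
# more and more coefficients get pinned to exactly zero.
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    n_zero = np.sum(model.coef_ == 0.0)
    print(f"alpha={alpha:>5}: {n_zero}/20 coefficients are exactly zero")
```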

Visualizing the MSE loss as ellipses (the contours of the Residual Sum of Squares) centered on the unconstrained solution, it becomes apparent that the Lasso constraint, with its sharp corners, is more likely to touch those ellipses at a corner — and at a corner, some coefficients are exactly zero.
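Contrast this with Ridge, whose circular constraint has no corners: it shrinks coefficients toward zero but rarely lands on zero exactly. A minimal side-by-side sketch (same illustrative data and alpha as before):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Ridge shrinks coefficients but almost never hits zero exactly;
# Lasso's corners make exact zeros common.
print("Lasso zeros:", np.sum(lasso.coef_ == 0.0), "of 10")
print("Ridge zeros:", np.sum(ridge.coef_ == 0.0), "of 10")
```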

So, in a nutshell, Lasso’s unique constraint shape, with its sharp corners, makes it great at pushing some coefficients to exactly zero, giving the model a simplicity that smoother penalties like Ridge don’t achieve.