We learned correlation. Rank features by Pearson r. Keep the big ones.

But correlation only sees linear signal. We know that…

Y = X → r ≈ 1
Y = X² → r ≈ 0 (when X is symmetric around 0)
Y = sin(X) → r ≈ 0 (over many periods)

Strong signal. Zero correlation. Hmmm… 🤔
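You can verify this in a few lines. A minimal sketch, assuming X is sampled symmetrically around 0 and spans many periods of sin:

```python
import numpy as np

# symmetric around 0, covering many periods of sin
x = np.linspace(-10 * np.pi, 10 * np.pi, 10_001)

def pearson_r(a, b):
    # Pearson correlation via the 2x2 correlation matrix
    return np.corrcoef(a, b)[0, 1]

print(pearson_r(x, x))          # 1.0 — perfectly linear
print(pearson_r(x, x ** 2))     # ~0 — symmetry cancels the linear trend
print(pearson_r(x, np.sin(x)))  # near 0 — the oscillation averages out
```

All three relationships are deterministic, yet only the linear one registers.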

Enter Mutual Information (MI)

One question: How much does knowing X reduce uncertainty about Y?
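Formally, that question is MI(X; Y) = H(Y) − H(Y | X): the entropy of Y minus the entropy left once X is known. A tiny worked example, with a hypothetical 2×2 joint distribution:

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits; zero-probability cells contribute nothing
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution P(X, Y): rows = x values, cols = y values
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)  # marginal P(X)
p_y = joint.sum(axis=0)  # marginal P(Y)

h_y = entropy(p_y)  # uncertainty about Y alone

# H(Y|X) = sum over x of P(x) * H(Y | X = x)
h_y_given_x = sum(p_x[i] * entropy(joint[i] / p_x[i]) for i in range(len(p_x)))

mi = h_y - h_y_given_x  # the uncertainty that knowing X removes
print(mi)               # ~0.28 bits
```

Knowing X here shrinks the uncertainty about Y from 1 bit to about 0.72 bits, so MI ≈ 0.28 bits.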

MI catches linear and nonlinear signal. Decision trees were built on it (information gain). Yet feature-selection pipelines often skip it.

from sklearn.feature_selection import mutual_info_classif

# one MI estimate per feature; higher = more shared information with y
mi = mutual_info_classif(X, y, random_state=0)

Compare MI vs correlation rankings. Different features on top? You've been missing signal.
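Here is one way that comparison can play out, on hypothetical toy data where one feature carries its signal only through the spread, not the mean:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)

# feature 0: linearly related to y
x_linear = y + 0.5 * rng.normal(size=n)
# feature 1: same mean in both classes, but the class controls the spread —
# informative, yet invisible to Pearson r
x_nonlin = rng.normal(size=n) * (1 + 2 * y)

X = np.column_stack([x_linear, x_nonlin])

abs_r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
mi = mutual_info_classif(X, y, random_state=0)

print("|r|:", abs_r.round(3))  # feature 1 ≈ 0: correlation calls it useless
print("MI :", mi.round(3))     # feature 1 clearly > 0: MI sees the signal
```

Feature 1 would be dropped by a correlation filter and kept by an MI filter, which is exactly the disagreement worth investigating.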

The Tradeoff

Correlation is fast and gives you a direction, but it is blind to nonlinear relationships. MI is slower to estimate and gives no sign, but it picks up any dependence, linear or not. And if correlation missed a feature that MI caught, the relationship had no single direction to report anyway. That's why MI matters.

MI has been around since 1948. Try it.

quique@databirds.ai
