Four interactive demos and a full presentation deck to build intuitive understanding of core probability concepts — the mathematical foundation of Artificial Neural Networks.
Having an intersection does not automatically mean independence. It depends on exactly how much the circles overlap — independence requires P(A∩B) = P(A)×P(B) precisely.
CONCEPTUAL MAP
All Event Relationships
├── ● Independent → overlap exists, but P(A∩B) = P(A)×P(B) exactly. Knowing B tells you nothing about A.
└── ● Dependent → knowing one event changes the probability of the other
    ├── ◌ Mutually Exclusive → no intersection. If B happened, A definitely did NOT.
    ├── ◌ Overlap, wrong size → circles intersect but P(A∩B) ≠ P(A)×P(B)
    └── ◌ Subset → B inside A. If B happened, A definitely happened.
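The map above can be turned into a small classifier. This is only a sketch: the function name `classify_events`, the tolerance `tol`, and the returned labels are illustrative choices, not part of the demo. Independence is checked first, so degenerate cases (for example P(B) = 0) count as independent.

```python
import math

def classify_events(p_a: float, p_b: float, p_ab: float, tol: float = 1e-9) -> str:
    """Name the relationship between A and B, following the conceptual map."""
    # The overlap cannot exceed the smaller circle, and P(A ∪ B) cannot exceed 1.
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    if not (lower - tol <= p_ab <= upper + tol):
        raise ValueError("P(A∩B) is not consistent with P(A) and P(B)")
    # Independence: the overlap equals the product of the two probabilities.
    if math.isclose(p_ab, p_a * p_b, abs_tol=tol):
        return "independent"
    # Everything below is a form of dependence.
    if math.isclose(p_ab, 0.0, abs_tol=tol):
        return "mutually exclusive"
    if math.isclose(p_ab, upper, abs_tol=tol):
        return "subset"  # the smaller event lies entirely inside the larger one
    return "overlap, wrong size"

print(classify_events(0.40, 0.30, 0.12))
print(classify_events(0.40, 0.30, 0.00))
print(classify_events(0.40, 0.30, 0.30))
print(classify_events(0.40, 0.30, 0.10))
```

With P(A) = 0.40 and P(B) = 0.30, the only overlap that counts as independent is 0.12; any other consistent value falls into one of the three dependent cases.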
Mutually Exclusive — A special case of Dependence
No intersection. If B happens, A cannot. As long as both events have nonzero probability, this is dependence.
Adjust Probabilities
P(A) 0.40
P(B) 0.30
P(A ∩ B) — actual overlap 0.10
💡 Goal: Set P(A∩B) = P(A)×P(B) for independence.
The green marker on the bar shows exactly where that is.
Probabilities
P(A) 0.40
P(B) 0.30
P(A ∩ B), actual 0.00
P(A)×P(B), if independent 0.12
P(A ∪ B) 0.70
P(A | B) 0.00
P(B | A) 0.00
Independence check — actual overlap vs expected
■ Actual P(A∩B) | Expected P(A)×P(B)
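The readouts in the Probabilities panel all follow from the three slider values. A minimal sketch of that arithmetic, assuming the standard definitions (inclusion-exclusion for the union, overlap divided by the conditioning event for the conditionals); the function name `venn_readouts` and the dictionary keys are hypothetical:

```python
def venn_readouts(p_a: float, p_b: float, p_ab: float) -> dict:
    """Derive the panel values from P(A), P(B) and the actual overlap P(A∩B)."""
    return {
        # P(A ∪ B) = P(A) + P(B) - P(A∩B)
        "union": p_a + p_b - p_ab,
        # P(A | B) = P(A∩B) / P(B), undefined when P(B) = 0
        "a_given_b": p_ab / p_b if p_b > 0 else None,
        # P(B | A) = P(A∩B) / P(A), undefined when P(A) = 0
        "b_given_a": p_ab / p_a if p_a > 0 else None,
        # The overlap that independence would require
        "expected_if_independent": p_a * p_b,
    }

# Values from the panel: P(A) = 0.40, P(B) = 0.30, actual overlap 0.00
values = venn_readouts(0.40, 0.30, 0.00)
for name, value in values.items():
    print(f"{name}: {value:.2f}")
```

For these inputs the sketch gives a union of 0.70, both conditionals at 0.00, and an expected overlap of 0.12, matching the panel. Because the actual overlap (0.00) differs from the expected one (0.12), the events are dependent.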
DEMO 02 — INTERACTIVE
Bayes' Theorem
How should you update your belief after seeing new evidence? Adjust the sliders and watch the posterior probability change in real time.
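The update itself is a one-line formula. The page text does not say which sliders the demo exposes, so the three inputs below (prior, likelihood, false-positive rate) and the example numbers are assumptions chosen for illustration:

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Return P(H | E) using Bayes' theorem.

    prior               = P(H)        belief before seeing the evidence
    likelihood          = P(E | H)    chance of the evidence if H is true
    false_positive_rate = P(E | ¬H)   chance of the evidence if H is false
    """
    # Total probability of the evidence, summed over both cases.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("P(E) is zero, so the posterior is undefined")
    return likelihood * prior / evidence

# Hypothetical example: a rare hypothesis (1%) and a fairly reliable test.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"posterior = {p:.3f}")
```

Even with strong evidence, a low prior keeps the posterior modest, which is the effect the sliders are meant to make visible.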
DEMO 03 — INTERACTIVE
Law of Large Numbers
Flip a coin thousands of times and watch the observed frequency converge to the true probability. Random noise averages out, given enough trials.
Speed:
Coin bias: Fair (50%)
TOTAL FLIPS
0
HEADS
0
OBSERVED P(H)
—
TRUE P(H)
50%
Last 30 flips will appear here...
Convergence to True Probability
Observed P(H) | True P(H)
Law of Large Numbers: As the number of trials increases, the sample mean approaches the expected value (true probability). Start flipping to see the convergence happen in real time.
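The same experiment can be run offline. The sketch below is not the demo's own code; `running_frequency`, its parameters, and the fixed seed are illustrative assumptions. It simulates biased or fair coin flips and records the observed frequency of heads after each flip.

```python
import random

def running_frequency(n_flips: int, p_heads: float = 0.5, seed: int = 0) -> list:
    """Simulate n_flips coin flips and return the observed P(H) after each flip."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    heads = 0
    frequencies = []
    for flip_number in range(1, n_flips + 1):
        if rng.random() < p_heads:  # heads with probability p_heads
            heads += 1
        frequencies.append(heads / flip_number)
    return frequencies

freqs = running_frequency(10_000, p_heads=0.5)
for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} flips: observed P(H) = {freqs[n - 1]:.4f}")
```

Early values can swing widely; the typical gap from the true probability shrinks roughly in proportion to 1/√n as the number of flips n grows.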
DEMO 04 — INTERACTIVE
Softmax Function
Softmax converts raw scores (logits) from a neural network into a proper probability distribution: all values between 0 and 1, summing to 1. It is the final step in most neural-network classifiers.
z = raw scores (logits) · T = temperature (default 1.0)
Output: probability distribution over classes
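Written out with the symbols defined above, the standard temperature-scaled softmax is as follows. The class index i, the summation index j, and the class count K are notation introduced here for the formula:

```latex
\[
\mathrm{softmax}(z)_i \;=\; \frac{\exp(z_i / T)}{\sum_{j=1}^{K} \exp(z_j / T)},
\qquad i = 1, \dots, K
\]
```

At T = 1 this is the ordinary softmax. A larger T flattens the distribution toward uniform, and a smaller T concentrates probability on the highest score.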
Why exp()? — Softmax vs. Simple Normalization
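Simple normalization (dividing each score by the total) also produces values that sum to 1, but it treats scores linearly and breaks down when scores are negative or sum to zero. Softmax uses exp(), which keeps every output positive and turns differences between scores into probability ratios: a gap of 1 between two scores multiplies their ratio by e ≈ 2.718. When scores are well separated, the highest score gets a disproportionately larger probability, making the model more decisive. Paired with cross-entropy loss, softmax also yields a simple gradient (predicted probability minus target), which is why the two are used together in training.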
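The comparison can be checked numerically. The sketch below is a plain-Python illustration, not the demo's implementation; the function names and example scores are hypothetical, and subtracting the maximum before exp() is a common numerical-stability step added here that does not change the result.

```python
import math

def softmax(z, temperature=1.0):
    """Temperature-scaled softmax over a list of raw scores (logits)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in z]
    m = max(scaled)  # shift by the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def simple_normalize(z):
    """Divide each score by the total; only sensible for positive scores."""
    total = sum(z)
    return [x / total for x in z]

scores = [1.0, 2.0, 3.0]  # hypothetical logits
print("simple :", [round(p, 3) for p in simple_normalize(scores)])
print("softmax:", [round(p, 3) for p in softmax(scores)])
print("T = 5  :", [round(p, 3) for p in softmax(scores, temperature=5.0)])

# With a negative score, simple normalization produces a negative "probability".
print("simple with a negative score:", simple_normalize([-1.0, 2.0, 1.0]))
```

For these scores, softmax gives the top class a larger share than simple normalization does, and raising the temperature pulls the distribution back toward uniform.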