Machine Learning Interview Questions

Mixed

Prepare for machine learning interviews with questions on regression, classification, neural networks, NLP, model evaluation, and core ML concepts.

15 questions
4 easy
8 medium
3 hard

Supervised learning trains models on labeled data (input-output pairs) to predict outcomes for new inputs, and includes tasks like classification and regression. Unsupervised learning finds patterns in unlabeled data without predefined outputs, including clustering, dimensionality reduction, and anomaly detection. Semi-supervised learning combines both by using a small amount of labeled data with a larger pool of unlabeled data.
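A minimal sketch of the contrast, in pure Python with invented toy data: the supervised model is given (input, label) pairs, while the clustering routine sees only the inputs and must discover the two groups itself.

```python
def nearest_mean_classifier(X, y):
    # Supervised: learn one mean per label from (input, label) pairs.
    means = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        means[label] = sum(pts) / len(pts)
    return lambda x: min(means, key=lambda l: abs(x - means[l]))

def two_means_cluster(X, iters=10):
    # Unsupervised: discover two groups with no labels (1-D k-means, k=2).
    c0, c1 = min(X), max(X)
    for _ in range(iters):
        g0 = [x for x in X if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in X if abs(x - c0) > abs(x - c1)]
        c0 = sum(g0) / len(g0) if g0 else c0
        c1 = sum(g1) / len(g1) if g1 else c1
    return c0, c1
```

Note that the clusterer recovers roughly the same two groups the classifier was told about, but it cannot name them: without labels there is no notion of which cluster is which class.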

fundamentals, learning-types

Bias is the error from oversimplified assumptions that cause a model to underfit (miss relevant patterns). Variance is the error from sensitivity to training data fluctuations that cause overfitting (capturing noise). The tradeoff exists because reducing one typically increases the other. The goal is to find the sweet spot that minimizes total error (bias^2 + variance + irreducible error) on unseen data.
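The total-error formula above is the standard bias–variance decomposition of expected squared error at a fixed input x, with f the true function, f-hat the learned estimator, and sigma-squared the irreducible noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```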

theory, model-evaluation

Gradient descent is an optimization algorithm that iteratively adjusts model parameters by moving in the direction of the steepest decrease of the loss function. The learning rate controls the step size: too large causes divergence, too small causes slow convergence. Variants include batch (full dataset per step), stochastic (one sample per step), and mini-batch (subset per step). Advanced optimizers like Adam adapt learning rates per parameter.
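A minimal sketch of batch gradient descent on a one-dimensional quadratic; the function, learning rate, and step count here are illustrative choices, not from the text:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Iteratively step against the gradient of the loss."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # move in the direction of steepest decrease
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The learning-rate tradeoff from the answer shows up directly here: with lr above 1.0 the iterates overshoot and diverge for this function, while a tiny lr needs many more steps to approach x = 3.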

optimization, training

Precision is the ratio of true positives to all predicted positives (how many of the positive predictions were correct). Recall is the ratio of true positives to all actual positives (how many real positives the model found). High precision means few false positives; high recall means few false negatives. The F1 score is their harmonic mean, providing a single metric that balances both.
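These definitions translate directly into counts of true/false positives and negatives; a self-contained sketch for binary labels:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # found among actual positives
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean
    return precision, recall, f1
```

For example, with y_true = [1, 1, 1, 0, 0] and y_pred = [1, 1, 0, 1, 0] there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 2/3.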

evaluation, classification

Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule layer by layer from the output back to the input. These gradients indicate how much each weight contributed to the error. The weights are then updated using gradient descent to minimize the loss. This forward-pass-then-backward-pass cycle repeats over many epochs until the model converges.
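The chain rule step can be made concrete on the smallest possible "network", a single sigmoid neuron with squared-error loss (the specific parameter values below are illustrative):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(w, b, x, y):
    # Forward pass: linear -> sigmoid -> squared error.
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def backprop(w, b, x, y):
    # Backward pass: chain rule from the loss back to the weights.
    a = sigmoid(w * x + b)
    dL_da = 2 * (a - y)    # d(loss)/d(activation)
    da_dz = a * (1 - a)    # sigmoid derivative
    return dL_da * da_dz * x, dL_da * da_dz  # dL/dw, dL/db
```

A useful sanity check (and a common interview follow-up) is that these analytic gradients match a finite-difference estimate of the loss.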

neural-networks, training

Overfitting occurs when a model learns noise and specific patterns in the training data that do not generalize to unseen data, resulting in high training accuracy but poor test performance. Prevention techniques include regularization (L1/L2), dropout, early stopping, data augmentation, cross-validation, reducing model complexity, and increasing training data. Monitoring the gap between training and validation loss helps detect overfitting early.
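Early stopping, one of the techniques listed, amounts to a small control loop over the validation loss; a sketch under assumed callbacks (`train_step` and `val_loss` are placeholders for whatever training framework is in use):

```python
def train_with_early_stopping(train_step, val_loss, max_epochs=100, patience=5):
    """Stop once validation loss fails to improve for `patience` epochs."""
    best, wait, best_epoch = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step(epoch)              # one epoch of training
        loss = val_loss(epoch)         # monitor held-out performance
        if loss < best - 1e-6:
            best, wait, best_epoch = loss, 0, epoch
        else:
            wait += 1                  # widening train/val gap: likely overfitting
            if wait >= patience:
                break
    return best_epoch, best
```

In practice one would also checkpoint the model weights at `best_epoch` and restore them after stopping.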

overfitting, regularization

Convolutional Neural Networks (CNNs) use convolutional filters to detect spatial patterns and hierarchical features, making them ideal for images, video, and grid-structured data. Recurrent Neural Networks (RNNs) process sequential data by maintaining hidden states that carry information across time steps, suited for text, speech, and time series. CNNs share weights spatially (across filter positions), while RNNs share weights temporally (across time steps).
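The two weight-sharing schemes can be shown side by side in a stripped-down 1-D form (no learned parameters here, just the mechanics):

```python
import math

def conv1d(x, kernel):
    """Spatial weight sharing: the same kernel slides over every position."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def rnn(xs, w_h, w_x, h0=0.0):
    """Temporal weight sharing: the same (w_h, w_x) is reused at every step."""
    h = h0
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)  # hidden state carries history forward
    return h
```

The convolution produces one output per spatial position using identical weights; the RNN produces a single running state, folding the sequence through identical weights at each time step.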

neural-networks, architecture

The Transformer uses self-attention mechanisms to process all positions in a sequence simultaneously rather than sequentially like RNNs, enabling massive parallelization and better modeling of long-range dependencies. It consists of encoder and decoder stacks with multi-head attention, feed-forward layers, and residual connections. Transformers are the foundation of modern NLP models (BERT, GPT) and have been adapted for vision, audio, and multi-modal tasks.
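A sketch of single-head scaled dot-product self-attention in NumPy; the shapes and random projection matrices are illustrative, and multi-head attention would run several such heads in parallel and concatenate them:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # every position attends to every position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # 5 positions, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Note there is no loop over time: all 5x5 position pairs are scored in one matrix product, which is exactly the parallelism advantage over RNNs described above.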

neural-networks, nlp

Cross-validation splits the dataset into k folds, trains the model on k-1 folds, validates on the remaining fold, and repeats k times so each fold serves as the validation set once. The results are averaged to provide a more reliable estimate of model performance than a single train-test split. It reduces the impact of data splitting randomness and helps detect overfitting, especially valuable when the dataset is small.
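The fold bookkeeping is simple enough to write from scratch (libraries typically also offer shuffling and stratification, omitted here):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k (train, validation) pairs."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    folds, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]              # this fold validates
        train = indices[:start] + indices[start + size:]  # the rest trains
        folds.append((train, val))
        start += size
    return folds
```

Each index lands in exactly one validation fold, so averaging the k validation scores uses every sample for evaluation exactly once.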

evaluation, validation

Bagging (Bootstrap Aggregating) trains multiple models independently on random subsets of the data and combines their predictions (averaging for regression, voting for classification), reducing variance. Random Forest is a popular bagging method. Boosting trains models sequentially, with each new model focusing on the errors of the previous ones, reducing bias. XGBoost and AdaBoost are common boosting algorithms. Bagging reduces variance; boosting reduces bias.
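The two bagging ingredients, bootstrap resampling and prediction combining, are short enough to sketch directly (boosting's sequential reweighting is omitted here):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement: one model's training set."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine per-model prediction lists column-wise by plurality vote."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]
```

Each base model trains on its own `bootstrap_sample`, and `majority_vote` aggregates their classifications; for regression one would average instead of vote.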

ensemble, methods

A support vector machine (SVM) finds the hyperplane that maximizes the margin (distance) between two classes in feature space. Support vectors are the closest data points to the decision boundary that define the margin. For non-linearly separable data, the kernel trick maps inputs into a higher-dimensional space where a linear separator exists. Common kernels include linear, polynomial, and RBF (Gaussian).
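The RBF (Gaussian) kernel mentioned above computes similarity in the implicit high-dimensional space without ever constructing it; a minimal sketch:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Identical points score 1.0 and the similarity decays smoothly with squared distance; `gamma` controls how quickly, which is why it is a key SVM hyperparameter to tune.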

classification, svm

Transfer learning reuses a model trained on a large dataset (source task) as the starting point for a different but related task (target task), significantly reducing the amount of labeled data and training time needed. Typically, early layers that capture general features are frozen while later layers are fine-tuned for the specific task. It is widely used in NLP (fine-tuning BERT, GPT) and computer vision (fine-tuning ImageNet-pretrained models).
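A toy NumPy sketch of the freeze-early-layers idea, with a fixed random projection standing in for pretrained early layers and synthetic data throughout (all names and values here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

W_frozen = rng.normal(size=(4, 3))   # "pretrained" early layers: never updated

def features(X):
    return np.tanh(X @ W_frozen)     # frozen feature extractor

X = rng.normal(size=(50, 4))         # target-task inputs (synthetic)
y = X @ rng.normal(size=4)           # target-task labels (synthetic)
F = features(X)

head = np.zeros(3)                   # task head: the only part fine-tuned
mse_before = float(np.mean((F @ head - y) ** 2))
for _ in range(200):
    head -= 0.1 * F.T @ (F @ head - y) / len(y)   # gradient step on head only
mse_after = float(np.mean((F @ head - y) ** 2))
```

Only the 3-parameter head is trained, which is why transfer learning needs far less labeled target data than training the whole network; real pipelines do the same with a pretrained backbone (e.g. an ImageNet model) in place of `W_frozen`.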

techniques, training

The vanishing gradient problem occurs when gradients become extremely small as they propagate backward through many layers, causing early layers to learn very slowly or not at all. This is especially problematic with sigmoid and tanh activations in deep networks. Solutions include using ReLU activations, residual connections (skip connections), batch normalization, LSTM/GRU cells for recurrent networks, and careful weight initialization schemes like Xavier or He initialization.
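During backpropagation the gradient reaching an early layer is a product of per-layer local gradients, which makes the problem easy to demonstrate numerically:

```python
def grad_through_layers(depth, local_grad):
    """Product of per-layer local gradients, as seen by the earliest layer."""
    g = 1.0
    for _ in range(depth):
        g *= local_grad
    return g

# The sigmoid's derivative is at most 0.25, so a 20-layer chain shrinks
# the signal by up to 0.25**20; ReLU's derivative is 1 on its active
# region, so the gradient can pass through undiminished.
vanishing = grad_through_layers(20, 0.25)  # sigmoid-style worst case
surviving = grad_through_layers(20, 1.0)   # ReLU-style active path
```

Residual connections attack the same product structure differently: the skip path adds an identity term, so the gradient never has to pass through every multiplicative factor.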

neural-networks, training

Word embeddings are dense, low-dimensional vector representations of words where semantically similar words are mapped to nearby points. Word2Vec learns these embeddings using either CBOW (predicting a word from its context) or Skip-gram (predicting context from a word) on large text corpora. The resulting vectors capture semantic relationships: for example, vector('king') - vector('man') + vector('woman') approximates vector('queen'). Modern approaches like BERT use contextual embeddings that vary based on surrounding words.
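The king/queen analogy can be reproduced with tiny hand-built vectors (invented for illustration: one axis loosely encoding "royalty", the other "gender") and cosine similarity, the standard way to compare embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 2-D embeddings; real Word2Vec vectors have hundreds of dimensions.
vec = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [2.0, 0.0],
    "queen": [2.0, 1.0],
}
analogy = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
```

Here `king - man + woman` lands exactly on `queen`; with real learned embeddings the match is only approximate, so one reports the nearest vector by cosine similarity.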

nlp, embeddings

The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate at various classification thresholds. AUC (Area Under the Curve) summarizes the ROC curve as a single number between 0 and 1, where 1 represents perfect classification and 0.5 represents random guessing. AUC is useful because it evaluates model performance across all thresholds rather than at a single point, making it threshold-independent and robust for imbalanced datasets.
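AUC has an equivalent rank-based reading that avoids sweeping thresholds explicitly: it is the probability that a randomly chosen positive is scored above a randomly chosen negative. A direct (O(P*N), fine for small examples) sketch:

```python
def auc(y_true, scores):
    """AUC as P(random positive outranks random negative), ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfectly separated scores give 1.0, and a scorer that cannot distinguish the classes gives 0.5, matching the random-guessing baseline in the answer; production code uses a sort-based O(n log n) version of the same statistic.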

evaluation, classification