Machine Learning Interview Questions

Mixed

Prepare for machine learning interviews with questions on regression, classification, neural networks, NLP, model evaluation, and core ML concepts.

15 questions
4 easy
8 medium
3 hard

Supervised learning trains models on labeled data (input-output pairs) to predict outcomes for new inputs, and includes tasks like classification and regression. Unsupervised learning finds patterns in unlabeled data without predefined outputs, including clustering, dimensionality reduction, and anomaly detection. Semi-supervised learning combines both by using a small amount of labeled data with a larger pool of unlabeled data.
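A minimal sketch of the contrast, in pure Python with invented toy data: the supervised model is given (input, label) pairs, while the clustering routine sees only the inputs and must discover the two groups itself.

```python
def nearest_mean_classifier(X, y):
    # Supervised: learn one mean per label from (input, label) pairs.
    means = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        means[label] = sum(pts) / len(pts)
    return lambda x: min(means, key=lambda l: abs(x - means[l]))

def two_means_cluster(X, iters=10):
    # Unsupervised: discover two groups with no labels (1-D k-means, k=2).
    c0, c1 = min(X), max(X)
    for _ in range(iters):
        g0 = [x for x in X if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in X if abs(x - c0) > abs(x - c1)]
        c0 = sum(g0) / len(g0) if g0 else c0
        c1 = sum(g1) / len(g1) if g1 else c1
    return c0, c1
```

Note that the clusterer recovers roughly the same two groups the classifier was told about, but it cannot name them: without labels there is no notion of which cluster is which class.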

fundamentals, learning-types

Bias is the error from oversimplified assumptions that cause a model to underfit (miss relevant patterns). Variance is the error from sensitivity to training data fluctuations that cause overfitting (capturing noise). The tradeoff exists because reducing one typically increases the other. The goal is to find the sweet spot that minimizes total error (bias^2 + variance + irreducible error) on unseen data.
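The total-error formula above is the standard bias–variance decomposition of expected squared error at a fixed input x, with f the true function, f-hat the learned estimator, and sigma-squared the irreducible noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```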

theory, model-evaluation

Gradient descent is an optimization algorithm that iteratively adjusts model parameters by moving in the direction of the steepest decrease of the loss function. The learning rate controls the step size: too large causes divergence, too small causes slow convergence. Variants include batch (full dataset per step), stochastic (one sample per step), and mini-batch (subset per step). Advanced optimizers like Adam adapt learning rates per parameter.
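A minimal sketch of batch gradient descent on a one-dimensional quadratic; the function, learning rate, and step count here are illustrative choices, not from the text:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Iteratively step against the gradient of the loss."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # move in the direction of steepest decrease
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The learning-rate tradeoff from the answer shows up directly here: with lr above 1.0 the iterates overshoot and diverge for this function, while a tiny lr needs many more steps to approach x = 3.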

optimization, training

Precision is the ratio of true positives to all predicted positives (how many of the positive predictions were correct). Recall is the ratio of true positives to all actual positives (how many real positives the model found). High precision means few false positives; high recall means few false negatives. The F1 score is their harmonic mean, providing a single metric that balances both.
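These definitions translate directly into counts of true/false positives and negatives; a self-contained sketch for binary labels:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # found among actual positives
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean
    return precision, recall, f1
```

For example, with y_true = [1, 1, 1, 0, 0] and y_pred = [1, 1, 0, 1, 0] there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 2/3.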

evaluation, classification

Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule layer by layer from the output back to the input. These gradients indicate how much each weight contributed to the error. The weights are then updated using gradient descent to minimize the loss. This forward-pass-then-backward-pass cycle repeats over many epochs until the model converges.
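The chain rule step can be made concrete on the smallest possible "network", a single sigmoid neuron with squared-error loss (the specific parameter values below are illustrative):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(w, b, x, y):
    # Forward pass: linear -> sigmoid -> squared error.
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def backprop(w, b, x, y):
    # Backward pass: chain rule from the loss back to the weights.
    a = sigmoid(w * x + b)
    dL_da = 2 * (a - y)    # d(loss)/d(activation)
    da_dz = a * (1 - a)    # sigmoid derivative
    return dL_da * da_dz * x, dL_da * da_dz  # dL/dw, dL/db
```

A useful sanity check (and a common interview follow-up) is that these analytic gradients match a finite-difference estimate of the loss.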

neural-networks, training

Overfitting occurs when a model learns noise and specific patterns in the training data that do not generalize to unseen data, resulting in high training accuracy but poor test performance. Prevention techniques include regularization (L1/L2), dropout, early stopping, data augmentation, cross-validation, reducing model complexity, and increasing training data. Monitoring the gap between training and validation loss helps detect overfitting early.
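Early stopping, one of the techniques listed, amounts to a small control loop over the validation loss; a sketch under assumed callbacks (`train_step` and `val_loss` are placeholders for whatever training framework is in use):

```python
def train_with_early_stopping(train_step, val_loss, max_epochs=100, patience=5):
    """Stop once validation loss fails to improve for `patience` epochs."""
    best, wait, best_epoch = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step(epoch)              # one epoch of training
        loss = val_loss(epoch)         # monitor held-out performance
        if loss < best - 1e-6:
            best, wait, best_epoch = loss, 0, epoch
        else:
            wait += 1                  # widening train/val gap: likely overfitting
            if wait >= patience:
                break
    return best_epoch, best
```

In practice one would also checkpoint the model weights at `best_epoch` and restore them after stopping.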

overfitting, regularization

Convolutional Neural Networks (CNNs) use convolutional filters to detect spatial patterns and hierarchical features, making them ideal for images, video, and grid-structured data. Recurrent Neural Networks (RNNs) process sequential data by maintaining hidden states that carry information across time steps, suited for text, speech, and time series. CNNs share weights spatially (across filter positions), while RNNs share weights temporally (across time steps).
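The two weight-sharing schemes can be shown side by side in a stripped-down 1-D form (no learned parameters here, just the mechanics):

```python
import math

def conv1d(x, kernel):
    """Spatial weight sharing: the same kernel slides over every position."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def rnn(xs, w_h, w_x, h0=0.0):
    """Temporal weight sharing: the same (w_h, w_x) is reused at every step."""
    h = h0
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)  # hidden state carries history forward
    return h
```

The convolution produces one output per spatial position using identical weights; the RNN produces a single running state, folding the sequence through identical weights at each time step.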

neural-networks, architecture

The Transformer uses self-attention mechanisms to process all positions in a sequence simultaneously rather than sequentially like RNNs, enabling massive parallelization and better modeling of long-range dependencies. It consists of encoder and decoder stacks with multi-head attention, feed-forward layers, and residual connections. Transformers are the foundation of modern NLP models (BERT, GPT) and have been adapted for vision, audio, and multi-modal tasks.
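A sketch of single-head scaled dot-product self-attention in NumPy; the shapes and random projection matrices are illustrative, and multi-head attention would run several such heads in parallel and concatenate them:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # every position attends to every position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # 5 positions, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Note there is no loop over time: all 5x5 position pairs are scored in one matrix product, which is exactly the parallelism advantage over RNNs described above.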

neural-networks, nlp

Cross-validation splits the dataset into k folds, trains the model on k-1 folds, validates on the remaining fold, and repeats k times so each fold serves as the validation set once. The results are averaged to provide a more reliable estimate of model performance than a single train-test split. It reduces the impact of data splitting randomness and helps detect overfitting, especially valuable when the dataset is small.
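The fold bookkeeping is simple enough to write from scratch (libraries typically also offer shuffling and stratification, omitted here):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k (train, validation) pairs."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    folds, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]              # this fold validates
        train = indices[:start] + indices[start + size:]  # the rest trains
        folds.append((train, val))
        start += size
    return folds
```

Each index lands in exactly one validation fold, so averaging the k validation scores uses every sample for evaluation exactly once.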

evaluation, validation

Bagging (Bootstrap Aggregating) trains multiple models independently on random subsets of the data and combines their predictions (averaging for regression, voting for classification), reducing variance. Random Forest is a popular bagging method. Boosting trains models sequentially, with each new model focusing on the errors of the previous ones, reducing bias. XGBoost and AdaBoost are common boosting algorithms. Bagging reduces variance; boosting reduces bias.
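The two bagging ingredients, bootstrap resampling and prediction combining, are short enough to sketch directly (boosting's sequential reweighting is omitted here):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement: one model's training set."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine per-model prediction lists column-wise by plurality vote."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]
```

Each base model trains on its own `bootstrap_sample`, and `majority_vote` aggregates their classifications; for regression one would average instead of vote.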

ensemble, methods

A support vector machine (SVM) finds the hyperplane that maximizes the margin (distance) between two classes in feature space. Support vectors are the closest data points to the decision boundary that define the margin. For non-linearly separable data, the kernel trick maps inputs into a higher-dimensional space where a linear separator exists. Common kernels include linear, polynomial, and RBF (Gaussian).
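The RBF (Gaussian) kernel mentioned above computes similarity in the implicit high-dimensional space without ever constructing it; a minimal sketch:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Identical points score 1.0 and the similarity decays smoothly with squared distance; `gamma` controls how quickly, which is why it is a key SVM hyperparameter to tune.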

classification, svm

Transfer learning reuses a model trained on a large dataset (source task) as the starting point for a different but related task (target task), significantly reducing the amount of labeled data and training time needed. Typically, early layers that capture general features are frozen while later layers are fine-tuned for the specific task. It is widely used in NLP (fine-tuning BERT, GPT) and computer vision (fine-tuning ImageNet-pretrained models).
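A toy NumPy sketch of the freeze-early-layers idea, with a fixed random projection standing in for pretrained early layers and synthetic data throughout (all names and values here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

W_frozen = rng.normal(size=(4, 3))   # "pretrained" early layers: never updated

def features(X):
    return np.tanh(X @ W_frozen)     # frozen feature extractor

X = rng.normal(size=(50, 4))         # target-task inputs (synthetic)
y = X @ rng.normal(size=4)           # target-task labels (synthetic)
F = features(X)

head = np.zeros(3)                   # task head: the only part fine-tuned
mse_before = float(np.mean((F @ head - y) ** 2))
for _ in range(200):
    head -= 0.1 * F.T @ (F @ head - y) / len(y)   # gradient step on head only
mse_after = float(np.mean((F @ head - y) ** 2))
```

Only the 3-parameter head is trained, which is why transfer learning needs far less labeled target data than training the whole network; real pipelines do the same with a pretrained backbone (e.g. an ImageNet model) in place of `W_frozen`.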

techniques, training

The vanishing gradient problem occurs when gradients become extremely small as they propagate backward through many layers, causing early layers to learn very slowly or not at all. This is especially problematic with sigmoid and tanh activations in deep networks. Solutions include using ReLU activations, residual connections (skip connections), batch normalization, LSTM/GRU cells for recurrent networks, and careful weight initialization schemes like Xavier or He initialization.
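During backpropagation the gradient reaching an early layer is a product of per-layer local gradients, which makes the problem easy to demonstrate numerically:

```python
def grad_through_layers(depth, local_grad):
    """Product of per-layer local gradients, as seen by the earliest layer."""
    g = 1.0
    for _ in range(depth):
        g *= local_grad
    return g

# The sigmoid's derivative is at most 0.25, so a 20-layer chain shrinks
# the signal by up to 0.25**20; ReLU's derivative is 1 on its active
# region, so the gradient can pass through undiminished.
vanishing = grad_through_layers(20, 0.25)  # sigmoid-style worst case
surviving = grad_through_layers(20, 1.0)   # ReLU-style active path
```

Residual connections attack the same product structure differently: the skip path adds an identity term, so the gradient never has to pass through every multiplicative factor.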

neural-networks, training

Word embeddings are dense, low-dimensional vector representations of words where semantically similar words are mapped to nearby points. Word2Vec learns these embeddings using either CBOW (predicting a word from its context) or Skip-gram (predicting context from a word) on large text corpora. The resulting vectors capture semantic relationships: for example, vector('king') - vector('man') + vector('woman') approximates vector('queen'). Modern approaches like BERT use contextual embeddings that vary based on surrounding words.
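The king/queen analogy can be reproduced with tiny hand-built vectors (invented for illustration: one axis loosely encoding "royalty", the other "gender") and cosine similarity, the standard way to compare embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 2-D embeddings; real Word2Vec vectors have hundreds of dimensions.
vec = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [2.0, 0.0],
    "queen": [2.0, 1.0],
}
analogy = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
```

Here `king - man + woman` lands exactly on `queen`; with real learned embeddings the match is only approximate, so one reports the nearest vector by cosine similarity.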

nlp, embeddings

The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate at various classification thresholds. AUC (Area Under the Curve) summarizes the ROC curve as a single number between 0 and 1, where 1 represents perfect classification and 0.5 represents random guessing. AUC is useful because it evaluates model performance across all thresholds rather than at a single point, making it threshold-independent and robust for imbalanced datasets.
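AUC has an equivalent rank-based reading that avoids sweeping thresholds explicitly: it is the probability that a randomly chosen positive is scored above a randomly chosen negative. A direct (O(P*N), fine for small examples) sketch:

```python
def auc(y_true, scores):
    """AUC as P(random positive outranks random negative), ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfectly separated scores give 1.0, and a scorer that cannot distinguish the classes gives 0.5, matching the random-guessing baseline in the answer; production code uses a sort-based O(n log n) version of the same statistic.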

evaluation, classification