A/B Testing is a statistical method of comparing two versions (A and B) of a variable to determine which performs better against a defined metric. It involves splitting users into control and treatment groups, running the experiment, and using statistical tests (t-test, chi-squared) to determine if the difference is statistically significant.

A p-value is the probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true. A small p-value (typically less than 0.05) indicates strong evidence against the null hypothesis, suggesting the result is statistically significant.

Top 25 Data Science Interview Questions 2026

Q: What Python libraries are essential for Data Science?

The core libraries are Pandas (data manipulation), NumPy (numerical computing), Matplotlib and Seaborn (visualization), Scikit-learn (ML), and Jupyter Notebooks (interactive development). In 2026, Polars is also gaining traction as a faster alternative to Pandas for large datasets.

Q: What is the difference between a DataFrame and a Series in Pandas?

A Series is a one-dimensional labeled array (like a single column), while a DataFrame is a two-dimensional labeled data structure with columns of potentially different types (like a table or spreadsheet). A DataFrame is essentially a collection of Series sharing the same index.

Q: What is the difference between Regression and Classification?

Regression predicts a continuous numerical output (e.g., house price, temperature), while Classification predicts a discrete category or class label (e.g., spam/not spam, disease diagnosis). Both are supervised learning techniques.

Section 1: Python for Data Science

Q1. What Python libraries are essential for Data Science?

The core Data Science Python stack in 2026 includes:

Pandas: The workhorse for data manipulation — loading, cleaning, transforming, and aggregating tabular data with DataFrames.
NumPy: Foundation for numerical computing — provides multi-dimensional arrays, linear algebra, and mathematical functions that Pandas and Scikit-learn build upon.
Matplotlib & Seaborn: Visualization libraries — Matplotlib is low-level and highly customizable; Seaborn provides beautiful statistical plots with less code.
Scikit-learn: The standard library for classical ML — classification, regression, clustering, preprocessing, and model evaluation.
Polars: A newer DataFrame library gaining adoption in 2026 — written in Rust, significantly faster than Pandas for large datasets.
Jupyter Notebooks: Interactive development environment for exploratory data analysis and presentation.

Q2. What is the difference between a list, tuple, set, and dictionary in Python?

List: Ordered, mutable, allows duplicates. [1, 2, 3]. Use for sequences you need to modify.
Tuple: Ordered, immutable, allows duplicates. (1, 2, 3). Use for fixed collections (function returns, dictionary keys).
Set: Unordered, mutable, no duplicates. {1, 2, 3}. Use for membership tests and removing duplicates — O(1) lookups.
Dictionary: Key-value pairs, mutable, keys must be unique and hashable. Preserves insertion order since Python 3.7. {"name": "Alice"}. Use for fast lookups by key; dictionaries are the backbone of JSON-like data.

Interview tip: Always discuss time complexity. Searching a list by value (x in lst) is O(n), although indexing with lst[i] is O(1); dict and set lookups are O(1) on average. This matters when processing millions of rows.
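The complexity difference can be seen with a small timing sketch. The variable names and the one-million-element size are illustrative assumptions, and the exact timings will vary by machine.

```python
import timeit

# Hypothetical data: one million integer IDs stored two ways.
ids_list = list(range(1_000_000))
ids_set = set(ids_list)

target = 999_999  # worst case for the list: the last element

# Membership test: the list scans elements one by one, the set hashes the key.
list_time = timeit.timeit(lambda: target in ids_list, number=20)
set_time = timeit.timeit(lambda: target in ids_set, number=20)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")

# Index access on a list is O(1); only searching by value is O(n).
middle = ids_list[500_000]

# Tuples are hashable, so they can serve as dictionary keys; lists cannot.
sales = {("2026-01", "EU"): 120, ("2026-01", "US"): 340}

# Removing duplicates with a set (the resulting order is not guaranteed).
unique = set([3, 1, 3, 2, 1])
```

On typical hardware the set lookup is several orders of magnitude faster than the list scan.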

Q3. What is the difference between a deep copy and a shallow copy?

A shallow copy creates a new object but references the same nested objects. Changing a nested element in the copy affects the original. A deep copy creates a completely independent clone — all nested objects are also duplicated recursively.

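In Pandas: df.copy() defaults to deep=True, which copies the data and the index so that changes to the copy do not affect the original. It does not recursively copy Python objects stored inside object-dtype columns (for example, lists held in cells). In Python: use import copy; copy.deepcopy(obj) for nested structures. This is a common source of bugs in data pipelines where accidental mutation corrupts upstream data.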
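A minimal sketch of the difference, using the standard-library copy module on a hypothetical nested dictionary:

```python
import copy

original = {"scores": [1, 2, 3], "name": "run-1"}

shallow = copy.copy(original)    # new dict, but the inner list is shared
deep = copy.deepcopy(original)   # new dict and a new inner list

shallow["scores"].append(99)     # mutates the list shared with `original`
deep["scores"].append(-1)        # touches only the deep copy

print(original["scores"])  # [1, 2, 3, 99]
print(deep["scores"])      # [1, 2, 3, -1]
```

The append through the shallow copy shows up in the original because both dictionaries point at the same list object.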

Section 2: Pandas & NumPy

Q4. What is the difference between a DataFrame and a Series in Pandas?

A Series is a one-dimensional labeled array — essentially a single column of data with an index. A DataFrame is a two-dimensional labeled data structure with rows and columns — like a spreadsheet or SQL table. A DataFrame is a collection of Series objects that share the same index.

Key distinction: selecting a single column (df['a']) returns a Series, while selecting a list of columns (df[['a', 'b']]) returns a DataFrame. Element-wise operations on a Series return a Series; reductions such as sum() or mean() return a scalar.
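The return types can be checked directly. This is a small sketch on a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

one_col = df["a"]          # single label -> Series
two_cols = df[["a", "b"]]  # list of labels -> DataFrame
still_df = df[["a"]]       # a one-element list still gives a DataFrame
doubled = df["a"] * 2      # element-wise operation -> Series
total = df["a"].sum()      # reduction -> scalar
```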

Q5. How do you handle missing data in Pandas?

Missing data (NaN/None) is one of the most common real-world data challenges. Pandas provides several strategies:

Detection: df.isnull().sum() — count missing values per column.
Removal: df.dropna() — drop rows/columns with missing values. Only appropriate when missing data is minimal and random (MCAR).
Imputation: df.fillna(df.mean(numeric_only=True)) fills numeric columns with their mean; the median or mode can be used the same way. Use df.interpolate() for time-series data.
Forward/Backward Fill: df.ffill() or df.bfill() — propagate last valid value. Common in time-series and sensor data.
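Advanced: Use Scikit-learn's KNNImputer or IterativeImputer for multivariate imputation that considers relationships between features. IterativeImputer has been marked experimental and must be enabled first with from sklearn.experimental import enable_iterative_imputer.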

Always analyze the missing data mechanism (MCAR, MAR, MNAR) before choosing a strategy — the wrong imputation method can introduce bias.
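The basic strategies above can be compared side by side on a toy frame. The column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0, np.nan, 30.0],
    "city": ["Oslo", "Oslo", None, "Rome", "Rome"],
})

missing_per_column = df.isnull().sum()               # temp: 2, city: 1
dropped = df.dropna()                                # keeps only complete rows
mean_filled = df["temp"].fillna(df["temp"].mean())   # gaps become the column mean
interpolated = df["temp"].interpolate()              # linear interpolation by default
forward_filled = df["temp"].ffill()                  # repeat the last valid value
```

Note how the three fills disagree: the mean ignores ordering, interpolation assumes a smooth trend between neighbours, and forward fill assumes the value holds until the next reading.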

Q6. What is the difference between merge, join, and concat in Pandas?

merge(): SQL-style joins on columns or indexes. Supports inner, left, right, outer, and cross joins. Most flexible — pd.merge(df1, df2, on='id', how='left').
join(): Joins on index by default. Shorthand for merge() when joining on indexes — df1.join(df2, how='inner').
concat(): Stacks DataFrames vertically (axis=0) or horizontally (axis=1). Does not join on keys — just appends — pd.concat([df1, df2], axis=0).

Interview tip: Know when to use each. merge for relational joins, concat for stacking identical schemas, join when indexes are your keys.
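A short sketch of the three functions on two hypothetical tables, orders and customers, that share an id column:

```python
import pandas as pd

orders = pd.DataFrame({"id": [1, 2, 3], "amount": [50, 20, 70]})
customers = pd.DataFrame({"id": [1, 2, 4], "name": ["Ann", "Bo", "Cy"]})

# merge: relational join on a column.
left = pd.merge(orders, customers, on="id", how="left")    # 3 rows, name is NaN for id 3
inner = pd.merge(orders, customers, on="id", how="inner")  # 2 rows (ids 1 and 2)

# join: same idea, but the keys are the indexes.
joined = orders.set_index("id").join(customers.set_index("id"), how="inner")

# concat: stack frames with the same schema; no key matching happens.
stacked = pd.concat([orders, orders], axis=0, ignore_index=True)  # 6 rows
```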

Q7. What is vectorization in NumPy and why does it matter?

Vectorization means performing operations on entire arrays at once instead of looping through elements individually in Python. NumPy operations run in compiled C code, so vectorized code is often 10-100x faster than an equivalent Python loop, depending on the operation and array size.

Example: np.array([1,2,3]) * 2 is vectorized: the multiplication runs over the whole array in a single compiled loop. The equivalent Python list comprehension [x*2 for x in [1,2,3]] is much slower at scale. In Data Science, prefer vectorized Pandas/NumPy operations over iterrows() or apply() with lambda functions wherever a vectorized equivalent exists. For larger workloads in 2026, consider Polars, which is built around columnar, vectorized execution and uses multiple cores.
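A rough way to see the gap is to time both approaches on the same array. The array size and repeat count below are arbitrary choices, and the measured ratio depends on the machine:

```python
import timeit

import numpy as np

values = np.arange(200_000, dtype=np.float64)

def doubled_vectorized():
    return values * 2  # one call; the loop runs in compiled code

def doubled_loop():
    return np.array([x * 2 for x in values])  # one Python-level step per element

# Both approaches produce the same result.
same_result = np.array_equal(doubled_vectorized(), doubled_loop())

vec_time = timeit.timeit(doubled_vectorized, number=5)
loop_time = timeit.timeit(doubled_loop, number=5)
print(f"vectorized: {vec_time:.4f}s  loop: {loop_time:.4f}s")
```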

Section 3: SQL for Data Science

Q8. What is the difference between WHERE and HAVING in SQL?

WHERE filters rows before grouping (applied to individual rows). HAVING filters groups after aggregation (applied to grouped results).

Example: "Find departments with more than 10 employees earning above 50K" —

SELECT department, COUNT(*) as emp_count
FROM employees
WHERE salary > 50000       -- filters rows first
GROUP BY department
HAVING COUNT(*) > 10;      -- filters groups after

You cannot use aggregate functions in WHERE — that is the fundamental difference.
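To try the query without a database server, it can be run against an in-memory SQLite database from Python. The rows below are invented, and the HAVING threshold is lowered to 1 so that the tiny sample produces output:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("a", "eng", 90000), ("b", "eng", 80000), ("c", "eng", 40000),
        ("d", "ops", 60000), ("e", "ops", 30000),
    ],
)

query = """
SELECT department, COUNT(*) AS emp_count
FROM employees
WHERE salary > 50000        -- row filter: drops c and e before grouping
GROUP BY department
HAVING COUNT(*) > 1         -- group filter: drops ops (only d remains there)
ORDER BY department
"""
result = conn.execute(query).fetchall()
print(result)  # [('eng', 2)]
```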

Q9. Explain the different types of SQL JOINs.

INNER JOIN: Returns only rows with matching values in both tables.
LEFT JOIN (LEFT OUTER): Returns all rows from the left table, plus matching rows from the right. Left rows with no match get NULL in the right table's columns.
RIGHT JOIN: Opposite of LEFT — all rows from the right table.
FULL OUTER JOIN: Returns all rows from both tables. NULLs where no match exists on either side.
CROSS JOIN: Cartesian product — every row from table A paired with every row from table B. Use cautiously on large tables.
SELF JOIN: Joining a table with itself — common for hierarchical data (employees and managers).

Interview tip: Be ready to write JOIN queries on a whiteboard and explain the result set size for each type.

Q10. What are Window Functions in SQL?

Window functions perform calculations across a set of rows related to the current row — without collapsing the result into a single row like GROUP BY does. They use the OVER() clause.

Common window functions:

ROW_NUMBER(): Assigns a unique sequential number to each row within a partition.
RANK() / DENSE_RANK(): Ranks rows with ties — RANK skips numbers after ties, DENSE_RANK does not.
LAG() / LEAD(): Access previous/next row values — essential for time-series comparisons.
SUM() / AVG() OVER(): Running totals and moving averages without GROUP BY.

Window functions are heavily tested in Data Science interviews because they demonstrate advanced SQL fluency required for analytical queries.
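The behaviour of RANK versus DENSE_RANK on ties is easiest to see on real output. This sketch uses Python's built-in sqlite3 module with made-up rows; it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("a", "eng", 100), ("b", "eng", 100), ("c", "eng", 80), ("d", "ops", 70)],
)

query = """
SELECT name,
       RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rnk,
       SUM(salary)  OVER (PARTITION BY department)                      AS dept_total
FROM employees
ORDER BY name
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
# ('a', 1, 1, 280)
# ('b', 1, 1, 280)
# ('c', 3, 2, 280)   <- RANK skips 2 after the tie, DENSE_RANK does not
# ('d', 1, 1, 70)
```

Every input row is still present in the output, which is the key contrast with GROUP BY.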

Q11. Write a SQL query to find the second-highest salary in each department.

WITH ranked AS (
  SELECT
    employee_name,
    department,
    salary,
    DENSE_RANK() OVER (
      PARTITION BY department
      ORDER BY salary DESC
    ) AS salary_rank        -- RANK is a reserved word in some databases, so avoid it as an alias
  FROM employees
)
SELECT employee_name, department, salary
FROM ranked
WHERE salary_rank = 2;

This uses a CTE (Common Table Expression) with DENSE_RANK() partitioned by department. DENSE_RANK is preferred over ROW_NUMBER here because if two employees share the highest salary, we still want the actual second-highest value.

Section 4: Statistics & Probability

Q12. What is the difference between mean, median, and mode?

Mean: The arithmetic average — sum of all values divided by count. Sensitive to outliers (a single billionaire skews the mean income of a neighborhood).
Median: The middle value when data is sorted. Robust to outliers — preferred for skewed distributions (income, house prices).
Mode: The most frequently occurring value. Useful for categorical data (most popular product category).

Interview tip: When asked "which measure of central tendency should you use?" — always ask about the data distribution. If symmetric and no outliers, use mean. If skewed, use median. If categorical, use mode.
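A quick illustration with Python's statistics module. The income figures (in thousands) and the category list are made up, with one deliberate outlier:

```python
import statistics

incomes = [40, 45, 50, 55, 60, 5000]  # one extreme outlier

mean = statistics.mean(incomes)      # 875: dragged upward by the outlier
median = statistics.median(incomes)  # 52.5: barely affected
mode = statistics.mode(["books", "toys", "books", "games"])  # 'books'
```

The mean lands far from every typical value, which is why the median is the usual choice for skewed data such as income.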

Q13. What is a p-value and what does statistical significance mean?

A p-value is the probability of observing results at least as extreme as the data, assuming the null hypothesis (H0) is true. It does NOT tell you the probability that H0 is true.

If p-value < 0.05 (the conventional threshold, called alpha), we reject H0 and declare the result statistically significant: data this extreme would be unlikely if only random chance were at work.

Critical nuances interviewers expect you to know:

Statistical significance ≠ practical significance. A tiny effect can be statistically significant with enough data.
The 0.05 threshold is arbitrary — some fields use 0.01 or 0.001.
Multiple comparisons inflate false positive rates — apply Bonferroni correction or FDR control.
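The multiple-comparisons point can be made concrete with a few lines of arithmetic. The five p-values below are hypothetical, and the family-wise error formula assumes the tests are independent:

```python
alpha = 0.05
p_values = [0.001, 0.012, 0.030, 0.045, 0.200]  # hypothetical results from 5 tests
m = len(p_values)

# Chance of at least one false positive across m independent tests at level alpha.
family_wise_error = 1 - (1 - alpha) ** m  # roughly 0.23 for m = 5

# Bonferroni: compare each p-value with alpha / m instead of alpha.
threshold = alpha / m
naive = [p < alpha for p in p_values]           # 4 "significant" results
bonferroni = [p < threshold for p in p_values]  # only 1 survives the correction
```

Bonferroni is conservative; FDR procedures such as Benjamini-Hochberg trade some of that strictness for more power.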

Q14. What is the Central Limit Theorem (CLT)?

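The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population's distribution, provided the observations are independent and the population has finite variance. A sample size of 30 or more is a common rule of thumb, though heavily skewed data can need considerably more.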

Why it matters for Data Science: CLT is the foundation of confidence intervals, hypothesis testing, and A/B testing. It allows us to use normal distribution-based tests (z-test, t-test) even when the underlying data is not normally distributed, as long as sample sizes are sufficient.
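A small simulation, offered as a sketch, shows the theorem at work. It draws repeated samples from an exponential distribution, which is strongly skewed, and looks at how the sample means behave. The sample size, repeat count, and seed are arbitrary choices:

```python
import random
import statistics

random.seed(0)
n = 50            # observations per sample
n_samples = 5000  # number of repeated samples

# Exponential(1) is right-skewed, with mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(n_samples)
]

center = statistics.fmean(sample_means)  # close to the population mean, 1
spread = statistics.stdev(sample_means)  # close to 1 / sqrt(n), about 0.141
```

A histogram of sample_means would look roughly bell-shaped even though the individual draws are not.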

Q15. What is Bayes' Theorem and where is it used?

Bayes' Theorem describes how to update the probability of a hypothesis based on new evidence:

P(A|B) = P(B|A) × P(A) / P(B)

P(A|B): Posterior probability — what we want to know (probability of A given evidence B).
P(B|A): Likelihood — probability of observing B if A is true.
P(A): Prior — our initial belief about A before seeing B.
P(B): Evidence — total probability of observing B.

Applications: spam filtering (Naive Bayes), medical diagnosis, recommendation systems, Bayesian A/B testing, and probabilistic programming. Bayesian methods are increasingly popular in 2026 because they naturally handle uncertainty and work well with small datasets.
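A worked example with invented numbers: a disease with 1% prevalence and a test that has 95% sensitivity and a 5% false positive rate. P(B) is expanded with the law of total probability:

```python
prior = 0.01           # P(A): probability of disease before testing
sensitivity = 0.95     # P(B|A): positive test given disease
false_positive = 0.05  # P(B|not A): positive test given no disease

# P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
evidence = sensitivity * prior + false_positive * (1 - prior)

# P(A|B) = P(B|A) * P(A) / P(B)
posterior = sensitivity * prior / evidence
print(round(posterior, 3))  # 0.161
```

Even with a positive result from a fairly accurate test, the probability of disease is only about 16%, because the condition is rare to begin with.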

Q16. What is the difference between Type I and Type II errors?

Type I Error (False Positive): Rejecting the null hypothesis when it is actually true. You conclude there is an effect when there is not. Controlled by the significance level (alpha).
Type II Error (False Negative): Failing to reject the null hypothesis when it is actually false. You miss a real effect. Its probability is beta; statistical power (1 - beta) is the probability of detecting a real effect.

Trade-off: At a fixed sample size, reducing Type I errors (lower alpha) increases Type II errors, and vice versa; collecting more data is the way to reduce both. In medical testing, Type II errors (missing a disease) are more dangerous. In spam filtering, Type I errors (marking real email as spam) are more costly. Always discuss the business context when explaining this trade-off.

Section 5: Regression & Classification

Q17. What is the difference between Regression and Classification?

Regression predicts a continuous numerical output — house prices, stock prices, temperature, revenue forecasts. Common algorithms: Linear Regression, Ridge, Lasso, Random Forest Regressor, XGBoost Regressor.

Classification predicts a discrete category or class label — spam/not spam, fraud detection, disease diagnosis, customer churn. Common algorithms: Logistic Regression, Decision Trees, Random Forest Classifier, SVM, XGBoost Classifier, Neural Networks.

Key evaluation metrics differ: Regression uses MAE, MSE, RMSE, R-squared. Classification uses Accuracy, Precision, Recall, F1-Score, AUC-ROC.

Q18. What is the Confusion Matrix and how do you interpret it?

A confusion matrix is a table that summarizes classification performance by showing predicted vs. actual labels:

True Positive (TP): Correctly predicted positive.
True Negative (TN): Correctly predicted negative.
False Positive (FP): Predicted positive but actually negative (Type I error).
False Negative (FN): Predicted negative but actually positive (Type II error).

Derived metrics: Precision = TP/(TP+FP) — "of all positive predictions, how many were correct?". Recall = TP/(TP+FN) — "of all actual positives, how many did we catch?". F1-Score = harmonic mean of Precision and Recall — balanced metric when classes are imbalanced.
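The four cells and the derived metrics can be computed by hand on a toy set of ten labels. The data is illustrative; in practice a library function such as Scikit-learn's confusion_matrix would normally be used:

```python
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # 3
tn = sum(a == 0 and p == 0 for a, p in pairs)  # 5
fp = sum(a == 0 and p == 1 for a, p in pairs)  # 1
fn = sum(a == 1 and p == 0 for a, p in pairs)  # 1

precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
accuracy = (tp + tn) / len(actual)                  # 0.8
```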

Q19. What is Logistic Regression and when should you use it?

Despite its name, Logistic Regression is a classification algorithm, not a regression one. It models the probability that an input belongs to a particular class using the sigmoid (logistic) function, which maps any real number to a value between 0 and 1.

Use Logistic Regression when: you need a simple, interpretable baseline classifier; the relationship between features and log-odds is approximately linear; you need probability outputs (not just class labels); and you want fast training on large datasets. It is the go-to first model in most classification projects because it is easy to explain to stakeholders and provides a solid baseline against which to compare complex models.
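The link between the linear score, the sigmoid, and the log-odds can be sketched in a few lines. The intercept and coefficient here are hypothetical values chosen for illustration, not fitted to any data:

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical one-feature model: log-odds = intercept + coef * x
intercept, coef = -4.0, 0.08

def predict_proba(x):
    return sigmoid(intercept + coef * x)

def log_odds(p):
    return math.log(p / (1 - p))

p_at_50 = predict_proba(50)  # linear score is 0, so the probability is 0.5
p_at_75 = predict_proba(75)  # linear score is 2, probability about 0.88
```

Applying log_odds to a predicted probability recovers the linear score, which is what "linear in the log-odds" means.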

Q20. What is the ROC Curve and AUC?

The ROC (Receiver Operating Characteristic) Curve plots the True Positive Rate (Recall) against the False Positive Rate at various classification thresholds. It visualizes the trade-off between sensitivity and specificity.

AUC (Area Under the Curve) is the total area under the ROC curve, ranging from 0 to 1. AUC = 0.5 means the model is no better than random guessing, and a value below 0.5 means its ranking is worse than random. AUC = 1.0 means perfect classification. As a rough rule of thumb, 0.7-0.8 is considered acceptable, 0.8-0.9 excellent, and above 0.9 outstanding, though what counts as good depends on the problem.

AUC is threshold-independent — it measures overall model quality regardless of the specific decision boundary. This makes it especially useful when comparing models or when the optimal threshold is not yet determined.
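One way to see why AUC does not depend on a threshold is its ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The helper below, roc_auc, is a hypothetical name for a minimal pairwise implementation; it is quadratic in the number of examples and meant only for illustration:

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

For real workloads, Scikit-learn's roc_auc_score computes the same quantity efficiently.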

Section 6: Data Visualization & A/B Testing

Q21. What chart type would you use for different data scenarios?

Bar Chart: Comparing categories (sales by region, product counts). Use horizontal bars when labels are long.
Line Chart: Showing trends over time (daily revenue, temperature changes).
Scatter Plot: Showing relationships between two continuous variables (height vs. weight, ad spend vs. conversions).
Histogram: Showing distribution of a single numerical variable (age distribution, salary ranges).
Box Plot: Comparing distributions across groups and spotting outliers (salary by department).
Heatmap: Showing correlation matrices or frequency tables (feature correlations, user activity by hour).
Pie Chart: Use sparingly — only for simple part-to-whole comparisons with few categories (market share).

Pro tip: In 2026, interactive dashboards (Power BI, Tableau, Plotly Dash) are expected. Static matplotlib plots are for EDA; production reports use interactive tools.

Q22. What is Exploratory Data Analysis (EDA) and what steps do you follow?

EDA is the process of analyzing and visualizing data to understand its structure, quality, and patterns before building models. A systematic EDA workflow:

1. Shape & Schema: df.shape, df.dtypes, df.info() — understand dimensions, types, and memory.
2. Missing Values: df.isnull().sum() — assess data completeness.
3. Descriptive Stats: df.describe() — mean, std, min, max, quartiles for numerical columns.
4. Distribution Analysis: Histograms, box plots — check for skewness, outliers, normality.
5. Correlation Analysis: df.corr(numeric_only=True) + heatmap to find feature relationships and multicollinearity.
6. Categorical Analysis: Value counts, bar charts — check class balance.
7. Target Relationship: Visualize how each feature relates to the target variable.
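The non-plotting steps of this workflow can be sketched on a tiny invented frame. The column names, including the churned target, are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, 51, None],
    "income": [30, 42, 80, 95, 60],
    "segment": ["a", "b", "b", "a", "b"],
    "churned": [0, 0, 1, 1, 0],
})

shape = df.shape                                      # step 1: (5, 4)
missing = df.isnull().sum()                           # step 2: age has 1 missing value
summary = df.describe()                               # step 3: numeric columns only
corr = df.corr(numeric_only=True)                     # step 5: skips the text column
balance = df["segment"].value_counts()                # step 6: b appears 3 times, a twice
by_target = df.groupby("churned")["income"].mean()    # step 7: feature vs. target
```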

Q23. What is A/B Testing and how do you design one?

A/B Testing is a controlled experiment comparing two variants (A = control, B = treatment) to measure the impact of a change on a key metric.

Steps to design a proper A/B test:

1. Define Hypothesis: "Changing the CTA button color from blue to green will increase click-through rate."
2. Choose Metric: Primary metric (CTR), guardrail metrics (revenue, bounce rate).
3. Calculate Sample Size: Based on MDE (Minimum Detectable Effect), significance level (alpha=0.05), and power (0.8). Use power analysis.
4. Randomize: Randomly assign users to control/treatment groups. Ensure no selection bias.
5. Run Experiment: Let it run for sufficient duration (at least 1-2 full business cycles).
6. Analyze: Use t-test (continuous metric) or chi-squared test (proportions) to determine statistical significance.
7. Decide: If p-value < 0.05 and the effect is practically meaningful, ship the change.
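Steps 3 and 6 can be sketched with the standard library alone. The function names, the 10% baseline rate, and the conversion counts are hypothetical. The sample-size formula is the usual normal approximation for two proportions, and the two-proportion z-test used here is equivalent to the chi-squared test on a 2x2 table without continuity correction:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.8):
    """Approximate users per group to detect an absolute lift `mde` (two-sided test)."""
    p_new = p_base + mde
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return (z_alpha + z_power) ** 2 * variance / mde ** 2

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Detecting a lift from 10% to 12% needs roughly 3,800 users per group.
n_needed = sample_size_per_group(p_base=0.10, mde=0.02)

# Hypothetical outcome: 500/5000 conversions in control, 600/5000 in treatment.
z, p_value = two_proportion_z_test(conv_a=500, n_a=5000, conv_b=600, n_b=5000)
```

Halving the MDE roughly quadruples the required sample size, which is why the MDE should be agreed before the test starts.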

Q24. What are common pitfalls in A/B Testing?

Peeking: Checking results before the experiment reaches required sample size — inflates false positive rate. Use sequential testing methods if you must check early.
Simpson's Paradox: A trend that appears in aggregated data reverses when data is segmented by a confounding variable. Always segment by key dimensions.
Novelty Effect: Users interact more with something new simply because it is new — not because it is better. Run tests longer to let the effect stabilize.
Network Effects: In social products, treating users independently can be wrong — users in control and treatment groups may influence each other.
Multiple Testing: Testing many variants or metrics simultaneously without correction leads to false discoveries. Apply Bonferroni or Holm-Bonferroni correction.
Insufficient Power: Running tests with too few users to detect small but meaningful effects. Always do a power analysis upfront.

Q25. What is Feature Engineering and why is it important?

Feature engineering is the process of creating new input variables from raw data to improve model performance. It is often the single most impactful step in a Data Science project — the saying "garbage in, garbage out" applies directly.

Common techniques:

Date Features: Extract day of week, month, hour, is_weekend from timestamps.
Aggregations: Calculate rolling averages, cumulative sums, group-level statistics.
Encoding: One-hot encoding for low-cardinality categoricals; target encoding or embedding for high-cardinality.
Interaction Features: Multiply or combine related features (price × quantity = revenue).
Binning: Convert continuous variables into ranges (age groups, income brackets).
Text Features: TF-IDF, word count, sentiment scores from text columns.
Log Transform: Reduce skewness in highly skewed distributions.

In 2026, automated feature engineering tools (Featuretools, tsfresh) help accelerate this process, but domain expertise remains irreplaceable.
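Several of the techniques listed above take only a line each in Pandas. This sketch uses an invented transactions frame; the column names and bin edges are assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2026-01-03 09:00", "2026-01-05 18:30", "2026-01-06 23:15"]),
    "price": [10.0, 250.0, 4000.0],
    "quantity": [3, 1, 2],
    "channel": ["web", "app", "web"],
})

# Date features (Monday = 0, so 5 and 6 are the weekend).
df["day_of_week"] = df["ts"].dt.dayofweek
df["hour"] = df["ts"].dt.hour
df["is_weekend"] = df["day_of_week"] >= 5

# Interaction feature.
df["revenue"] = df["price"] * df["quantity"]

# Log transform to reduce skew; log1p also handles zeros safely.
df["log_price"] = np.log1p(df["price"])

# Binning into labelled ranges.
df["price_band"] = pd.cut(
    df["price"], bins=[0, 100, 1000, np.inf], labels=["low", "mid", "high"]
)

# One-hot encoding for a low-cardinality categorical.
df = pd.get_dummies(df, columns=["channel"])
```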