Walmart Sales Data Analysis

Overview
This project investigates how demographic factors — specifically gender, age group, and marital status — influence customer spending behavior using Walmart's Black Friday transactional dataset. The analysis applies statistical methods including Exploratory Data Analysis, bootstrapping, and confidence interval estimation to derive meaningful insights from a large, real-world retail dataset.
The central business question driving this analysis was whether spending habits differ significantly between male and female customers, and whether age or marital status plays a role in transaction value. The findings are intended to inform customer segmentation strategies and marketing decisions for large-scale retailers.
This project was completed as the final project for STAT 5000: Statistical Methods and Applications I at the University of Colorado Boulder.
Dataset
The dataset used is the Walmart Black Friday Sales dataset sourced from Kaggle, collected through Walmart's point-of-sale systems. It contains 550,068 transactional records across 10 columns including User_ID, Gender, Age, Occupation, Marital_Status, City_Category, Product_ID, Product_Category, and Purchase amount.
The dataset required preprocessing before analysis. Categorical columns including User_ID, Occupation, Marital_Status, and Product_Category were converted to object types for appropriate handling. The Marital_Status column was re-mapped from binary indicators to descriptive labels — 'Married' and 'Unmarried' — to improve interpretability.
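The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the three-row DataFrame is a hypothetical stand-in for the real CSV, and the mapping of 1 to 'Married' and 0 to 'Unmarried' is an assumption about how the raw binary indicator is coded.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; column names follow the description above.
df = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1000003],
    "Occupation": [10, 16, 15],
    "Marital_Status": [0, 1, 0],
    "Product_Category": [3, 1, 5],
    "Purchase": [8370, 15200, 1422],
})

# Assumption: in the raw data 1 means married and 0 means unmarried.
df["Marital_Status"] = df["Marital_Status"].map({0: "Unmarried", 1: "Married"})

# Integer-coded identifiers and categories are labels, not quantities,
# so they are cast to object to keep them out of numeric summaries.
categorical_cols = ["User_ID", "Occupation", "Marital_Status", "Product_Category"]
df = df.astype({col: object for col in categorical_cols})

print(df.dtypes)
```

Casting before any call to describe() or corr() keeps columns such as User_ID from being averaged as if they were measurements.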
An outlier analysis using the IQR method identified an upper bound of $21,400.50. Approximately 0.48% of transactions exceeded this threshold. These high-value records were retained in the dataset as they represent genuine purchasing behavior relevant to revenue modeling.
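As a rough sketch of the IQR rule, the upper bound is the third quartile plus a multiple of the interquartile range. The multiplier of 1.5 is the conventional choice and is assumed here, since the write-up does not state it; the purchase values below are made up purely for illustration.

```python
import numpy as np

def iqr_upper_bound(values, k=1.5):
    """Upper outlier fence: Q3 + k * IQR (k=1.5 is an assumed, conventional default)."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + k * (q3 - q1)

# Made-up purchase amounts, not taken from the dataset.
purchases = np.array([2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000, 60000], dtype=float)

upper = iqr_upper_bound(purchases)
n_above = int((purchases > upper).sum())
share_above = n_above / len(purchases)

print(f"upper bound: {upper:,.2f}")
print(f"share above bound: {share_above:.2%}")
```

Flagging values above the bound without dropping them mirrors the decision to retain high-value transactions.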
Research Questions
The analysis was structured around three primary research questions:
- Does gender play a statistically significant role in per-transaction purchase amounts at Walmart?
- Is there a meaningful correlation between customer age group and spending behavior?
- Can the average spending range for each demographic group be reliably estimated using bootstrapped confidence intervals?
Methodology
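The purchase variable exhibited strong positive skewness with a long right tail, violating the normality assumption required by many parametric tests. To address this, bootstrapping was applied: a resampling technique that approximates the sampling distribution of a statistic empirically by repeatedly drawing samples with replacement. For the sample mean, the Central Limit Theorem (CLT) implies that this sampling distribution is approximately normal at large sample sizes even when the underlying population is skewed, provided the population variance is finite.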
For each demographic group, 10,000 bootstrap iterations were performed at sample sizes of n=1,000, n=2,500, and n=5,000. For each iteration, a random sample with replacement was drawn and the sample mean computed. The resulting distribution of means was used to calculate 90%, 95%, and 99% confidence intervals.
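The resampling loop described above might look something like the sketch below. It assumes percentile-based intervals, which the write-up does not confirm, and it runs on a synthetic right-skewed population rather than the real purchase data. The function names bootstrap_means and percentile_ci are illustrative only.

```python
import numpy as np

def bootstrap_means(values, sample_size, n_iterations=10_000, seed=0):
    """Return the means of n_iterations resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.empty(n_iterations)
    for i in range(n_iterations):
        sample = rng.choice(values, size=sample_size, replace=True)
        means[i] = sample.mean()
    return means

def percentile_ci(means, level):
    """Percentile interval (an assumed method): cut equal tails off the bootstrap means."""
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(means, [tail, 100.0 - tail])
    return float(lower), float(upper)

# Synthetic right-skewed stand-in for one demographic group's purchases.
population = np.random.default_rng(42).gamma(shape=2.0, scale=4500.0, size=50_000)

means = bootstrap_means(population, sample_size=1_000)
intervals = {level: percentile_ci(means, level) for level in (0.90, 0.95, 0.99)}
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: ({lower:,.0f}, {upper:,.0f})")
```

Repeating the call with sample_size set to 2,500 and 5,000 would give the narrower intervals expected at larger sample sizes.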
The CLT was checked empirically: as sample size increased from 1,000 to 5,000, the bootstrapped distributions of the mean became progressively more symmetric and bell-shaped, and the standard error decreased, as expected given that the standard error of the mean scales with 1/sqrt(n). This behavior supports the use of normal-theory inference for the group means in this dataset.
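One way to check the shrinking standard error is to compare the spread of the bootstrapped means with the theoretical value sigma / sqrt(n) at each sample size. The sketch below does this on a synthetic skewed population (an assumption, not the real data) and uses 2,000 iterations instead of 10,000 to keep the runtime short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed population standing in for the purchase amounts.
population = rng.gamma(shape=2.0, scale=4500.0, size=50_000)
sigma = population.std()

n_iterations = 2_000  # reduced from the 10,000 used in the analysis
results = {}
for n in (1_000, 2_500, 5_000):
    # Each resample is drawn with replacement (the default for Generator.choice).
    means = np.array([rng.choice(population, size=n).mean() for _ in range(n_iterations)])
    observed_se = means.std(ddof=1)
    theoretical_se = sigma / np.sqrt(n)
    results[n] = (observed_se, theoretical_se)
    print(f"n={n}: observed SE {observed_se:.1f}, sigma/sqrt(n) {theoretical_se:.1f}")
```

If the two columns track each other closely, the bootstrap distribution is behaving as the CLT predicts for the mean.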
Key Findings
The analysis produced statistically clear results across the three demographic dimensions studied.
- Gender and spending: Male customers account for 75.3% of transactions. The 95% confidence interval for male spending ($9,122–$9,757) does not overlap with the female interval ($8,438–$9,030). This non-overlap provides strong statistical evidence that male customers spend significantly more per transaction on average.
- Marital status and spending: The 95% confidence intervals for married ($8,946–$9,568) and unmarried ($8,952–$9,573) customers overlap almost entirely. There is no statistically significant difference in average transaction value between these groups.
- Age group and spending: The 26–35 age group exhibited the highest average spending, with confidence intervals slightly above those of younger and older cohorts, while the 0–17 group showed the lowest. Because many age groups have overlapping intervals, these differences are suggestive rather than statistically conclusive; the trend indicates that spending peaks in early to mid-adulthood.
- Product preferences: Female customers predominantly favor product category 5, while male customers prefer category 1. Both married and unmarried customers share a preference for product category 5.
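The interval comparisons behind the gender and marital status findings can be reproduced with a small overlap check, using the 95% intervals reported above. Non-overlap of two intervals is a conservative signal of a difference, whereas overlap alone does not prove the means are equal, so this sketch is a quick screen rather than a formal test. The helper names are illustrative.

```python
def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def overlap_length(a, b):
    """Length of the shared region, or 0 if the intervals are disjoint."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

# 95% confidence intervals as reported in the findings (USD per transaction).
male, female = (9122, 9757), (8438, 9030)
married, unmarried = (8946, 9568), (8952, 9573)

gender_overlap = intervals_overlap(male, female)
gender_gap = male[0] - female[1]
marital_overlap = intervals_overlap(married, unmarried)
marital_shared = overlap_length(married, unmarried)

print(f"gender intervals overlap: {gender_overlap} (gap of ${gender_gap})")
print(f"marital intervals overlap: {marital_overlap} (shared width of ${marital_shared})")
```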
Technical Stack
The entire analysis was implemented in Python within a Jupyter Notebook environment on Kaggle. The notebook ran on the platform's GPU-backed compute, although the NumPy-based bootstrap (10,000 iterations across multiple demographic groups) executes on the CPU.
Core libraries used:
- NumPy: numerical operations and bootstrapping logic
- Pandas: data loading and preprocessing
- Matplotlib and Seaborn: all visualizations, including histograms, KDE plots, boxplots, and bar charts
- SciPy: statistical utility functions
- Plotly: interactive chart exploration
No machine learning models were used. The analytical approach was deliberately grounded in classical statistical inference — bootstrapping and confidence intervals — to produce interpretable and defensible results from a skewed population distribution.
Conclusions & Recommendations
Gender emerged as the strongest demographic predictor of average transaction value in this dataset. The non-overlapping confidence intervals provide robust statistical support for this conclusion. Retailers can use this insight to tailor marketing campaigns, product placement, and promotional strategies by gender segment.
Marital status, by contrast, showed no meaningful influence on spending. The near-identical confidence intervals for married and unmarried groups suggest that segmenting customers by marital status would not yield differentiated marketing outcomes and that resources are better spent on other segmentation dimensions.
The bootstrapping approach proved critical to the validity of the analysis. Because the purchase distribution is heavily skewed, inference was based on the bootstrapped sampling distribution of the mean, which the CLT indicates is approximately normal at the sample sizes used, rather than on an assumption that individual purchases are normally distributed. The results are supported by consistent behavior across multiple sample sizes.
Future Work
Several directions could extend this analysis meaningfully.
Interaction effects between demographic variables were not explored in this study. A regression model incorporating gender, age, marital status, and their interactions could quantify how these factors jointly influence transaction value and identify any compounding effects.
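As a rough idea of what such a model could look like, the sketch below fits an ordinary least squares regression with a gender-by-marital-status interaction term. Everything in it is hypothetical: the data are synthetic, the coefficients are arbitrary numbers chosen for illustration, and plain NumPy least squares stands in for whatever modeling library a real extension would use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Synthetic binary indicators (hypothetical, not drawn from the dataset).
male = rng.integers(0, 2, size=n)
married = rng.integers(0, 2, size=n)

# Design matrix: intercept, main effects, and the interaction term.
X = np.column_stack([np.ones(n), male, married, male * married])

# Arbitrary illustrative coefficients: intercept, male, married, male x married.
true_beta = np.array([100.0, 10.0, 0.0, 5.0])
y = X @ true_beta + rng.normal(0.0, 20.0, size=n)

# Ordinary least squares fit.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, value in zip(["intercept", "male", "married", "male:married"], beta_hat):
    print(f"{name:>13}: {value:7.2f}")
```

A nonzero interaction coefficient would indicate that the effect of one factor depends on the level of the other, which is the compounding effect the paragraph above refers to.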
The dataset's product category dimension offers another avenue for deeper segmentation analysis. Understanding which product categories drive spending differences across demographics could yield more actionable inventory and merchandising insights.
Applying time-series analysis to understand whether seasonal patterns or external events like holidays amplify demographic spending differences would also be a valuable extension to this work.