Machine learning insights into Shopify product tag organization
The data comes from cantbuymelove.industrial-linguistics.com, where it powers Shopify taxonomy classification, and is filtered to taxonomies with at least five products.
Training data spans 6,108 products across 368 taxonomies. Of 24,460 total tags in the dataset, 9,365 were used (tags appearing fewer than 5 times were filtered out). 4,184 products were discarded due to missing or sparse taxonomy labels.
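The filtering described above can be sketched as follows. This is a minimal illustration, not the site's actual pipeline: the record layout (`taxonomy`, `tags` keys), the function name, and the order of the two filters are assumptions. Only the two thresholds of five come from the text.

```python
from collections import Counter

MIN_TAXONOMY_PRODUCTS = 5  # "taxonomies with at least five products"
MIN_TAG_COUNT = 5          # "tags appearing fewer than 5 times were filtered out"

def filter_dataset(products):
    """Apply both frequency filters to a list of product records.

    Each record is a dict with hypothetical keys 'taxonomy' (str or None)
    and 'tags' (list of str). Returns (kept_products, tag_vocabulary).
    """
    # Discard products with a missing taxonomy label.
    labelled = [p for p in products if p.get("taxonomy")]

    # Discard products whose taxonomy is too sparse.
    taxonomy_counts = Counter(p["taxonomy"] for p in labelled)
    kept = [p for p in labelled
            if taxonomy_counts[p["taxonomy"]] >= MIN_TAXONOMY_PRODUCTS]

    # Count each tag once per product, then keep only frequent tags.
    # Assumption: tags are counted after the taxonomy filter.
    tag_counts = Counter(tag for p in kept for tag in set(p["tags"]))
    vocabulary = {tag for tag, n in tag_counts.items() if n >= MIN_TAG_COUNT}

    filtered = [
        {"taxonomy": p["taxonomy"],
         "tags": [t for t in p["tags"] if t in vocabulary]}
        for p in kept
    ]
    return filtered, vocabulary
```

Counting tags before or after the taxonomy filter changes the resulting vocabulary slightly; the text does not say which order is used.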
Always predicts most common taxonomy (baseline for comparison)
P-adic coefficients assigned to tags to predict taxonomy
Stochastic p-adic optimization starting from UMLLR (arXiv:2503.23488)
Stochastic p-adic optimization starting from zeros (arXiv:2503.23488)
Mahler affine basis (degree 1) with UMLLR initialization
Mahler quadratic basis (degree 2) with UMLLR initialization
L1-regularized model using ALL tags
Unconstrained tree using ALL tags
L1-regularized NN with weight pruning
Neural network predicting taxonomy from tags
Logistic regression model predicting Shopify taxonomy from tags
Battle-tested tag hierarchy from product title positions
| Taxonomy ID | Name | Path | Samples | Share |
|---|---|---|---|---|
| gid://shopify/TaxonomyCategory/aa-1-13-8 | Apparel & Accessories > Clothing > Clothing Tops > T-Shirts | 1.1.13.8 | 304 | 5.0% |
| gid://shopify/TaxonomyCategory/fb-2-3-2 | Food, Beverages & Tobacco > Food Items > Candy & Chocolate > Chocolate | 9.2.3.2 | 249 | 4.1% |
| gid://shopify/TaxonomyCategory/aa-1-4 | Apparel & Accessories > Clothing > Dresses | 1.1.4 | 142 | 2.3% |
| gid://shopify/TaxonomyCategory/aa-6-8 | Apparel & Accessories > Jewelry > Necklaces | 1.6.8 | 142 | 2.3% |
| gid://shopify/TaxonomyCategory/ae-2-1 | Arts & Entertainment > Hobbies & Creative Arts > Arts & Crafts | 3.2.1 | 130 | 2.1% |
| gid://shopify/TaxonomyCategory/aa-6-6 | Apparel & Accessories > Jewelry > Earrings | 1.6.6 | 118 | 1.9% |
| gid://shopify/TaxonomyCategory/hg-9 | Home & Garden > Household Appliances | 14.9 | 105 | 1.7% |
| gid://shopify/TaxonomyCategory/ha-6-2-5 | Hardware > Hardware Accessories > Cabinet Hardware > Cabinet Knobs & Handles | 12.6.2.5 | 89 | 1.5% |
| gid://shopify/TaxonomyCategory/lb | Luggage & Bags | 15 | 81 | 1.3% |
| gid://shopify/TaxonomyCategory/ae-2-2 | Arts & Entertainment > Hobbies & Creative Arts > Collectibles | 3.2.2 | 79 | 1.3% |
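The Share column is consistent with Samples divided by the 6,108 training products. The sketch below recomputes it and derives what a majority-class baseline ("always predicts most common taxonomy") would score in plain accuracy on the training set. It assumes the table is sorted by sample count, so that its first row is the most common taxonomy overall. Accuracy is used here only for illustration; the dashboard itself reports p-adic loss.

```python
TOTAL_PRODUCTS = 6108  # training set size stated at the top of the page

# Taxonomy path -> sample count, copied from the table above.
samples = {
    "1.1.13.8": 304,
    "9.2.3.2": 249,
    "1.1.4": 142,
    "1.6.8": 142,
    "3.2.1": 130,
    "1.6.6": 118,
    "14.9": 105,
    "12.6.2.5": 89,
    "15": 81,
    "3.2.2": 79,
}

# Share in percent, rounded to one decimal as in the table.
shares = {path: round(100 * n / TOTAL_PRODUCTS, 1) for path, n in samples.items()}

# A majority-class baseline always predicts the most frequent taxonomy.
majority = max(samples, key=samples.get)
baseline_accuracy = samples[majority] / TOTAL_PRODUCTS

print(majority, shares[majority], f"{baseline_accuracy:.4f}")
```

With 368 taxonomies and the largest holding about 5% of products, any useful model has to clear a very low accuracy floor, which is why the baseline serves only as a reference point.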
| Tag | Top taxonomy | Weight | Max \|weight\| |
|---|---|---|---|
| FRAMED ARTWORK | 3.2.2 | 5.8150 | 5.8150 |
| BLUE | 14.11.10.4.3 | 5.4904 | 5.4904 |
| WOMENS | 1.8.7 | 5.4341 | 5.4341 |
| ACCESSORIES | 1.2.4 | 5.2483 | 5.2483 |
| GIFT | 14.15.1.9 | 5.0948 | 5.0948 |
| WHOLESALE | 14.11.10.7.9 | 5.0544 | 5.0544 |
| VEGAN | 13.3.5.2 | 5.0324 | 5.0324 |
| KIDS | 13.1.20 | 4.9532 | 4.9532 |
| NEW ARRIVALS | 13.3.2.8.4 | 4.8409 | 4.8409 |
| PLUS SIZE | 1.1.1.1.5 | 4.8031 | 4.8031 |
Tracking model performance and dataset growth over time. Lower p-adic loss indicates better predictions.
| Model | Slope (per product) | Intercept | R² | p-value |
|---|---|---|---|---|
| Importance-Optimised p-adic LR | 0.000012 | 0.2991 | 0.3005 | 8.13e-08 |
| PCLR | 0.000082 | 0.2835 | 0.7514 | 3.32e-26 |
| PCNN | 0.000082 | 0.2458 | 0.8471 | 8.97e-35 |
| ULR | 0.000009 | 0.1766 | 0.2355 | 2.00e-04 |
| UNN | 0.000027 | 0.0671 | 0.7473 | 1.50e-16 |
| Decision Tree | 0.000009 | 0.1401 | 0.2973 | 3.52e-05 |
| Zubarev (UMLLR) | 0.000020 | 0.3037 | 0.8125 | 4.01e-16 |
| Zubarev (zeros) | 0.000027 | 0.2936 | 0.8331 | 3.85e-17 |
| Zubarev (M1) | 0.000007 | 0.3744 | 0.3632 | 2.41e-05 |
| Zubarev (M2) | 0.000011 | 0.3547 | 0.5395 | 3.08e-08 |
| Dummy Baseline | -0.000073 | 1.0940 | 0.5802 | 4.63e-14 |
Based on current regression trends, we can extrapolate when Importance-Optimised p-adic LR will achieve better performance (lower p-adic loss) than other models as the dataset grows. The confidence intervals are calculated using bootstrap resampling (n=1000).
| Model | Crossover Point (products) | 95% Confidence Interval | Probability | Estimated Date |
|---|---|---|---|---|
| UNN (Unconstrained Neural Networks) | 14,983 | 11,806 - 21,237 (95% CI, σ=2,567) | >95% | 2026-07-09 (±uncertain, R²=0.997, growth=56.3 products/day) |
Statistical Notes: The crossover points are calculated by finding where the regression lines intersect. The 95% confidence intervals are derived from bootstrap resampling of the regression parameters. The probability estimates indicate the likelihood that the crossover will occur given the current trends. Date predictions are based on linear extrapolation of dataset growth and should be interpreted with caution.
| Model | Slope (per tag) | Intercept | R² | p-value |
|---|---|---|---|---|
| Importance-Optimised p-adic LR | 0.000014 | 0.2424 | 0.3331 | 1.12e-08 |
| PCLR | 0.000090 | -0.0697 | 0.7355 | 4.17e-25 |
| PCNN | 0.000090 | -0.1065 | 0.8234 | 3.11e-32 |
| ULR | 0.000009 | 0.1442 | 0.2658 | 6.61e-05 |
| UNN | 0.000027 | -0.0256 | 0.7661 | 2.15e-17 |
| Decision Tree | 0.000009 | 0.1069 | 0.3271 | 1.16e-05 |
| Zubarev (UMLLR) | 0.000021 | 0.2276 | 0.8448 | 8.96e-18 |
| Zubarev (zeros) | 0.000028 | 0.1928 | 0.8509 | 4.01e-18 |
| Zubarev (M1) | 0.000008 | 0.3465 | 0.3708 | 1.87e-05 |
| Zubarev (M2) | 0.000011 | 0.3161 | 0.5375 | 3.36e-08 |
| Dummy Baseline | -0.000078 | 1.3849 | 0.6068 | 5.22e-15 |
Based on current regression trends, we can extrapolate when Importance-Optimised p-adic LR will achieve better performance (lower p-adic loss) than other models as the dataset grows. The confidence intervals are calculated using bootstrap resampling (n=1000).
| Model | Crossover Point (tags) | 95% Confidence Interval | Probability | Estimated Date |
|---|---|---|---|---|
| UNN (Unconstrained Neural Networks) | 20,222 | 16,216 - 29,699 (95% CI, σ=3,730) | >95% | 2026-09-04 (±uncertain, R²=0.993, growth=50.4 tags/day) |
Regression: p-adic loss = slope × log₁₀(params) + intercept
| Line | Slope | Intercept | R² | p-value | Significant? | n |
|---|---|---|---|---|---|---|
| With Dummy | -0.0698 | 0.6474 | 0.2301 | 0.1354 | No | 11 |
| Without Dummy | -0.1223 | 0.8396 | 0.1616 | 0.2495 | No | 10 |
Regression: log₁₀(loss) = slope × log₁₀(params) + intercept
| Slope | Intercept | R² | p-value | Significant? | n |
|---|---|---|---|---|---|
| -0.1108 | -0.2041 | 0.9062 | 0.0125 | Yes | 5 |
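The two regression forms above make different predictions. The semi-log fit is linear in log₁₀(params), while the log-log fit is a power law: loss = 10^intercept × params^slope, which with the published coefficients is roughly 0.625 × params^(-0.1108). The sketch below evaluates both, using the "With Dummy" row for the semi-log form. The function names are illustrative, and the two fits use different sets of models (n=11 versus n=5), so their outputs are not directly comparable.

```python
import math

def loss_semilog(params, slope=-0.0698, intercept=0.6474):
    """Semi-log fit: loss = slope * log10(params) + intercept."""
    return slope * math.log10(params) + intercept

def loss_loglog(params, slope=-0.1108, intercept=-0.2041):
    """Log-log fit: log10(loss) = slope * log10(params) + intercept."""
    return 10 ** (slope * math.log10(params) + intercept)

# Power-law view of the log-log fit.
prefactor = 10 ** -0.2041       # multiplies params ** slope
per_decade = 10 ** -0.1108      # loss multiplier for each 10x in params

for n_params in (10**3, 10**4, 10**5):
    print(n_params,
          round(loss_semilog(n_params), 4),
          round(loss_loglog(n_params), 4))
```

Under the log-log fit, every tenfold increase in parameter count multiplies the predicted loss by about 0.77. With only five points behind that fit, the exponent should be treated as a rough estimate.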