Visualizing Stochastic Regularization for Entity Embeddings

Industry data often contains non-numeric features with many possible values: for example, zip codes, medical diagnosis codes, or preferred footwear brands. These high-cardinality categorical features contain useful information, but incorporating them into machine learning models is a bit of an art form.
I've been writing a series of blog posts on methods for these features. Last episode, I showed how perturbed training data (stochastic regularization) in neural network models can dramatically reduce overfitting and improve performance on unseen categorical codes [1].
In fact, model performance for unseen codes can approach that of known codes when hierarchical information is used with stochastic regularization!
Here, I use visualizations and SHAP values to "look under the hood" and gain some insights into how entity embeddings respond to stochastic regularization. The pictures are pretty, and it's cool to see plots shift as data is changed. Plus, the visualizations suggest model improvements and can identify groups that might be of interest to analysts.
NAICS Codes
I use the public "SBA Case" dataset [4–5] consisting of loans from the US Small Business Administration. The categorical feature of interest is NAICS (North American Industry Classification System). NAICS identifies industry types for the small businesses in the dataset [6].
NAICS codes are very commonly used in government reporting and in business-to-business commercial applications. Experts at US, Mexican, and Canadian government agencies organize establishments in a hierarchical manner. Table 1 illustrates the NAICS coding system, with examples:

Entity Embeddings
In neural network models, entity embeddings map categorical features to numeric vectors. This is commonly done for "high cardinality" categoricals, i.e. features that can take many distinct values.
For NAICS, such entity embeddings might condense the ~1,200 possible code values into a vector of (say) 8 floating point values. However, not all possible codes typically appear in the training data. This could be random (due to a small training dataset), reflect changes in the market, or simply be intrinsic to the coding system, since codes are updated or added periodically. Such unseen codes tend to be over-fit in entity embeddings, but in my previous blog post, I showed that randomly injecting dummy values during training (shuffled between mini-batches) greatly improved performance for unseen codes [1].
As shown in Table 1, NAICS codes have a hierarchical organization. Each level of the hierarchy could get its own entity embedding. Stochastic regularization is especially helpful in this scenario [1].
Methodology
Methodology details can be found in my previous posts [1–2], and all code is available on GitHub [7]. Briefly, I use neural network models to perform logistic regression, predicting loan defaults using numeric features and NAICS entity embeddings. For the entity embeddings, inputs are NAICS codes encoded using scikit-learn's OrdinalEncoder, with the value of 1 reserved for unseen codes.
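For concreteness, here is a minimal sketch of how that encoding could be set up with scikit-learn. The stand-in data, the offset scheme, and the helper name are my illustrative assumptions, not necessarily what the repo [7] does:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

UNSEEN_CODE = 1  # reserved integer index for NAICS values not seen in training

# Stand-in data; in practice these would be the SBA loan NAICS columns.
train_df = pd.DataFrame({"NAICS": ["111110", "236115", "722320"]})
test_df = pd.DataFrame({"NAICS": ["111110", "541330"]})  # 541330 unseen here

# Unknown codes at transform time get -1, which we remap to the reserved value.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train_df[["NAICS"]])
test_enc = encoder.transform(test_df[["NAICS"]])

def to_embedding_index(encoded):
    # Shift known codes up by 2 so that 0 stays free (e.g. for padding)
    # and 1 remains reserved for unseen codes.
    idx = encoded + 2
    idx[encoded == -1] = UNSEEN_CODE
    return idx.astype("int64")

print(to_embedding_index(test_enc).ravel())  # [2 1]: the unseen code maps to 1
```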
When I train models using stochastic regularization, I replace NAICS encodings with 1 at random, with different cases selected for randomization at each mini-batch. I reserve 10% of NAICS codes as unseen, to test how the model handles unseen values.
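The injection itself can be a simple per-batch mask. In this hedged sketch, the function name and the injection rate are illustrative assumptions:

```python
import numpy as np

def inject_unseen(naics_idx, rate=0.10, unseen_code=1, rng=None):
    """Randomly replace a fraction of NAICS indices with the reserved code.

    Intended to be called on every mini-batch, so a different subset of rows
    is perturbed each time. The 10% rate is an assumption for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.asarray(naics_idx).copy()
    mask = rng.random(len(out)) < rate
    out[mask] = unseen_code
    return out

# Example: perturb one batch of encoded NAICS values
batch = np.array([2, 5, 7, 9, 4, 3, 8, 6])
print(inject_unseen(batch, rate=0.25, rng=np.random.default_rng(0)))
```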
Visualizing NAICS Embeddings
Let's start with a simple neural network model that takes some numeric inputs, plus the lowest-level NAICS code. Figure 1 shows a diagram of this model, which contains one entity embedding layer. After training, the embeddings map each NAICS code to a point in 8-dimensional space.
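A rough Keras sketch of this single-embedding model is below; the vocabulary size, numeric-feature width, and hidden-layer size are illustrative assumptions, and the repo [7] may structure the network differently:

```python
import tensorflow as tf
from tensorflow.keras import layers

n_codes = 1300   # assumed vocabulary size, incl. reserved values
emb_dim = 8      # embedding dimension mentioned in the text

naics_in = layers.Input(shape=(1,), name="naics")
num_in = layers.Input(shape=(10,), name="numeric")   # assumed numeric width

# Map each NAICS index to a point in 8-dimensional space
emb = layers.Embedding(n_codes, emb_dim, name="naics_emb")(naics_in)
emb = layers.Flatten()(emb)

x = layers.Concatenate()([emb, num_in])
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)       # default probability

model = tf.keras.Model([naics_in, num_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```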

Since it's hard to visualize an 8-dimensional space, I use TSNE [8] to create a 2D visualization (Figure 2). TSNE plots emphasize neighbor relationships, as opposed to long-range effects. I am interested in stochastic regularization, so I compare a model trained with unmodified data with a model trained with random injection of unseen codes (Figure 2).
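Producing the 2D layout takes just a few lines with scikit-learn; here the trained embedding matrix is a random stand-in (in practice it would come from the "naics_emb" layer above), and the t-SNE settings are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-in for the trained (n_codes x 8) embedding matrix; in practice, e.g.
# emb_weights = model.get_layer("naics_emb").get_weights()[0]
emb_weights = np.random.default_rng(0).normal(size=(1300, 8))

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
emb_2d = tsne.fit_transform(emb_weights)

plt.scatter(emb_2d[:, 0], emb_2d[:, 1], s=8)
plt.title("t-SNE projection of NAICS embeddings")
plt.show()
```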

Looking at the embeddings in Figure 2A, they appear roughly 1-dimensional, forming an elongated blob. There is a clear progression from high to low risk along the primary axis. The large cluster of points on the far left corresponds to the holdout unseen NAICS codes. (These have identical embedding vector values, but TSNE shows them as close neighbors.)
In Figure 2B, the unseen codes moved! They cluster more in the middle along the horizontal axis. They are also separated from the known codes along the vertical axis. This behavior is consistent with the model performance data in my last blog post, where I saw that stochastic randomization reduced overfitting [1].
Being closer to the middle, or typical values, seems better when we are missing information. Plus, the separation of unknown vs. known codes in Figure 2B suggests that the model is treating unknowns differently than known codes with near-average default rates.
Visualizing NAICS Hierarchy Embeddings
As shown in Table 1, NAICS codes are organized hierarchically, with subsets of the 6-digit code being used to define general categories. I will use additional entity embedding layers to input information for many levels of the hierarchy.
To see combined effects from all levels I must change my model architecture a bit (Figure 3). For each NAICS, I input the base code (6 digits), industry group (4 digits), subsector (3 digits), and sector codes into separate entity embedding layers, which merge into one layer.

The difference between this model and a more usual neural network for numeric and categorical features is the extra hidden layer (outlined in light green in Figure 3), which combines inputs from levels of the code hierarchy.
The extra layer doesn't make much difference in model performance (see the table in the repo), but it enables visualizations. The weights of the extra hidden layer are now "embeddings" that reflect contributions from many levels of the code hierarchy.
Importantly, combining NAICS inputs enables obtaining embeddings for unseen codes when hierarchical information is available!
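Here is a hedged Keras sketch of that multi-input architecture; the vocabulary sizes, layer widths, and layer names are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

# One embedding per hierarchy level; vocabulary sizes are rough assumptions.
levels = [("naics6", 1300), ("naics4", 350), ("naics3", 110), ("sector", 30)]

inputs, embs = [], []
for name, vocab in levels:
    inp = layers.Input(shape=(1,), name=name)
    emb = layers.Flatten()(layers.Embedding(vocab, 8, name=f"{name}_emb")(inp))
    inputs.append(inp)
    embs.append(emb)

num_in = layers.Input(shape=(10,), name="numeric")   # assumed numeric width

# The extra hidden layer that merges all hierarchy levels into a combined
# NAICS representation ("embedding").
combined = layers.Dense(8, activation="relu", name="naics_combined")(
    layers.Concatenate()(embs))

x = layers.Dense(32, activation="relu")(layers.Concatenate()([combined, num_in]))
out = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs + [num_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```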
Now, let's visualize those hidden layer weights with and without random injection of unseen code values (Figure 4). Compared with Figure 2, points are spread a bit more 2-dimensionally. In addition, the region associated with higher default rates is flipped to the right side of the x-axis.

In Figure 4, differences due to random injection are not immediately apparent, except for the "holdout" codes which are outlined in black. It appears that for unmodified data, these tend to be more towards the upper right of the plots. With stochastic regularization, they appear more centered.
Let me look at the effects of missing code information another way. Starting with NAICS codes seen in training, I will drop information from portions of the hierarchy. Then, I can look at mappings for different amounts of information.
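One way to simulate that loss of information, assuming each batch is a dict of per-level index arrays matching the model inputs sketched earlier, is to blank out the lower levels with the reserved unseen value (the function and level names are my own):

```python
import numpy as np

UNSEEN_CODE = 1  # reserved index for unseen/blanked codes

def drop_hierarchy(batch, keep_levels=("sector",)):
    """Blank out NAICS hierarchy levels not listed in keep_levels.

    `batch` is assumed to be a dict of integer index arrays, one per level,
    matching the inputs of the multi-level model sketched above.
    """
    out = {k: np.asarray(v).copy() for k, v in batch.items()}
    for level in ("naics6", "naics4", "naics3", "sector"):
        if level in out and level not in keep_levels:
            out[level][:] = UNSEEN_CODE
    return out

# Example: keep only sector and subsector information
batch = {"naics6": np.array([5, 9]), "naics4": np.array([3, 4]),
         "naics3": np.array([2, 2]), "sector": np.array([7, 7])}
print(drop_hierarchy(batch, keep_levels=("sector", "naics3")))
```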

Figure 5 shows that without random injection, values start to drift towards the upper right of the plot as information is lost; entirely unseen codes are mapped to a blob at the far right (again, this is a single value that TSNE shows as close neighbors). With stochastic regularization, values shrink towards the center of the blob as information is lessened.
The data in Figure 5 can be quantified by measuring the distance between the embeddings and (A) the average or center of the embeddings, or (B) the original embedding values. Note that these calculations are done in the 8-dimensional embedding space, not the 2D TSNE representation of it.
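These distance calculations boil down to a few lines of NumPy. The sketch below assumes two aligned arrays: one of embeddings computed with full code information and one computed after part of the hierarchy has been blanked out:

```python
import numpy as np

def embedding_shift_metrics(emb_reduced, emb_full):
    """Mean distances in the 8-dimensional embedding space (not the t-SNE plane)."""
    center = emb_full.mean(axis=0)
    dist_to_center = np.linalg.norm(emb_reduced - center, axis=1)      # cf. Figure 6A
    dist_to_original = np.linalg.norm(emb_reduced - emb_full, axis=1)  # cf. Figure 6B
    return dist_to_center.mean(), dist_to_original.mean()

# Example with random stand-in embeddings
rng = np.random.default_rng(0)
emb_full = rng.normal(size=(1000, 8))
emb_reduced = emb_full + rng.normal(scale=0.5, size=(1000, 8))
print(embedding_shift_metrics(emb_reduced, emb_full))
```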

Without randomization, codes remain off-center, which is consistent with over-fitting (Figure 6A). Stochastic regularization results in codes moving towards the center as information is lost. Embeddings of codes with reduced information are also closer to their actual values when models are trained on randomized data (Figure 6B). As you would expect, the lower you get in the code hierarchy, the closer the mapping.
SHAP Feature Importances
I can learn more about the embeddings using Shapley (SHAP) values. Shapley values distribute the model score for an individual observation among all predictor features in the model. More influential features receive a greater share, and the values are additive. (For a detailed discussion of Shapley/SHAP, see [9, 10].) Here, I aggregate SHAP values to get a measure of feature importance [10], and focus mostly on NAICS features vs. other features (Figure 7).
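A sketch of that aggregation is below. It uses shap's model-agnostic KernelExplainer as a stand-in (the original analysis may use a different explainer), and the prediction wrapper and feature-name convention are assumptions:

```python
import numpy as np
import shap

def grouped_shap_importance(predict_fn, X_background, X_explain, feature_names):
    """Mean |SHAP| summed into NAICS vs. non-NAICS feature groups.

    predict_fn: callable mapping a 2D feature array to default probabilities,
    e.g. a thin wrapper around the Keras model sketched earlier.
    """
    explainer = shap.KernelExplainer(predict_fn, X_background)
    sv = explainer.shap_values(X_explain)
    sv = np.squeeze(np.asarray(sv))            # handle list/array return shapes
    mean_abs = np.abs(sv).mean(axis=0)         # per-feature importance
    is_naics = np.array([n.lower().startswith("naics") for n in feature_names])
    return {"NAICS": float(mean_abs[is_naics].sum()),
            "other": float(mean_abs[~is_naics].sum())}
```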

For holdout (unseen) codes, stochastic regularization reduces model reliance on NAICS. For the hierarchical model, some influence of NAICS remains as the model leverages higher-level hierarchical information. Random injection also seems to reduce SHAP values for non-NAICS features; some of this may reflect interactions between NAICS embeddings and other features.
Figure 7 confirms that random injection reduces overfitting, but what about the embeddings? I will focus now on the model trained with stochastic regularization that uses multiple levels of the NAICS hierarchy. Figure 8 shows the embedding visualization, colored by the mean absolute SHAP value of the NAICS vs. other features.

Figure 8A shows that the NAICS features are relatively unimportant along the center of the x-axis, but become important at the left and right. This is consistent with the overall trend towards low/high mean target rates in Figure 4B.
The left portion of codes is interesting to me: these codes have a high contribution from NAICS and a relatively low contribution from other features, suggesting that this group of codes is protective against default. This behavior is verified using k-means clustering and SHAP values (not shown; see the repo [7]).
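A quick version of that check might look like this sketch, which clusters the 8-dimensional embeddings with k-means and compares the mean NAICS SHAP contribution per cluster (both arrays here are random stand-ins):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb_weights = rng.normal(size=(1300, 8))      # stand-in combined embeddings
naics_shap = np.abs(rng.normal(size=1300))    # stand-in mean |SHAP| for NAICS

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(emb_weights)
for k in range(5):
    print(f"cluster {k}: mean NAICS SHAP = {naics_shap[labels == k].mean():.3f}")
```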
Which NAICS Reduce Risk?
The left portion of the embeddings (below x1 ~ -10) in Figure 8A looks to be a reduced-risk group. Which industries are these? Maybe this group could be fast-tracked for loan approval? Let's look at the share of codes in this group, by standard NAICS sector:
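The tally itself is a simple groupby. In this sketch, the code list and 2D coordinates stand in for the outputs of the earlier t-SNE step, and the x1 < -10 cutoff follows Figure 8A:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
codes = pd.Series([str(c) for c in rng.integers(111110, 813990, size=1300)])
emb_2d = rng.normal(scale=10, size=(len(codes), 2))   # stand-in t-SNE coordinates

emb_df = pd.DataFrame({"naics": codes, "x1": emb_2d[:, 0]})
emb_df["sector"] = emb_df["naics"].str[:2]            # 2-digit NAICS sector
low_risk = emb_df[emb_df["x1"] < -10]                 # left-hand region in Figure 8A
print(low_risk.groupby("sector").size().sort_values(ascending=False))
```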

I inspected individual codes in many of the groups, and saw some additional interesting things.
First, many low-risk codes come from older vintages. NAICS codes are updated every 5 years, starting with the original 1997 code set. It turns out that many of the lowest risk codes come from that first vintage!
This dataset consists of loans with known outcomes (paid in full vs. in default). Older loans are highly likely to have reached a "done" state, while newer loans in good standing are not included. This might lead older loans to appear to have lower default rates.
Older codes are common in construction and wholesale trade, which underwent substantial NAICS revisions in 2002 [11]. For both sectors, nearly all of their low-risk codes are older; if old codes are removed, the overall risk of these sectors is much higher. This is a potential problem for models that assess new loan risks using broad code groups (e.g. one-hot encoding at the sector level).
Depending on what a model is for (assessing risk for a current portfolio? Deciding whether to approve a new loan?), you might handle this in different ways. It might be better to filter the data to years where loans have reached completion, or to include current loans instead of focusing only on paid-in-full loans. Or, it might be helpful to include a loan tenure feature.
There are other potentially important patterns in the codes:
- In "Transportation and Warehousing", transportation is high risk and warehousing low, so this sector could be split to get better groups for loan defaults
- Similarly, the "waste management" subgroup seems to be lower risk within "Administrative and Support and Waste Management and Remediation Services."
- Caterers seem to be especially vulnerable within the already high risk "Accommodation and Food Services" sector
Some of these observations suggest alternative groupings of codes for analysis, or subgroups to examine in detail. It's possible to completely re-classify codes using embeddings, but on the other hand the standard groupings are accessible and familiar to humans. Some hybrid summarization with higher vs. lower buckets within sectors might bridge the gap.
Final Thoughts
I decided to write this series of blog posts because I've encountered categorical features across different jobs. My experience is that high-cardinality categoricals are very common in industry, but not emphasized in coursework, public datasets, and suchlike.
The original paper containing the SBA loans dataset [4] suggested a class assignment, involving data exploration and building a logistic regression model for a single industry. If I were a teacher, I might extend the assignment to all industries using Machine Learning techniques (specifically XGBoost and neural networks). Such a course would be very practical and involve a mix of basic and advanced skills likely to be useful in work. My series of blog posts could be a rough outline – I'm learning a lot, at least!
My last couple posts have focused on stochastic data randomization in entity embeddings, which can greatly improve model performance of neural network models! In the future, I want to explore other ways to group codes, and see if an analogous stochastic randomization might help XGBoost models. Thanks for reading!
References
[1] V. Carey, Data Disruptions to Elevate Entity Embeddings (2024), Towards Data Science.
[2] V. Carey, No Label Left Behind: Alternative Encodings for Hierarchical Categoricals (2024), Towards Data Science.
[3] V. Carey, Exploring Hierarchical Blending in Target Encoding (2024), Towards Data Science.
[4] M. Li, A. Mickel and S. Taylor, Should This Loan be Approved or Denied?: A Large Dataset with Class Assignment Guidelines (2018), Journal of Statistics Education 26 (1). (CC BY 4.0)
[5] M. Toktogaraev, Should This Loan be Approved or Denied? (2020), Kaggle. (CC BY-SA 4.0)
[6] United States Census, North American Industry Classification System.
[7] V. Carey, GitHub Repository, https://github.com/vla6/Blog_naics_nn
[8] L. van der Maaten and G. Hinton, Visualizing Data using t-SNE (2008), Journal of Machine Learning Research 9: 2579–2605.
[9] C. M. Wilde, One Feature Attribution Method to (Supposedly) Rule Them All: Shapley Values (2018), Towards Data Science.
[10] S. Lundberg, Interpretable Machine Learning with XGBoost (2018), Towards Data Science.
[11] Bureau of the Census, Bridge Between 2002 NAICS and 1997 NAICS: 2002 (2002) 2002 Economic Census, Core Business Statistics Series, EC02–00C-BDG.