Visualizing Stochastic Regularization for Entity Embeddings

Industry data often contains non-numeric features with many possible values, for example zip codes, medical diagnosis codes, or preferred footwear brands. These high-cardinality categorical features contain useful information, but incorporating them into machine learning models is a bit of an art form.

I've been writing a series of blog posts on methods for these features. Last episode, I showed how perturbed training data (stochastic regularization) in neural network models can dramatically reduce overfitting and improve performance on unseen categorical codes [1].

In fact, model performance for unseen codes can approach that of known codes when hierarchical information is used with stochastic regularization!

Here, I use visualizations and SHAP values to "look under the hood" and gain some insights into how entity embeddings respond to stochastic regularization. The pictures are pretty, and it's cool to see plots shift as data is changed. Plus, the visualizations suggest model improvements and can identify groups that might be of interest to analysts.

NAICS Codes

I use the public "SBA Case" dataset [4–5] consisting of loans from the US Small Business Administration. The categorical feature of interest is NAICS (North American Industry Classification System). NAICS identifies industry types for the small businesses in the dataset [6].

NAICS codes are very commonly used in government reporting and in business-to-business commercial applications. Experts at US, Mexican, and Canadian government agencies maintain the system, which organizes establishments in a hierarchical manner. Table 1 illustrates the NAICS coding system, with examples:

Table 1. Examples of NAICS codes illustrating their hierarchical structure. The low-level codes are bucketed into more general groups. The lowest-level, 6-digit National Industry code can take ~1,200 values, but there are only 21 codes at the highest (sector) level. Table by author and copied from my previous blog post [2].

Entity Embeddings

In neural network models, entity embeddings map categorical features to numeric vectors. This is commonly done for "high cardinality" categoricals, i.e. features that can take many distinct values.

For NAICS, such entity embeddings might condense the ~1,200 possible code values into a vector of (say) 8 floating point values. However, not all possible codes typically appear in the training data. Missing codes could be a random consequence of a small training dataset, could reflect changes in the market, or could simply be intrinsic to the coding system, since codes are updated or added periodically. Such unseen codes tend to be over-fit in entity embeddings, but in my previous blog post, I showed that randomly injecting dummy values during training (shuffled between mini-batches) greatly improved performance for unseen codes [1].

As shown in Table 1, NAICS codes have a hierarchical organization. Each level of the hierarchy could get its own entity embedding. Stochastic randomization is especially helpful in this scenario [1].

Methodology

Methodology details can be found in my previous posts [1–2], and all code is available on GitHub [7]. Briefly, I use neural network models to perform logistic regression, predicting loan defaults using numeric features and NAICS entity embeddings. For the entity embeddings, inputs are NAICS codes encoded using scikit-learn's OrdinalEncoder, with the value of 1 reserved for unseen codes.
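As a rough sketch of this encoding step (the column names and offsets here are my own assumptions for illustration; the exact implementation is in the repo [7]), the unseen-code sentinel can be set up like this:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data with a 6-digit NAICS code column
train = pd.DataFrame({"NAICS": ["722511", "238220", "541110", "722511"]})

# Unknown categories map to -1; shifting by +2 reserves 1 for unseen codes
# and starts known codes at 2 (offsets are illustrative)
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train["NAICS_enc"] = encoder.fit_transform(train[["NAICS"]]).ravel().astype(int) + 2

# At scoring time, a code never seen in training (e.g. "999999") becomes 1
test = pd.DataFrame({"NAICS": ["722511", "999999"]})
test["NAICS_enc"] = encoder.transform(test[["NAICS"]]).ravel().astype(int) + 2
print(test)
```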

When I train models using stochastic regularization, I replace NAICS encodings with 1 at random, with different cases selected for randomization at each mini-batch. I reserve 10% of NAICS codes as unseen, to test how the model handles unseen values.
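A minimal sketch of what this per-batch injection could look like (the injection rate and array names are assumptions for illustration, not the settings used in the repo [7]):

```python
import numpy as np

UNSEEN_CODE = 1        # encoding reserved for unseen NAICS values
INJECTION_RATE = 0.10  # assumed rate, for illustration only

rng = np.random.default_rng(42)

def inject_unseen(naics_batch, rate=INJECTION_RATE):
    """Overwrite a random subset of NAICS encodings with the 'unseen'
    sentinel. Called on every mini-batch, so different rows are
    perturbed each time the data is seen."""
    mask = rng.random(len(naics_batch)) < rate
    perturbed = naics_batch.copy()
    perturbed[mask] = UNSEEN_CODE
    return perturbed

# Example mini-batch of ordinal-encoded NAICS values
batch = np.array([17, 203, 55, 991, 17, 402, 8, 55])
print(inject_unseen(batch))
```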

Visualizing NAICS Embeddings

Let's start with a simple neural network model that takes some numeric inputs, plus the lowest-level NAICS code. Figure 1 shows a diagram of this model, which contains one entity embedding layer. After training, the embeddings map each NAICS code to a point in 8-dimensional space.

Figure 1. Simplified diagram of neural network model involving base NAICS only (no hierarchy). The 8-dimensional entity embedding layer is shown in green. Numeric features and entity embeddings are concatenated and input into hidden layers (tanh activations), with sigmoid activation at the output. (A more detailed diagram including dropout, flatten, etc. layers is available here). Image by author.
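For readers who want a concrete picture, here is a minimal Keras-style sketch of an architecture like Figure 1. The layer sizes, number of numeric features, and omission of dropout are my simplifications, not the exact model from the repo [7]:

```python
from tensorflow.keras import layers, Model

N_NAICS_CODES = 1300   # assumed vocabulary size (codes + reserved values)
EMBED_DIM = 8
N_NUMERIC = 10         # illustrative number of numeric features

naics_in = layers.Input(shape=(1,), name="naics")
numeric_in = layers.Input(shape=(N_NUMERIC,), name="numeric")

# 8-dimensional entity embedding for the base NAICS code
emb = layers.Embedding(input_dim=N_NAICS_CODES, output_dim=EMBED_DIM)(naics_in)
emb = layers.Flatten()(emb)

# Concatenate numeric features with the embedding, then hidden layers (tanh)
x = layers.Concatenate()([numeric_in, emb])
x = layers.Dense(32, activation="tanh")(x)
x = layers.Dense(16, activation="tanh")(x)
out = layers.Dense(1, activation="sigmoid")(x)

model = Model(inputs=[naics_in, numeric_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```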

Since it's hard to visualize an 8-dimensional space, I use TSNE [8] to create a 2D visualization (Figure 2). TSNE plots emphasize neighbor relationships, as opposed to long-range effects. I am interested in stochastic regularization, so I compare a model trained with unmodified data with a model trained with random injection of unseen codes (Figure 2).

Figure 2. TSNE visualization of NAICS embeddings for models with the structure shown in Figure 1. Points are shaded by the mean target rate for the NAICS codes (only industries with at least 100 loans are plotted). Holdout NAICS, which are unseen during training, are outlined in black. (A) Embeddings from a model trained on unmodified data. (B) Embeddings from a model trained on data with random injection of unseen codes. Image by author
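The projection itself is straightforward. Continuing from the Keras sketch above, something like the following would pull the trained embedding matrix and map it to 2D (the perplexity and other TSNE settings here are library defaults, not tuned values from the post):

```python
from sklearn.manifold import TSNE
from tensorflow.keras import layers

# Grab the trained embedding matrix: one 8-dimensional row per NAICS encoding
emb_layer = next(l for l in model.layers if isinstance(l, layers.Embedding))
weights = emb_layer.get_weights()[0]        # shape: (N_NAICS_CODES, EMBED_DIM)

# t-SNE emphasizes neighbor relationships when projecting to 2D
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(weights)
print(coords.shape)                          # (N_NAICS_CODES, 2)
```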

The embeddings in Figure 2A appear sort of 1-dimensional, forming an elongated blob. There is a clear progression from high to low risk along the primary axis. The large cluster of points on the far left corresponds to the holdout unseen NAICS codes. (These have identical embedding vector values, but TSNE shows them as close neighbors.)

In Figure 2B, the unseen codes moved! They cluster more in the middle along the horizontal axis. They are also separated from the known codes along the vertical axis. This behavior is consistent with the model performance data in my last blog post, where I saw that stochastic randomization reduced overfitting [1].

Being closer to the middle, or typical values, seems better when we are missing information. Plus, the separation of unknown vs. known codes in Figure 2B suggests that the model is treating unknowns differently than known codes with near-average default rates.

Visualizing NAICS Hierarchy Embeddings

As shown in Table 1, NAICS codes are organized hierarchically, with subsets of the 6-digit code being used to define general categories. I will use additional entity embedding layers to input information for many levels of the hierarchy.
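Because the higher levels are just prefixes of the 6-digit code, they are easy to derive. A quick pandas sketch (keeping in mind that a few official sectors span digit ranges, e.g. 31–33 for Manufacturing, so the 2-digit prefix is an approximation of the sector grouping):

```python
import pandas as pd

df = pd.DataFrame({"NAICS": ["722511", "238220", "541110"]})

# Higher hierarchy levels are prefixes of the 6-digit code
df["industry_group"] = df["NAICS"].str[:4]
df["subsector"] = df["NAICS"].str[:3]
df["sector"] = df["NAICS"].str[:2]   # e.g. "72" = Accommodation and Food Services
print(df)
```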

To see combined effects from all levels I must change my model architecture a bit (Figure 3). For each NAICS, I input the base code (6 digits), industry group (4 digits), subsector (3 digits), and sector codes into separate entity embedding layers, which merge into one layer.

Figure 3. Simplified diagram of neural network model incorporating multiple levels of the NAICS hierarchy. Entity embeddings are concatenated and input into an 8-dimensional hidden layer (tanh activation). Outputs from this hidden layer are concatenated with numeric inputs and then sent to more hidden layers (tanh activations), with sigmoid activation at the output. ( A more detailed diagram including dropout, flatten, etc. layers is available here). Image by author.
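As before, a simplified Keras-style sketch of this hierarchy model may help. The vocabulary sizes, hidden-layer widths, and layer names are illustrative assumptions rather than the exact configuration in the repo [7]:

```python
from tensorflow.keras import layers, Model

# Assumed vocabulary sizes for each hierarchy level (illustrative)
VOCABS = {"naics6": 1300, "industry_group": 350, "subsector": 110, "sector": 25}
EMBED_DIM = 8
N_NUMERIC = 10

cat_inputs, cat_embs = [], []
for name, vocab in VOCABS.items():
    inp = layers.Input(shape=(1,), name=name)
    emb = layers.Flatten()(layers.Embedding(vocab, EMBED_DIM)(inp))
    cat_inputs.append(inp)
    cat_embs.append(emb)

# Extra hidden layer that merges all hierarchy levels into one
# 8-dimensional representation per NAICS code
naics_combined = layers.Dense(8, activation="tanh", name="naics_combined")(
    layers.Concatenate()(cat_embs)
)

numeric_in = layers.Input(shape=(N_NUMERIC,), name="numeric")
x = layers.Concatenate()([numeric_in, naics_combined])
x = layers.Dense(32, activation="tanh")(x)
out = layers.Dense(1, activation="sigmoid")(x)

model_hier = Model(inputs=cat_inputs + [numeric_in], outputs=out)
model_hier.compile(optimizer="adam", loss="binary_crossentropy")
```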

The difference between this model and a more usual neural network for numeric and categorical features is the extra hidden layer (outlined in light green in Figure 3), which combines inputs from levels of the code hierarchy.

The extra layer doesn't make much difference in model performance (see the table in the repo), but it enables visualizations. The weights of the extra hidden layer are now "embeddings" that reflect contributions from many levels of the code hierarchy.

Importantly, combining NAICS inputs enables obtaining embeddings for unseen codes when hierarchical information is available!

Now, let's visualize those hidden layer weights with and without random injection of unseen code values (Figure 4). Compared with Figure 2, the points are spread a bit more 2-dimensionally. In addition, the region associated with higher default rates is flipped to the right side of the x-axis.

Figure 4. TSNE visualization of NAICS embeddings for models with the structure shown in Figure 3 (NAICS hierarchy, with extra hidden layer). Weights in the extra hidden layer are visualized. Points are shaded by the mean target rate for the NAICS codes (only industries with at least 100 loans are plotted). Holdout NAICS, which are unseen during training, are outlined in black. (A) Embeddings from a model trained on unmodified data. (B) Embeddings from a model trained on data with random injection of unseen codes. Image by author

In Figure 4, differences due to random injection are not immediately apparent, except for the "holdout" codes which are outlined in black. It appears that for unmodified data, these tend to be more towards the upper right of the plots. With stochastic regularization, they appear more centered.

Let me look at the effects of missing code information another way. Starting with NAICS codes seen in training, I will drop information from portions of the hierarchy. Then, I can look at the mappings for different amounts of information.

Figure 5. TSNE visualizations of the model that uses the NAICS hierarchy (Figure 3), with inputs modified to use only portions of the hierarchy. 10% of points, shown as white circles, are randomly selected for modification. The original embeddings are at the far left. Each subsequent column sets one additional level to unseen for the selected codes, until the far right, where inputs are set to 1 (unseen) for all levels of the hierarchy. The first row of plots is from the model trained with unmodified data, and the second row shows the model trained with random injection of unseen codes.
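Here is a sketch of how the progressive degradation might be set up, assuming each hierarchy level is stored as its own integer-encoded array and that levels are dropped from most specific to broadest (that ordering is my reading of Figure 5, not stated explicitly):

```python
import numpy as np

UNSEEN = 1
# Hierarchy levels from broadest to most specific (names are assumptions)
LEVELS = ["sector", "subsector", "industry_group", "naics6"]

def drop_levels(encoded, selected, n_dropped):
    """Set the n most specific hierarchy levels to the unseen sentinel
    for the pre-selected rows, leaving all other rows untouched."""
    modified = {level: arr.copy() for level, arr in encoded.items()}
    for level in LEVELS[len(LEVELS) - n_dropped:]:
        modified[level][selected] = UNSEEN
    return modified

# Example: select 10% of codes once, then degrade them one level at a time
rng = np.random.default_rng(7)
encoded = {level: rng.integers(2, 50, size=20) for level in LEVELS}
selected = rng.random(20) < 0.10
degraded_versions = [drop_levels(encoded, selected, n) for n in range(len(LEVELS) + 1)]
```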

Figure 5 shows that without random injection, values start to drift towards the upper right of the plot as information is lost; entirely unseen codes are mapped to a blob at the far right (again, this is a single value that TSNE shows as close neighbors). With stochastic regularization, values shrink towards the center of the blob as information is lessened.

The data in Figure 5 can be quantified by measuring the distance between the embeddings and (A) the average or center of the embeddings, or (B) the original embedding values. Note that these calculations are done in the 8-dimensional embedding space, not the 2D TSNE representation of it.

Figure 6. Quantification of trends in Figure 5. (A) Distances in embedding space between codes and the center of the original embeddings, vs. the amount of NAICS information. (B) Distances to the actual (full-information) embeddings, as information is dropped.
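These distances can be computed directly in the full embedding space. A small sketch with stand-in arrays (the function names and data below are hypothetical, for illustration only):

```python
import numpy as np

def distance_to_center(embeddings, reference):
    """Euclidean distance from each embedding to the center (mean) of a
    reference set, computed in the full 8-dimensional embedding space."""
    center = reference.mean(axis=0)
    return np.linalg.norm(embeddings - center, axis=1)

def distance_to_original(embeddings, originals):
    """Euclidean distance between each reduced-information embedding and
    the corresponding full-information embedding."""
    return np.linalg.norm(embeddings - originals, axis=1)

# Stand-ins for real embedding matrices (100 codes, 8 dimensions)
rng = np.random.default_rng(0)
original = rng.normal(size=(100, 8))                        # full information
reduced = original + rng.normal(scale=0.3, size=(100, 8))   # degraded inputs

print(distance_to_center(reduced, original).mean())
print(distance_to_original(reduced, original).mean())
```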

Without randomization, codes remain off-center, which is consistent with over-fitting (Figure 6A). Stochastic regularization results in codes moving towards the center as information is lost. Embeddings of codes with reduced information are also closer to their actual values when models are trained on randomized data (Figure 6B). As you would expect, the more levels of the hierarchy that are retained, the closer the mapping.

SHAP Feature Importances

I can learn more about the embeddings using Shapley (SHAP) values. Shapley values distribute a model's score for each individual observation among all predictor features in the model. More influential features receive a greater share, and the values are additive. (For a detailed discussion of Shapley/SHAP, see [9, 10].) Here, I aggregate SHAP values to get a measure of feature importance [10], focusing mostly on NAICS features vs. other features (Figure 7).

Figure 7. Aggregated SHAP values for models using entity embeddings for (top row) NAICS, or (bottom row) NAICS plus its hierarchy. Blue bars show results for models trained on unmodified data, while orange bars show models trained with random injection of the unseen code. The left column of plots shows standard test cases, and the right column shows unseen NAICS cases. For each observation, SHAP values for all NAICS features are added, and the mean absolute value is computed (I use a random sample of 5,000 loans). This is compared to the mean absolute SHAP value for all non-NAICS features.
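The aggregation step is simple once SHAP values are in hand. A sketch assuming a (samples x features) array of SHAP values, with illustrative feature names (the real feature list lives in the repo [7]):

```python
import numpy as np

# Assumed feature ordering; NAICS-related columns vs. everything else
feature_names = ["naics6", "industry_group", "subsector", "sector",
                 "term", "loan_amount", "new_business", "urban_flag"]
naics_idx = [i for i, f in enumerate(feature_names)
             if f in {"naics6", "industry_group", "subsector", "sector"}]
other_idx = [i for i in range(len(feature_names)) if i not in naics_idx]

# Stand-in for explainer output: one SHAP value per observation and feature
rng = np.random.default_rng(1)
shap_values = rng.normal(size=(5000, len(feature_names)))

# Per observation, sum SHAP over all NAICS features, then average the |.|;
# compare against the mean absolute SHAP value of the non-NAICS features
naics_importance = np.abs(shap_values[:, naics_idx].sum(axis=1)).mean()
other_importance = np.abs(shap_values[:, other_idx]).mean()
print(f"NAICS: {naics_importance:.3f}   non-NAICS: {other_importance:.3f}")
```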

For holdout (unseen) codes, stochastic regularization reduces model reliance on NAICS. For the hierarchical model, some influence of NAICS remains as the model leverages higher-level hierarchical information. Random injection also seems to reduce SHAP values for non-NAICS features; some of this may reflect interactions between the NAICS embeddings and other features.

Figure 7 confirms that random injection reduces overfitting, but what about the embeddings? I will now focus on the model that uses multiple levels of the NAICS hierarchy and is trained with stochastic regularization. Figure 8 shows the embedding visualization, colored by the mean absolute SHAP value of the NAICS vs. other features.

Figure 8. TSNE visualization of mean SHAP values by NAICS for the NAICS hierarchy + random injection model. (A) Mean absolute value of SHAP values for NAICS features, vs. (B) for other features.

Figure 8A shows that the NAICS features are relatively unimportant along the center of the x-axis, but become important at the left and right. This is consistent with the overall trend towards low/high mean target rates in Figure 4B.

The left portion of codes is interesting to me – these codes have a high contribution from NAICS and a relatively low contribution from other features, suggesting that this group of codes is protective against default. This behavior is verified using k-means clustering and SHAP values (not shown; see the repository [7]).

Which NAICS Reduce Risk?

The left portion of the embeddings (below x1 ~ -10) in Figure 8A looks to be a reduced-risk group. Which industries are these? Maybe this group could be fast-tracked for loan approval? Let's look at the fraction of codes in this group, by standard NAICS sector:

Figure 9. Fraction of codes (measured by loan count) in the reduced-risk portion of embedding space, by NAICS sector.
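One way this breakdown could be computed, assuming a frame with one row per NAICS code carrying its t-SNE x-coordinate, loan count, and sector label (all values below are made up for illustration):

```python
import pandas as pd

codes = pd.DataFrame({
    "tsne_x": [-15.2, -3.1, 4.8, -12.7, 9.9],
    "loans":  [420, 150, 980, 310, 260],
    "sector": ["Construction", "Retail Trade",
               "Accommodation and Food Services",
               "Wholesale Trade", "Retail Trade"],
})

# Flag codes in the reduced-risk region identified in Figure 8A
codes["low_risk"] = codes["tsne_x"] < -10

# Loan-weighted fraction of each sector that falls in that region
by_sector = (codes.assign(low_loans=codes["loans"] * codes["low_risk"])
                  .groupby("sector")[["low_loans", "loans"]].sum())
by_sector["fraction_low_risk"] = by_sector["low_loans"] / by_sector["loans"]
print(by_sector["fraction_low_risk"])
```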

I inspected individual codes in many of the groups, and saw some additional interesting things.

First, many low-risk codes come from older vintages. NAICS codes are updated every 5 years, starting with the original 1997 code set. It turns out that many of the lowest risk codes come from that first vintage!

This dataset consists of loans with known outcomes (paid in full vs. in default). Older loans are highly likely to have reached a "done" state, while newer loans in good standing are not included. This might lead older loans to appear to have lower default rates.

Older codes are common in construction and wholesale trade, which underwent substantial NAICS revisions in 2002 [11]. For both sectors, nearly all of their low-risk codes are older; if old codes are removed, the overall risk of these sectors is much higher. This is a potential problem for models that assess new-loan risks using broad code groups (e.g. one-hot encoding at the sector level).

Depending on what a model is for (assessment of risk for a current portfolio? Deciding whether to approve a new loan?), you might handle this in different ways. It might be better to filter the data to years where loans have had time to reach completion, or to include current loans instead of focusing only on those paid in full. Or, it might be helpful to include a loan tenure feature.

There are other potentially important patterns in the codes:

  • In "Transportation and Warehousing", transportation is high risk and warehousing low, so this sector could be split to get better groups for loan defaults
  • Similarly, the "waste management" subgroup seems to be lower risk within "Administrative and Support and Waste Management and Remediation Services."
  • Caterers seem to be especially vulnerable within the already high risk "Accommodation and Food Services" sector

Some of these observations suggest alternative groupings of codes for analysis, or subgroups to examine in detail. It's possible to completely re-classify codes using embeddings, but on the other hand the standard groupings are accessible and familiar to humans. Some hybrid summarization with higher vs. lower buckets within sectors might bridge the gap.

Final Thoughts

I decided to write this series of blog posts because I've encountered categorical features across different jobs. My experience is that high-cardinality categoricals are very common in industry, but not emphasized in coursework, public datasets, and suchlike.

The original paper containing the SBA loans dataset [4] suggested a class assignment, involving data exploration and building a logistic regression model for a single industry. If I were a teacher, I might extend the assignment to all industries using machine learning techniques (specifically XGBoost and neural networks). Such a course would be very practical and involve a mix of basic and advanced skills likely to be useful in work. My series of blog posts could be a rough outline – I'm learning a lot, at least!

My last couple of posts have focused on stochastic data randomization for entity embeddings, which can greatly improve the performance of neural network models! In the future, I want to explore other ways to group codes, and see if an analogous stochastic randomization might help XGBoost models. Thanks for reading!

References

[1] V. Carey, Data Disruptions to Elevate Entity Embeddings (2024), Towards Data Science.

[2] V. Carey, No Label Left Behind: Alternative Encodings for Hierarchical Categoricals (2024), Towards Data Science.

[3] V. Carey, Exploring Hierarchical Blending in Target Encoding (2024), Towards Data Science.

[4] M. Li, A. Mickel and S. Taylor, Should This Loan be Approved or Denied?: A Large Dataset with Class Assignment Guidelines (2018), Journal of Statistics Education 26 (1). (CC BY 4.0)

[5] M. Toktogaraev, Should This Loan be Approved or Denied? (2020), Kaggle. (CC BY-SA 4.0)

[6] United States Census, North American Industry Classification System.

[7] V. Carey, GitHub Repository, https://github.com/vla6/Blog_naics_nn

[8] L. van der Maaten and G. Hinton, Visualizing Data using t-SNE (2008), Journal of Machine Learning Research 9: 2579–2605.

[9] C. M. Wilde, One Feature Attribution Method to (Supposedly) Rule Them All: Shapley Values (2018), Towards Data Science.

[10] S. Lundberg, Interpretable Machine Learning with XGBoost (2018), Towards Data Science.

[11] Bureau of the Census, Bridge Between 2002 NAICS and 1997 NAICS: 2002 (2002) 2002 Economic Census, Core Business Statistics Series, EC02–00C-BDG.
