Feature Engineering Techniques for Numerical Variables in Python

Feature Engineering is an essential step in a machine learning pipeline, where raw data is transformed into more meaningful features that help the model better understand the relationships in the data.
In practice, feature engineering means applying transformations to the available data, either overwriting existing variables or creating new ones, so that a model trained on the result performs better.
In this article, we will explore advanced feature engineering techniques for handling numeric values with Python's Scikit-Learn library (which can be used via the BSD 3-Clause License for this work), Numpy, and more to make your Machine Learning models even more effective.
In summary, by reading this article you will learn:
- A robust list of feature engineering techniques for numerical data from the Scikit-Learn, Numpy and Scipy suites to improve the performance of machine learning models
- Practical implementation of logarithmic and Box-Cox transformations to normalize distributions and linearize relationships in data
- Specific use cases for each technique and how they can reveal latent patterns, refine the representation of variables and improve the interpretability of models
This article is related to the guide to handling categorical variables in Python, which shows how to do feature engineering for non-numeric variables.
Use Cases
Feature optimization is a key element in improving the quality of machine learning models, especially when analyzing complex datasets. The targeted application of feature engineering techniques offers several advantages:
- Revealing latent patterns in data: This technique allows you to discover hidden relationships and structures that are not obvious at first glance.
- Refining the representation of variables: The process transforms raw data into formats that are more suitable for machine learning.
- Addressing challenges related to the distribution and intrinsic nature of data: This approach addresses issues such as skewness, outliers, and scalability of variables.
Applied carefully, these feature optimization techniques can lead to significant performance improvements in machine learning models.
These improvements are reflected in various aspects of model performance, from their predictive ability to their interpretability. The higher quality of the features used allows models to capture nuances and complex patterns in the data that might otherwise remain hidden.
Feature optimization also helps make models more robust and generalizable, which is essential for real-world applications, and reduces the possibility of overfitting.
Let's start with some useful feature engineering techniques.
Normalization
Probably the first numerical feature engineering technique a data scientist learns, normalization (in this form also known as standardization or scaling) changes a variable by subtracting its mean and dividing by its standard deviation.
Performing this transformation means that the resulting variable will have a mean of 0 and a standard deviation and variance of 1.
In machine learning, especially deep learning, having variables confined between specific values (say, just 0 and 1) helps the model converge to an optimal solution sooner.
This technique is a learned transformation: we use the training data to derive the mean and standard deviation, and these values are then used to transform new data.
Note that this transformation does not change the shape of the distribution; it only shifts and rescales the values.
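As a minimal sketch of the "learned transformation" idea, the snippet below computes the mean and standard deviation on a training array and reuses them on new data. The arrays and variable names are illustrative assumptions, not the dataset used later in this article.

```python
import numpy as np

# Hypothetical training and new data (one feature per column)
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])
X_new = np.array([[2.5, 350.0], [10.0, 1000.0]])

# "Fit": learn the statistics from the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# "Transform": apply the same statistics to both sets
X_train_scaled = (X_train - mean) / std
X_new_scaled = (X_new - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_new_scaled)                 # new data is not guaranteed to have mean 0
```

Keeping the "fit" and "transform" steps separate is what prevents information from the new data leaking into the statistics.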
Practical application
We will use the well-known Sklearn wine dataset for a classification task and compare performance with and without normalization, using confusion matrices computed with Sklearn.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
import numpy as np
X, y = load_wine(return_X_y=True)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
# Function to train the model and get the confusion matrix
def get_confusion_matrix(X_train, X_test, y_train, y_test):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return confusion_matrix(y_test, y_pred)
# Get the confusion matrix without normalization
cm_without_norm = get_confusion_matrix(X_train, X_test, y_train, y_test)
# Normalize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Get the confusion matrix with normalization
cm_with_norm = get_confusion_matrix(X_train_scaled, X_test_scaled, y_train, y_test)
# Create the figure with two side-by-side subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
# Function to create the heatmap
def plot_heatmap(ax, cm, title):
    sns.heatmap(cm, annot=True, fmt='d', cmap='viridis', ax=ax, cbar=False)
    ax.set_title(title, fontsize=16, pad=20)
    ax.set_xlabel('Predicted', fontsize=12, labelpad=10)
    ax.set_ylabel('Actual', fontsize=12, labelpad=10)
# Plot the heatmaps
plot_heatmap(ax1, cm_without_norm, 'Confusion Matrix\nWithout Normalization')
plot_heatmap(ax2, cm_with_norm, 'Confusion Matrix\nWith Normalization')
# Add a common colorbar
cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(vmin=0, vmax=np.max([cm_without_norm, cm_with_norm])))
fig.colorbar(sm, cax=cbar_ax)
# Adjust the layout and display the plot
plt.tight_layout(rect=[0, 0, 0.9, 1])
plt.show()

The improvement delta is about 30% – normalization on some algorithms has such a big impact that not applying it properly is a serious mistake on the part of the data scientist.
There are also variants of normalization. In Sklearn, two of them are [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html?ref=diariodiunanalista.it#sklearn.preprocessing.RobustScaler) and [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html?ref=diariodiunanalista.it#sklearn.preprocessing.MinMaxScaler).
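To give a feel for how these variants differ, here is a small sketch that applies the three scalers to the same column. The five values are a made-up example with one extreme point, chosen only to show how each scaler reacts to it.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one extreme value
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(X)      # (x - median) / interquartile range

# One column per scaler: standard, min-max, robust
print(np.hstack([standard, minmax, robust]).round(2))
```

With the default settings, RobustScaler relies on the median and interquartile range, so the four ordinary points keep a usable spread, while MinMaxScaler squeezes them close to 0 because the extreme value defines the maximum.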
A more complex graph showing the decision boundaries of the KNeighborsClassifier model is available in the Sklearn examples.

Polynomial Features
Polynomial features are useful for introducing nonlinearity into linear models. The Scikit-Learn PolynomialFeatures class allows you to generate both polynomial features and interaction terms between variables.
For a single feature x, the polynomial features are its powers:
- _x_² (squared)
- _x_³
- _x_⁴
- and so on
For models with multiple features (x_1, x_2, …, x_n), interaction terms can also be created:
- _x_₁ × _x_₂
- _x_₁² × _x_₂
- _x_₁ × _x_₂²
and so on.
The main goal is to allow linear models to learn nonlinear relationships in data without changing the underlying algorithm.
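The construction is simple enough to write by hand. The sketch below builds the degree-2 terms for two features with plain NumPy; the input values are arbitrary and only serve to make the columns easy to check.

```python
import numpy as np

# Hypothetical data: two features, three samples
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
x1, x2 = X[:, 0], X[:, 1]

# Degree-2 expansion: original features, squares, and the pairwise interaction
X_expanded = np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])
print(X_expanded)
```

A linear model trained on `X_expanded` is still linear in its coefficients, but each coefficient now weights a nonlinear function of the original inputs.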

Their main strength is that they significantly increase model flexibility, allowing even linear models to capture nonlinear relationships in the data. This translates into:
- Ability to model complex curves and surfaces in feature space
- Potential positive performance contribution on intrinsically nonlinear data
- The model, while capturing nonlinear relationships, maintains a linear underlying structure. This allows familiar analysis tools to be used and coefficients to be interpreted more easily than with complex nonlinear models, although this simplicity decreases with increasing polynomial degree.
Another crucial advantage is the ability to reveal hidden interactions between variables. In domains such as physics or economics, where relationships are often nonlinear, this feature is particularly valuable.
As for the disadvantages:
- It rapidly increases the dimensionality of the dataset, creating additional columns for each feature submitted to the algorithm
- It can lead to overfitting if used excessively
- It requires more computational resources, precisely because of the need to process a greater number of features
From a practical point of view, the implementation of polynomial features is relatively simple thanks to Sklearn in Python. We will see how to do it shortly.
Practical application
PolynomialFeatures is a Scikit-learn class used to generate polynomial features. It is found in the sklearn.preprocessing module.
The object transforms a 2-dimensional input array (samples × features) into a new array containing all the polynomial terms up to a specified degree. For example, if the original features are [a, b], with degree=2, the resulting features will be [1, a, b, a², ab, b²].
The arguments of the object are as follows:
- degree (int, default=2): The degree of the polynomial. Determines the maximum degree of the generated polynomial terms.
- interaction_only (bool, default=False): If True, only interaction terms are generated; powers of individual features are not produced.
- include_bias (bool, default=True): If True, includes a column of 1s (the bias term). Useful when using the result with models that do not have a separate intercept term.
- order ('C' or 'F', default='C'): Memory layout of the output array in the dense case: 'C' for C-style (row-major) order, 'F' for Fortran-style (column-major) order.
Here's an example of how to implement the object in Sklearn.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.array([[1, 2], [3, 4]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly)
# Output: [[1. 2. 1. 2. 4.]
# [3. 4. 9. 12. 16.]]
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out
print(poly.get_feature_names_out(['x1', 'x2']))
# Output: ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']
Now these features can inform a machine learning model and possibly help it perform better.
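As a rough illustration of that last point, the sketch below fits the same LinearRegression twice on a made-up, noise-free quadratic relationship: once on the raw feature and once through a pipeline that first expands it with PolynomialFeatures. The data and coefficients are assumptions chosen so that the result is easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free quadratic relationship: y = 1 + 2x + 3x^2
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2

# Plain linear regression cannot follow the curve
linear = LinearRegression().fit(x, y)

# Same algorithm, but fed with the expanded features [x, x^2]
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

reg = poly_model.named_steps["linearregression"]
print("R^2 linear:    ", linear.score(x, y))
print("R^2 polynomial:", poly_model.score(x, y))
print("coefficients:  ", reg.coef_, "intercept:", reg.intercept_)
```

Because `include_bias=False` is used, the intercept is left to LinearRegression rather than being duplicated as a column of ones.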
FunctionTransformer
[FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) is a versatile tool in Scikit-learn that allows you to incorporate custom functions into your data transformation process. It applies an arbitrary function to your data as part of a preprocessing or feature engineering pipeline. Essentially, it turns a Python function into a "transformer" object (in the Sklearn sense, not the deep learning architecture) that is compatible with the Scikit-learn API.
FunctionTransformer takes a Python function as its main input and creates a transformer object that, when applied to data, executes that function. It can also be combined with other transformers or used within a Scikit-learn pipeline.
A concrete example is applying the object to a time series to extract trigonometric features.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 4))
average_week_demand = df.groupby(["weekday", "hour"])["count"].mean()
average_week_demand.plot(ax=ax)
_ = ax.set(
    title="Average hourly bike demand during the week",
    xticks=[i * 24 for i in range(7)],
    xticklabels=["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
    xlabel="Time of the week",
    y

Typical usage of FunctionTransformer is seen in:
- creating complex or domain-specific features
- applying unusual mathematical operations to data
- integrating existing preprocessing logic into Scikit-learn pipelines
FunctionTransformer then acts as a bridge between custom Python functions and Scikit-learn functions, providing flexibility in data preprocessing and feature engineering.
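To make the "bridge" idea concrete, here is a minimal sketch that wraps two NumPy functions in a FunctionTransformer and chains it with a StandardScaler. The four input values are invented, and the pairing of `np.log1p` with `np.expm1` is just one convenient choice of function and inverse.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical right-skewed feature
X = np.array([[1.0], [10.0], [100.0], [1000.0]])

# Wrap NumPy functions; inverse_func makes the step reversible
log_step = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

pipeline = make_pipeline(log_step, StandardScaler())
X_t = pipeline.fit_transform(X)
X_back = pipeline.inverse_transform(X_t)

print(X_t.ravel())
print(X_back.ravel())  # recovers the original values up to floating-point error
```

Supplying `inverse_func` is optional, but it lets the whole pipeline be inverted, which is handy when predictions need to be mapped back to the original scale.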
Practical Application
Let's apply it to create trigonometric transformations of the time series shown above.
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
import matplotlib.pyplot as plt
def sin_transformer(period):
    return FunctionTransformer(lambda x: np.sin(x / period * 2 * np.pi))
def cos_transformer(period):
    return FunctionTransformer(lambda x: np.cos(x / period * 2 * np.pi))
hour_df = pd.DataFrame(
    np.arange(26).reshape(-1, 1),
    columns=["hour"],
)
hour_df["hour_sin"] = sin_transformer(24).fit_transform(hour_df)["hour"]
hour_df["hour_cos"] = cos_transformer(24).fit_transform(hour_df)["hour"]
hour_df.plot(x="hour")
_ = plt.title("Trigonometric encoding of the 'hour' feature")

This is just one example of its application; FunctionTransformer is used widely in time series work.
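Why bother with sine and cosine at all? A short sketch, using only NumPy and a hypothetical helper named `encode_hour`, shows the property that makes this encoding attractive: hours that are close on the clock stay close after encoding, even across midnight.

```python
import numpy as np

def encode_hour(hour, period=24):
    """Map an hour to a point on the unit circle (sin, cos)."""
    angle = 2 * np.pi * np.asarray(hour) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

hours = np.array([0, 1, 12, 23])
encoded = encode_hour(hours)

def dist(i, j):
    """Euclidean distance between two encoded hours (by position in `hours`)."""
    return np.linalg.norm(encoded[i] - encoded[j])

print("raw |23 - 0|     =", abs(23 - 0))
print("encoded 23 vs 0  =", dist(3, 0))
print("encoded 1 vs 0   =", dist(1, 0))
print("encoded 12 vs 0  =", dist(2, 0))
```

On the raw scale, 23:00 and 00:00 look 23 units apart; on the circle they are exactly as close as 00:00 and 01:00, while noon sits at the opposite side.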
KBinsDiscretizer
KBinsDiscretizer is a preprocessing class in Scikit-learn designed to transform continuous features into discrete categorical features. This process is known as discretization, quantization, or binning. Some datasets with continuous features can benefit from discretization, because it can transform the dataset with continuous attributes into one with only nominal attributes.
Its goal is to divide the range of a continuous variable into a specific number of intervals (or bins). Each original value is then replaced with the label of the bin it falls into.
The algorithm works as follows:
- Analyzes the distribution of the continuous feature
- Creates a predefined number of bins based on this distribution
- Assigns each original value to the appropriate bin
- Replaces the original values with the bin labels or with one-hot encodings of the bins
Key parameters:
n_bins: Number of bins to create. Can be an integer or an array of integers for different bins per feature.
encode: Method to encode the bins (onehot, ordinal, or onehot-dense).
- onehot: Encodes the transformed result with one-hot encoding and returns a sparse matrix. Ignored features are always stacked on the right.
- onehot-dense: Encodes the transformed result with one-hot encoding and returns a "dense" array (i.e., not in sparse format).
- ordinal: Returns the bin encoded as an integer value.
strategy: Strategy to define bin boundaries (uniform, quantile, or kmeans).
- uniform: Creates bins of equal width.
- quantile: Creates bins for each feature containing the same number of points.
- kmeans: Defines bins using k-means clustering.
Some considerations:
- The choice of the number of bins and the strategy can significantly affect the results.
- It can lead to information loss, especially with few bins (like when we plot a histogram with few groups).
- Useful for algorithms sensitive to non-normal distributions or non-linear relationships.
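To see how much the strategy matters, the sketch below reproduces the idea behind the uniform and quantile strategies with plain NumPy on a small, made-up skewed feature. It is a simplified illustration; KBinsDiscretizer's own edge handling may differ in details.

```python
import numpy as np

# Hypothetical skewed feature: most values are small, a few are large
x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 20.0, 100.0])
n_bins = 4

# Uniform strategy: equal-width bins between min and max
uniform_edges = np.linspace(x.min(), x.max(), n_bins + 1)

# Quantile strategy: edges at evenly spaced quantiles
quantile_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))

# Ordinal encoding: index of the bin each value falls into.
# Only the interior edges are passed, so the result lies in [0, n_bins - 1].
uniform_bins = np.digitize(x, uniform_edges[1:-1])
quantile_bins = np.digitize(x, quantile_edges[1:-1])

print("uniform :", uniform_bins)
print("quantile:", quantile_bins)
```

With equal-width bins, the single large value stretches the range so far that almost every point lands in the first bin, whereas quantile-based edges spread the points much more evenly.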
Practical Application
We will see the application by looking at the performance of a linear regression and decision tree in learning continuous vs. discretized patterns.
We will create a fake dataset of random but semi-linear numbers, apply the models to the continuous data and then the same dataset but with discretized features.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor
# Setting the random number generator for reproducibility
rnd = np.random.RandomState(42)
# Creating the dataset
X = rnd.uniform(-3, 3, size=100) # 100 points between -3 and 3
y = np.sin(X) + rnd.normal(size=len(X)) / 3 # Sine function with added noise
X = X.reshape(-1, 1) # Reshape to the correct format for sklearn
# Applying KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=10, encode="onehot")
X_binned = discretizer.fit_transform(X)
# Preparing for visualization
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12, 5))
line = np.linspace(-3, 3, 1000).reshape(-1, 1) # Points for plotting
# Function to train and plot models
def train_and_plot(X_train, X_plot, ax, title):
    # Linear regression
    linear_reg = LinearRegression().fit(X_train, y)
    ax.plot(line, linear_reg.predict(X_plot), linewidth=2, color="green", label="Linear Regression")
    # Decision tree
    tree_reg = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(X_train, y)
    ax.plot(line, tree_reg.predict(X_plot), linewidth=2, color="red", label="Decision Tree")
    # Original data
    ax.plot(X[:, 0], y, "o", c="k", alpha=0.5)
    ax.legend(loc="best")
    ax.set_xlabel("Input Feature")
    ax.set_title(title)
# Plot for original data
train_and_plot(X, line, ax1, "Results Before Discretization")
ax1.set_ylabel("Regression Output")
# Plot for discretized data
line_binned = discretizer.transform(line)
train_and_plot(X_binned, line_binned, ax2, "Results After Discretization")
plt.tight_layout()
plt.show()

Linear regression improves significantly after discretization, capturing nonlinearity better. The decision tree shows less change, as it is already capable of handling nonlinearity. This example illustrates how discretization can help linear models capture nonlinear relationships, potentially improving performance in some scenarios.
Logarithmic Transformation
The primary advantage of logarithmic transformation is its ability to compress the range of values, which is particularly useful for data with high variability or outliers.
- Compression of Range: The logarithmic transformation reduces the distance between the largest values while keeping smaller values relatively unchanged. This helps in normalizing skewed distributions, making right-tailed distributions more symmetric and closer to a normal distribution.
- Linearization: It can linearize non-linear relationships. For instance, it converts exponential relationships into linear ones, which simplifies analysis and improves the performance of models assuming linearity between variables.
- Handling Outliers: The transformation effectively manages extreme data, allowing outliers to be handled without removal, thus preserving potentially important information.
- Mathematical Definition: The most common logarithmic transformation uses the natural logarithm (base e), defined as y = ln(x), where x is the original value and y is the transformed value. Note that this transformation is only defined for positive values of x, and may require adding a constant if zeros or negative values are present.
- Feature Scaling: It can be used as a feature scaling technique, complementing or replacing methods like standardization or min-max normalization. Additionally, it can enhance the performance of models such as linear regression, which benefit from features with more symmetric distributions.
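The linearization point is easy to check numerically. In the sketch below, a made-up exponential relationship becomes a straight line once the target is log-transformed, so an ordinary degree-1 fit recovers its parameters. The constants 2 and 0.5 are arbitrary assumptions.

```python
import numpy as np

# Hypothetical exponential relationship: y = 2 * exp(0.5 * x)
x = np.linspace(0, 10, 50)
y = 2.0 * np.exp(0.5 * x)

# After taking logs, ln(y) = ln(2) + 0.5 * x is a straight line in x
slope, intercept = np.polyfit(x, np.log(y), deg=1)

print("slope:    ", slope)       # close to 0.5
print("intercept:", intercept)   # close to ln(2)
print("ln(2):    ", np.log(2.0))
```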
Practical Application
In machine learning, the log transform is most often used to normalize a variable that is not normally distributed.
A well-known example is annual earnings: you often want to predict this variable, but its skewed distribution is inconvenient to work with, especially with algorithms that do not model nonlinear data well.
A logarithmic transformation, applied with numpy, brings the variable closer to a normal distribution and makes it easier to predict.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create an example dataset with positively skewed values
np.random.seed(42)
data = {
    'Income': np.random.exponential(scale=50000, size=1000) # Exponential distribution to simulate skewness
}
df = pd.DataFrame(data)
# Create a figure with two side-by-side subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Plot of the original distribution
axes[0].hist(df['Income'], bins=50, color='blue', alpha=0.7)
axes[0].set_title('Original Income Distribution')
axes[0].set_xlabel('Income')
axes[0].set_ylabel('Frequency')
# Apply the logarithmic transformation
df['Log_Income'] = np.log1p(df['Income']) # log1p is equivalent to log(x + 1)
# Plot of the transformed distribution
axes[1].hist(df['Log_Income'], bins=50, color='green', alpha=0.7)
axes[1].set_title('Log-transformed Income Distribution')
axes[1].set_xlabel('Log_Income')
axes[1].set_ylabel('Frequency')
# Show the plot
plt.tight_layout()
plt.show()

PowerTransformer
PowerTransformer is a class in Sklearn's preprocessing module that makes data more Gaussian-like. This is useful for modeling problems related to heteroskedasticity (i.e. non-constant variance) or other situations where normality is desired.
Currently, PowerTransformer supports the Box-Cox and Yeo-Johnson transformations. The optimal parameter for stabilizing variance and minimizing skewness is estimated by maximum likelihood.
Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
In the context of machine learning, these transformations address several common challenges:
- Data Normalization: Many machine learning algorithms, such as linear regression, neural networks, and some clustering methods, assume that the data follows a normal distribution. The PowerTransformer can transform skewed or heavy-tailed distributions into shapes closer to Gaussian, which can enhance the performance of these models.
- Variance Stabilization: In real datasets, the variance of features often changes with their magnitude, a phenomenon known as heteroskedasticity. This can compromise the effectiveness of many algorithms. PowerTransformer helps stabilize variance, making it more consistent across different ranges of feature values.
- Relationship Linearization: Some algorithms, like linear regression, assume linear relationships between variables. PowerTransformer can linearize nonlinear relationships, broadening the applicability of these models to more complex datasets.
Box-Cox Transformation
The Box-Cox transformation is a family of power transformations that can stabilize variance and make data more normally distributed. It is mathematically defined as:
- y = (x^λ - 1) / λ if λ ≠ 0
- y = log(x) if λ = 0
where:
- x is the original value,
- y is the transformed value,
- λ is the transformation parameter
The Box-Cox transformation is applied to positive data and requires the parameter λ to be estimated from the data to find the best transformation that normalizes the data.
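For a fixed λ, the Box-Cox transformation fits in a few lines of NumPy. The function below is a hand-written sketch for illustration (the name `box_cox` and the sample values are assumptions); it does not estimate λ, it only applies a value you pass in.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive x for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

x = np.array([0.5, 1.0, 2.0, 4.0, 10.0])

print(box_cox(x, 1.0))    # lambda = 1: a simple shift, x - 1
print(box_cox(x, 0.5))    # lambda = 0.5: square-root-like compression
print(box_cox(x, 0.0))    # lambda = 0: natural log
print(box_cox(x, 1e-6))   # very small lambda: approaches the log
```

The last two lines show why the logarithm is used at λ = 0: it is the limit of the power formula as λ shrinks, so the family is continuous in λ.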
PowerTransformer behaves like a Sklearn estimator and supports the .fit() and .transform() methods.
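A minimal sketch of that fit/transform workflow is shown below, on synthetic log-normal data generated only for illustration. The estimated λ is stored in the `lambdas_` attribute after fitting, one value per feature.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical strictly positive, right-skewed data
rng = np.random.default_rng(0)
X_train = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))
X_new = rng.lognormal(mean=0.0, sigma=1.0, size=(5, 1))

pt = PowerTransformer(method="box-cox")  # standardize=True by default
pt.fit(X_train)                          # estimates one lambda per feature

print("estimated lambda:", pt.lambdas_)

X_new_t = pt.transform(X_new)            # reuses the lambda learned on X_train
X_new_back = pt.inverse_transform(X_new_t)
print("round trip ok:", np.allclose(X_new_back, X_new))
```

Since the logarithm of log-normal data is normal, we would expect the estimated λ to land near 0 here, which corresponds to the log branch of Box-Cox.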
I won't go into detail about the Yeo-Johnson transform – just know that it's based on the Box-Cox transform and allows negative values.
Practical Application
As mentioned, the Yeo-Johnson transform is based on the Box-Cox transform, but the value estimated for lambda can differ. The two transformations are therefore distinct and can give different results on the same data.
In Python, just pass the name of the transformation method as a string to the PowerTransformer object:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer
# Generate data with both positive and negative values
np.random.seed(42)
data_positive = np.random.exponential(scale=2, size=1000) # Positive values
data_negative = -np.random.exponential(scale=0.5, size=200) # Negative values
data = np.concatenate([data_positive, data_negative]) # Combine positive and negative data
# Create two instances of PowerTransformer: one for Yeo-Johnson and one for Box-Cox for comparison
pt_yj = PowerTransformer(method='yeo-johnson', standardize=False) # Yeo-Johnson transformation
pt_bc = PowerTransformer(method='box-cox', standardize=False) # Box-Cox transformation
# Apply the transformations
data_yj = pt_yj.fit_transform(data.reshape(-1, 1)) # Apply Yeo-Johnson transformation
# Box-Cox requires positive data, so we add an offset to make all values positive
data_offset = data - np.min(data) + 1e-6 # Offset to ensure all values are positive
data_bc = pt_bc.fit_transform(data_offset.reshape(-1, 1)) # Apply Box-Cox transformation
# Visualize the results
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))
# Histogram of original data
ax1.hist(data, bins=50, edgecolor='black')
ax1.set_title("Original Data")
ax1.set_xlabel("Value")
ax1.set_ylabel("Frequency")
# Histogram of data after Yeo-Johnson transformation
ax2.hist(data_yj, bins=50, edgecolor='black')
ax2.set_title("Yeo-Johnson Transformation")
ax2.set_xlabel("Value")
# Histogram of data after Box-Cox transformation
ax3.hist(data_bc, bins=50, edgecolor='black')
ax3.set_title("Box-Cox Transformation")
ax3.set_xlabel("Value")
plt.tight_layout()
plt.show()

QuantileTransformer
A quantile transformation maps the distribution of a variable to another target distribution. Using Sklearn's QuantileTransformer class, you can convert a non-normal distribution to a desired distribution.
Consider any distribution of events: each event has an associated probability of occurring. This behavior is described by the cumulative distribution function (CDF), which differs for each distribution.
The quantile function, also called the percent-point function (PPF), is the inverse of the CDF: while the CDF returns the probability of a value being equal to or less than a given value, the PPF returns the value at or below which a given probability falls.
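The two steps, CDF then inverse CDF, can be sketched by hand. The version below is a simplified rank-based illustration using NumPy and the standard library's `statistics.NormalDist`; it is not how QuantileTransformer is implemented internally, and the exponential sample is synthetic.

```python
import numpy as np
from statistics import NormalDist

# Hypothetical skewed sample
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)

# Step 1: empirical CDF. Ranks are turned into probabilities strictly
# inside (0, 1) so that the inverse normal CDF stays finite.
ranks = np.argsort(np.argsort(x))
u = (ranks + 0.5) / len(x)

# Step 2: quantile function (inverse CDF) of the target distribution
std_normal = NormalDist(mu=0.0, sigma=1.0)
z = np.array([std_normal.inv_cdf(p) for p in u])

print("mean:", z.mean(), "std:", z.std())
```

The transformation only depends on the ranks of the values, which is why the ordering of the data is preserved while its shape is replaced by that of the target distribution.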
In the context of outlier detection, QuantileTransformer can be used to transform data so that outliers become more visible. By transforming the data into a uniform distribution, outliers will be mapped to the extremes of the distribution, making them more distinguishable from inliers.
The QuantileTransformer can force any arbitrary distribution into a Gaussian, provided there are sufficient training samples (thousands). Since it is a non-parametric method, it is more difficult to interpret than the parametric ones (Box-Cox and Yeo-Johnson).
Practical Application
Sklearn helps us again with the dedicated QuantileTransformer object with an important parameter: output_distribution, which can accept the values "uniform" or "normal". These represent the distribution to which the data is mapped.
import numpy as np
from sklearn.preprocessing import QuantileTransformer
import matplotlib.pyplot as plt
# Create a sample dataset with a skewed distribution
np.random.seed(0)
data = np.random.exponential(scale=2, size=(1000, 1)) # Exponential distribution
# Initialize the QuantileTransformer
quantile_transformer = QuantileTransformer(n_quantiles=100, output_distribution='normal')
# Apply the transformation
data_transformed = quantile_transformer.fit_transform(data)
# Visualize the original and transformed data
plt.figure(figsize=(12, 6))
# Histogram of original data
plt.subplot(1, 2, 1)
plt.hist(data, bins=50, color='blue', edgecolor='black')
plt.title("Original Data (Exponential)")
# Histogram of transformed data
plt.subplot(1, 2, 2)
plt.hist(data_transformed, bins=50, color='green', edgecolor='black')
plt.title("Transformed Data (Normal)")
plt.show()

Examples of transformations: from specific distributions to normal
Below is a graphical visualization that compares different non-normal distributions and their relative transformation, going through some of the techniques we explored.
I used the image in the Sklearn documentation as a reference, modifying the ordering of the graphs for easier reading.

This image highlights the limitations of some transformations, which do not always succeed. For example, for the bimodal distribution, all attempts to transform to the normal curve fail except the quantile transformation.
Principal Component Analysis
PCA transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. These principal components are ordered so that the first ones contain most of the variance present in the original variables.
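Both properties, uncorrelated components and decreasing variance, can be checked with a short NumPy sketch based on the singular value decomposition. The three-feature dataset below is synthetic and built so that two columns are strongly correlated; it stands in for real data purely for illustration.

```python
import numpy as np

# Hypothetical data: two strongly correlated features plus one independent feature
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Center the data, then take the SVD; rows of Vt are the principal axes
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = X_centered @ Vt.T

# Variance carried by each component, in decreasing order
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

print("explained variance ratio:", explained_ratio.round(3))
print("covariance of components:")
print(np.cov(components, rowvar=False).round(6))
```

The covariance matrix of the projected data is diagonal, which is the "linearly uncorrelated" property, and its diagonal entries are the variances sorted from largest to smallest.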
I wrote a detailed article on what PCA is, if you are interested.