Exploratory Analysis of MEMS Sensor Data


MEMS (Micro-electromechanical Systems) sensors are widely used in different applications, from game controllers and smartphones to unmanned aerial vehicles. In this article, I will show how to connect a combined gyroscope and accelerometer sensor, what kind of data we can get from it, and how this data can be processed and visualized.

Let's get started.

Hardware

The MPU-6050 is a 6-axis sensor that combines a 3-axis gyroscope and a 3-axis accelerometer and communicates over the I2C interface. As written in the datasheet, it is widely used in tablets and smartphones. When a smartphone or smartwatch counts steps and calories during a workout, it actually uses data from MEMS sensors like this. But such sensors can be used for more than just sports. I decided to place the sensor in my apartment for several days and figure out whether I would be able to detect and analyze different vibrations in the building where I live.

If we want to collect data over several days, the Raspberry Pi is a good solution for that. The Raspberry Pi is a cheap ($30–50) single-board computer; it has low power consumption and plenty of pins to connect different types of hardware. An MPU-6050 prototyping board can be ordered on Amazon for $3–5. The sensor itself uses the I2C bus for data transfer, and it can be connected to a Raspberry Pi using only 4 wires:

Connection diagram, Image by author

Before using the sensor, the I2C bus should be enabled on the Raspbian OS, for example via sudo raspi-config under "Interface Options" (there are enough tutorials about how to connect the MPU6050 to the Raspberry Pi, so I will skip the other "hardware" details here). After connecting the sensor, I created a simple Python application that reads the sensor data and writes it "as is" into log files:

from datetime import datetime
import smbus

# MPU6050 Registers
PWR_MGMT_1   = 0x6B
SMPLRT_DIV   = 0x19
CONFIG       = 0x1A
GYRO_CONFIG  = 0x1B
INT_ENABLE   = 0x38
ACCEL_XOUT_H = 0x3B
ACCEL_YOUT_H = 0x3D
ACCEL_ZOUT_H = 0x3F
GYRO_XOUT_H  = 0x43
GYRO_YOUT_H  = 0x45
GYRO_ZOUT_H  = 0x47

bus = smbus.SMBus(1)
address = 0x68

def device_init():
    """ Init the MPU-6050 """
    bus.write_byte_data(address, SMPLRT_DIV, 0x4)   # Sample rate divider
    bus.write_byte_data(address, PWR_MGMT_1, 1)     # Wake up, clock from the X gyro PLL
    bus.write_byte_data(address, CONFIG, 0)         # No external sync, DLPF disabled
    bus.write_byte_data(address, GYRO_CONFIG, 24)   # Gyro full-scale range: +/-2000 deg/s
    bus.write_byte_data(address, INT_ENABLE, 1)     # Enable the data-ready interrupt

def read_byte(reg):
    """ Read 1 byte from the sensor """
    return bus.read_byte_data(address, reg)

def read_word(reg):
    """ Read 2 bytes from the sensor """ 
    h = bus.read_byte_data(address, reg)
    l = bus.read_byte_data(address, reg + 1)
    value = (h << 8) + l
    return value

def read_word_2c(reg):
    """ Read a 16-bit value and convert it from two's complement """
    val = read_word(reg)
    # Values >= 0x8000 represent negative numbers in two's complement
    return -((65535 - val) + 1) if val >= 0x8000 else val

def device_read():
    """ Get accel and gyro data """
    # 131 LSB/(deg/s) is the datasheet sensitivity for the +/-250 deg/s range;
    # GYRO_CONFIG above selects +/-2000 deg/s, so these values are proportional
    # rather than calibrated deg/s, which is fine for relative vibration analysis
    g_x = read_word_2c(GYRO_XOUT_H) / 131
    g_y = read_word_2c(GYRO_YOUT_H) / 131
    g_z = read_word_2c(GYRO_ZOUT_H) / 131
    # 16384 LSB/g is the accelerometer sensitivity for the default +/-2g range
    a_x = read_word_2c(ACCEL_XOUT_H) / 16384
    a_y = read_word_2c(ACCEL_YOUT_H) / 16384
    a_z = read_word_2c(ACCEL_ZOUT_H) / 16384
    return g_x, g_y, g_z, a_x, a_y, a_z

if __name__ == "__main__":
    device_init()
    device_read()

    while True:
        timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
        gyro_x1, gyro_y1, gyro_z1, accel_x1, accel_y1, accel_z1 = device_read()
        gyro_x2, gyro_y2, gyro_z2, accel_x2, accel_y2, accel_z2 = device_read()
        g_x, g_y, g_z = (gyro_x1 + gyro_x2)/2, (gyro_y1 + gyro_y2)/2, (gyro_z1 + gyro_z2)/2
        a_x, a_y, a_z = (accel_x1 + accel_x2)/2, (accel_y1 + accel_y2)/2, (accel_z1 + accel_z2)/2
        s_data = f"{timestamp},{g_x: .7f},{g_y: .7f},{g_z: .7f},{a_x: .7f},{a_y: .7f},{a_z: .7f}"

        # Save to log file
        log_filename = datetime.now().strftime('%Y-%m-%d.log')
        with open(log_filename, "a", encoding="ascii") as log_out:
            log_out.write(s_data + "\n")

In the "production" scenario, we can send data to a Kafka topic or to any other cloud provider, but for the "home" test, it was enough just to run the application in the background:

nohup python3 accel_read.py >/dev/null 2>&1 &

After that, we can leave the Raspberry Pi running for several days.
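As a side note, the "production" path mentioned above could look roughly like the following minimal sketch, based on the kafka-python package (the broker address and topic name are placeholders, not from the original setup):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092")  # placeholder broker address

def send_measurement(s_data: str):
    """ Send one CSV line to a Kafka topic instead of a local log file """
    producer.send("mpu6050-data", s_data.encode("ascii"))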

As we can see from the logging code above, all log files follow a "YYYY-MM-DD.log" naming pattern. We can download these files from the Raspberry Pi using scp:

scp pi@raspberrypi3:/home/pi/Documents/AccelData/2023-08-01.log data

Now let's see what kind of data we can get.

General Insights

First, let's see what the gyroscope and accelerometer data look like. We start by importing the required libraries:

import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from bokeh.plotting import figure, show
from bokeh.models import Range1d, DatetimeTickFormatter
from bokeh.io import output_notebook
from bokeh.layouts import row, column, gridplot
output_notebook()

Now let's load the CSV into a Pandas dataframe and draw it using Bokeh:

df_sample = pd.read_csv("mpu6050.csv", 
                        header=None, 
                        names=["timestamp", "g_x", "g_y", "g_z", "a_x", "a_y", "a_z"],
                        parse_dates=["timestamp"])
display(df_sample)

This sample contains records collected within about 6 seconds, and the values look like this:
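The sampling rate can also be verified directly from the timestamps (a quick check on the sample dataframe):

# Average interval between samples gives the effective sampling rate
rate = 1.0 / df_sample["timestamp"].diff().dt.total_seconds().mean()
print(f"~{rate:.0f} samples per second")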

As we can see, we can get about 60 measurements per second from the sensor. Let's draw the data:

timestamps = df_sample['timestamp']
# Accelerometer data
p1 = figure(title="Accelerometer data", x_axis_type='datetime', 
            x_axis_label='x', y_axis_label='y', width=1600, height=600)
p1.line(timestamps, df_sample["a_x"], legend_label="A_X", line_width=2, color="red")
p1.line(timestamps, df_sample["a_y"], legend_label="A_Y", line_width=2, color="green")
p1.line(timestamps, df_sample["a_z"], legend_label="A_Z", line_width=2, color="blue")
# Gyroscope data
p2 = figure(title="Gyroscope data", x_axis_type='datetime', 
            x_axis_label='x', y_axis_label='y', width=1600, height=600)
p2.line(timestamps, df_sample["g_x"], legend_label="G_X", line_width=2, color="#AA8822")
p2.line(timestamps, df_sample["g_y"], legend_label="G_Y", line_width=2, color="#AA88AA")
p2.line(timestamps, df_sample["g_z"], legend_label="G_Z", line_width=2, color="#2288AA")
show(column(p1, p2))

By the way, the Bokeh library is good for plotting data like this. It turned out that, at least on my computer, Matplotlib practically "died" when the number of points became larger than several thousand. At the same time, Bokeh can handle even 1 million records in one graph.

The output looks like this:

Accelerometer and gyroscope data, Image by author

It is also important to understand the difference between a gyroscope and an accelerometer. An accelerometer (top graph) measures linear acceleration, including the constant pull of the Earth's gravity. In this example, I was slowly rotating the board in my hand, and the X, Y, and Z components changed proportionally. A gyroscope (bottom graph) measures the angular velocity around each axis. Its output looks like a derivative of the accelerometer data: when a movement starts, there is a peak, and then the values return to zero.
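As a side note, because the accelerometer senses gravity, a static tilt angle can be estimated directly from its readings. Here is a quick sketch (not part of the logging code; this is the standard roll/pitch estimate from the gravity vector):

import numpy as np

def tilt_angles(a_x, a_y, a_z):
    """ Estimate roll and pitch in degrees from static accelerometer data """
    roll = np.degrees(np.arctan2(a_y, a_z))
    pitch = np.degrees(np.arctan2(-a_x, np.hypot(a_y, a_z)))
    return roll, pitch

print(tilt_angles(0.0, 0.0, 1.0))  # board lying flat => (0.0, 0.0)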

I was not going to analyze step data in this article, but some readers may still be curious about what the raw sensor data of human steps looks like:

As we can see, especially on the gyroscope graph, the pattern is quite straightforward to detect. For the purpose of this article, it is more challenging to see whether we can detect much subtler variations in the data, like the vibrations of the building itself.

Data Analysis

In the previous steps, we saw how to make a simple application to collect sensor data and got a general insight into what this data looks like. Now, let's see in more detail what interesting patterns can be found. For all further examples, I will be using data collected within 24 hours.

1. Timestamp Accuracy

As a reminder, sensor data was collected using a Python application running on a Raspberry Pi. The Raspberry Pi itself is running Linux, which is not a real-time OS. First, let's see the timestamp accuracy we have.

df = pd.read_csv("data/2023-08-06.log", header=None, 
                 names=["timestamp", "g_x", "g_y", "g_z", "a_x", "a_y", "a_z"], parse_dates=["timestamp"])

t_diff = df["timestamp"].diff().dt.total_seconds()[1:]
diff_mean = t_diff.mean()
print(diff_mean, 1/diff_mean)

#> 0.0156 63.81

I collected 5,513,693 records within 24 hours; the total file size is about 500 MBytes. As we can see, the average difference between timestamps is 0.015 s, and the average sampling rate is about 64 samples per second. But how consistent is it? Let's create a histogram of the time differences:

t_diff = df["timestamp"].diff().dt.total_seconds()[1:]

h, bins = np.histogram(t_diff.values, bins=1024)
print(list(zip(h, bins))[:100])

#> [(159712, 0.010), (5349277, 0.015), (4134, 0.0199), (293, 0.02462), 
#>  (96, 0.0293), (28, 0.0339), (10, 0.0386), (7, 0.043), (21, 0.048),
#>  ...
#>  (1, 0.1650), (1, 0.1697), (0, 0.1743), (1, 0.1790), (0, 0.1837), ...]

# Convert X to milliseconds and normalize Y to 0..100%
n_total = sum(h)
h = 100*h/n_total
bins *= 1000

# Create the bar plot
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.bar(bins[0:8], h[0:8], color='#440154', width=2.0)
ax.yaxis.label.set_color('gray')
ax.spines['left'].set_color('#DDDDDD')
ax.spines['right'].set_color('#DDDDDD')
ax.spines['top'].set_color('#DDDDDD')
ax.spines['bottom'].set_color('gray')
ax.xaxis.label.set_color('black')
ax.tick_params(axis='y', colors='#303030')
plt.xlabel("Timestamp difference, milliseconds")
plt.ylabel("Percentage of records")
plt.title("Timestamps accuracy")
plt.show()

We can see that indeed, Raspbian is not an RTOS, but the accuracy is good enough for our task:

Timestamps difference in milliseconds, Image by author

5,349,277 records (about 97%) have a delay of about 0.015 s (15 ms), and fewer than 50 records (under 0.001%) have intervals longer than 0.1 s (100 ms).

2. Sonogram

Let's move on to the more fun stuff. We obviously cannot analyze 5 million records with the naked eye, so let's build a sonogram, which will show whether there are any anomalies in the frequency domain:

def draw_sonogram(df_out: pd.DataFrame, t_start: datetime.time, t_end: datetime.time):
    """ Draw a sonogram of the gyroscope data between t_start and t_end """
    # Keep only the records within the requested time window
    mask = (df_out["timestamp"].dt.time >= t_start) & (df_out["timestamp"].dt.time < t_end)
    df_ = df_out[mask]
    values = df_["g_y"].values
    t_diff = df_["timestamp"].diff().dt.total_seconds()[1:].mean()  # ~0.015 s => ~64 Hz

    fig, ax = plt.subplots(1, 1, figsize=(24, 14))
    ax.specgram(values, NFFT=256, Fs=1/t_diff, noverlap=50, scale="dB")
    plt.ylabel('Frequency, Hz')
    plt.xlabel('Time, sec')
    plt.show()

draw_sonogram(df, datetime.time(9, 0, 0), datetime.time(10, 0, 0))

The sonogram is based on the Fast Fourier Transform (FFT), which converts the signal from the time domain to the frequency domain. The maximum frequency on the sonogram is about 32 Hz, which is half of the ~64 Hz sampling rate, according to the Nyquist theorem. Doing these calculations manually would require a lot of work, but Matplotlib's "specgram" method does it all for us. The result looks like this:

Sonogram of the sensor data, Image by author

As we can see, there are some spots on the graph, but the building where I live has no clearly visible resonant frequencies. For other kinds of structures and machinery (motors, bridges, etc.), though, this type of analysis can be useful.
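For readers curious about what "specgram" computes under the hood, one FFT window can be reproduced directly with NumPy (a rough sketch; the scaling will not match Matplotlib's dB output exactly):

window = df["g_y"].values[:256]                 # one NFFT-sized chunk of gyro data
spectrum = np.abs(np.fft.rfft(window - window.mean()))
freqs = np.fft.rfftfreq(len(window), d=0.0156)  # frequency bins, 0 .. ~32 Hz
print(freqs[spectrum.argmax()])                 # dominant frequency in this window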

3. Heatmap

If we want to find patterns in the vibration data, it makes sense to look at the signal amplitude on the timeline. But there are too many records, and plotting them as a single line will not be informative. In this case, a heatmap works much better.

To make the heatmap, I preprocessed the data in three steps:

  • Normalization. I subtracted the mean and took the absolute value:

df_ = df.copy()
df_["g_y_norm"] = (df_["g_y"] - df_["g_y"].mean()).abs()

  • Taking the maximum over a rolling window. This part is a bit more tricky. In the vibration data, short 1–2 s peaks can occur. Such small peaks would not be visible on a 24-hour timeline, so I decided to take a "rolling maximum" using this code:

N = 400
df_["g_y_roll"] = df_['g_y_norm'].rolling(N).max()

  • Decimation. After taking the rolling maximum, only every Nth row is needed; this step appears in the full code below.

Practically, the rolling maximum looks like this:

Data processing example, Image by author

In this example, I set the rolling window to 400 samples. Instead of a short (violet) peak, we get a much bigger spot on the heatmap.
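A toy example illustrates the effect (hypothetical data, just to show how the rolling maximum stretches a spike):

s = pd.Series([0.0]*10 + [1.0] + [0.0]*10)    # a single one-sample spike
print(s.rolling(5).max().fillna(0).tolist())  # the spike is stretched to 5 samples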

The heatmap itself can be displayed using the "heatmap" method in Seaborn. The full code with preprocessing and drawing is presented below:

import seaborn as sns

def draw_heatmap(df: pd.DataFrame):
    """ Draw a heatmap from a dataframe """
    # Normalization and applying the rolling maximum
    df_ = df.copy()
    N = 400
    df_["g_y_norm"] = (df_["g_y"] - df_["g_y"].mean()).abs()
    df_["g_y_roll"] = df_['g_y_norm'].rolling(N).max()
    df_ = df_.iloc[::N, :]  # Keep each Nth element

    # Reshape the values into a (24, items_per_hour) matrix: one row per hour
    items_all = df_["g_y_roll"].values[2:]  # skip the first elements (NaN from the rolling window)
    items_per_hour = items_all.shape[0]//24
    items_reshaped = items_all[:items_per_hour*24].reshape((24, -1))

    # Number of horizontal ticks (one label every 10 minutes)
    hor_ticks = 6

    # Draw
    fig, ax = plt.subplots(figsize=(30, 8))
    sns.heatmap(items_reshaped, vmin=0, vmax=0.08, 
                cbar_kws={"orientation": "vertical", "pad": 0.01}, ax=ax)
    ax.hlines(list(range(24)), *ax.get_xlim(), colors="#303030")
    plt.xticks(rotation=0)
    ax.set_xticks(np.linspace(0, items_per_hour, hor_ticks+1))
    ax.set_xticklabels([10*n for n in range(hor_ticks+1)])
    plt.title('MPU6050 Vibration Levels', fontsize=16)
    plt.xlabel('Minutes', fontsize=12)
    plt.yticks(rotation=0)
    plt.ylabel('Hours', fontsize=12)
    plt.show()

draw_heatmap(df)

The final image looks like this:

MPU6050 vibration data heatmap, Image by author

As we can see, getting all 24 hours of data into one image is much more informative. For example, we can easily see the traffic vibrations, which occurred in the morning around 9 a.m. and in the evening between 4 and 6 p.m.

I did not implement automatic brightness adjustment; it can be tuned manually via the vmin and vmax parameters of the sns.heatmap call. The data itself has no gaps, so handling of missing values is not implemented here.
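One simple way to automate the contrast would be to derive vmax from a percentile of the data inside draw_heatmap (a sketch, not in the code above):

# Hypothetical automatic contrast: clip the colormap at the 99.9th percentile
vmax_auto = np.nanpercentile(items_reshaped, 99.9)
sns.heatmap(items_reshaped, vmin=0, vmax=vmax_auto, ax=ax)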

4. Anomaly Detection

On the heatmap, we can see some interesting patterns, like the vibrations caused by traffic in the evening. We can also see some bright white spots, and it is interesting to find out what they are.

To detect such "anomalies", we will try two approaches. First, we can simply find the data points that exceed a threshold. Second, we can use a ready-to-use library like Python Outlier Detection (PyOD). Let's test both!

Filtering with a threshold is straightforward. As a threshold, I selected a large value (7 standard deviations), so the probability of getting such vibration levels by chance is minuscule. As for the filtering itself, Pandas already has all the needed methods:

df_ = df.copy()
df_["g_y_norm"] = (df_["g_y"] - df_["g_y"].mean()).abs()
std_y = df_["g_y_norm"].std()
threshold = 7*std_y

df_filtered = df_.index[df_['g_y_norm'] >= threshold]
print(df_filtered)
# > [2087075, 2153277, 2153981, 2798119, 2800170, 2800171,
# >  3065854, 3065855, 3065856, 3065858]

As an output, we get array indexes. But some of the items are too close; for example, indexes 3065854 and 3065855 definitely represent the same event. To filter the array, I created a helper method to remove redundant items:

from typing import List

def shrink_array(data: List[int], num_samples: int) -> List[int]:
    """ Remove too-close items from a sorted list. Example: [1, 2, 3, 10, 20] => [1, 10, 20] """
    out = data[:1]
    for val in data[1:]:
        if val > out[-1] + num_samples:
            out.append(val)
    return out

indexes = shrink_array(df_filtered.values.tolist(), num_samples=500)
print(indexes)
# > [2087075, 2153277, 2153981, 2798119, 2800170, 3065854]

Here the parameter "num_samples" works as a distance criterion: every item closer than this value to the previously kept item is dropped from the list.

Now, we can show results with Bokeh:

from bokeh.layouts import gridplot

def make_plot(df_out: pd.DataFrame, title: str):
    """ Show graph data """
    timestamps = pd.to_datetime(df_out['timestamp'].values).to_pydatetime()
    p = figure(title=title, x_axis_type='datetime', 
               x_axis_label='x', y_axis_label='y', 
               width=600, height=400)
    p.line(timestamps, df_out["g_y"].values - df_out["g_y"].mean(), 
           legend_label="G_Y", line_width=1, color="blue")
    p.xaxis.formatter=DatetimeTickFormatter(seconds="%H:%M:%S")
    p.y_range = Range1d(-1.0, 1.0)
    return p

plots = []
for ind in indexes:
    plots.append(make_plot(df_[ind - 20:ind + 100], title=f"Index={ind}"))

show(gridplot(np.array_split(plots, 2)))

The output looks like this:

Anomalies, detected by a threshold, Image by author

As the last step of this article, let's find anomalies using the Python Outlier Detection (PyOD) library. This library implements more than 40 algorithms; I will show one of them just to give readers an idea of how it works. I will use the proximity-based KNN (k-Nearest Neighbors) algorithm, which uses the distance to the kth nearest neighbor as the outlier score.
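To make the idea of the score concrete, it can be computed by hand with scikit-learn on toy data (a sketch; PyOD does this internally, with more options):

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.random.default_rng(0).normal(size=(1000, 3))  # toy 3-axis data
nn = NearestNeighbors(n_neighbors=6).fit(X)          # 5 neighbors plus the point itself
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]                                 # distance to the 5th nearest neighbor
print(scores.max())                                  # the most isolated point has the largest score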

First, we need to fit the algorithm using some data. To do this, I've used one of the indexes that were found before:

from pyod.models.knn import KNN

pos_train = 2087075
df_ = df[pos_train - 100000:pos_train + 100000]
fit_data = df_[["g_x", "g_y", "g_z"]].to_numpy()

clf = KNN(contamination=0.0001)
clf.fit(fit_data)

As we can see, there are two major differences between using PyOD and my "naive" approach. First, PyOD can analyze multivariate data, so we can use all three axes from the sensor. Second, based on our domain knowledge, we need to specify a contamination rate. I was looking for very rare and short events that can occur once every several hours, so I set this value to 0.0001.
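After fitting, the detector also exposes the training scores and the threshold it derived from the contamination rate, which is handy for a sanity check (attribute names per PyOD's BaseDetector API):

print(clf.threshold_)            # score threshold derived from the contamination rate
print(clf.decision_scores_[:5])  # outlier scores of the training samples
print(clf.labels_.sum())         # number of training samples flagged as outliers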

When the detector is trained, we can simply use a "predict" method to process another set of data and get the results:

pos_test = 2800170
df_test = df[["g_x", "g_y", "g_z"]][pos_test - 5000:pos_test + 5000]
data = df_test.to_numpy()
y_pred = clf.predict(data)  # Outlier labels (0 or 1)

To see the results in a visual form, let's draw inputs and predictions on the same graph:

# Draw
x = np.arange(0, len(y_pred))
y = y_pred
y1 = df_test["g_x"]
y2 = df_test["g_y"]
y3 = df_test["g_z"]
p = figure(title="KNN anomaly detection results", 
           x_axis_label='x', y_axis_label='y', 
           width=1600, height=500)
p.line(x, 0.04*y, legend_label="Anomaly", line_width=2, color="gray")
p.line(x, y1, legend_label="g_x", line_width=2, color="green")
p.line(x, y2, legend_label="g_y", line_width=2, color="red")
p.line(x, y3, legend_label="g_z", line_width=2, color="blue")
show(p)

Here the red, green, and blue lines represent the sensor data, and the gray line is the prediction result; its small peaks mark the indexes where outliers were detected:

Anomaly detection with PyOD, Image by author

It works. And as mentioned before, there are more than 40 algorithms available in PyOD; interested readers are welcome to test the others on their own, starting from the sketch below. If someone would like to try them on the same dataset, write in the comments below, and I will share a temporary link.
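For example, swapping in PyOD's Isolation Forest wrapper takes only a couple of changed lines (a sketch, assuming the same fit_data and data arrays as above):

from pyod.models.iforest import IForest

clf_if = IForest(contamination=0.0001, random_state=42)
clf_if.fit(fit_data)
y_pred_if = clf_if.predict(data)  # Outlier labels (0 or 1)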

Conclusion

In this article, I explained how to connect the MPU6050 MEMS sensor to a Raspberry Pi single-board computer and collect vibration data from a building over several days. We then analyzed this data in different ways: drawing the raw data on a timeline and on a heatmap, building a sonogram in the frequency domain, and applying anomaly detection algorithms. This can be interesting for research on the modern urban environment; for example, the vibrations caused by traffic were clearly visible on the graph (I was actually surprised that the MPU6050, which was designed mostly for smartphones and game controllers, can reliably detect such slight vibrations). It may even be possible to detect rare events like earthquakes, though for such events, having data from at least two places would be more reliable. Analysis of vibrations can also be useful for predicting failures in machines like motors or turbines. The number of possible applications is actually large.

In general, it was interesting to do this experiment and to work with "real" hardware and its data. Alas, I must admit that the number of articles and posts about Data Science and data analysis in hardware and IoT is minuscule, not only on TDS but also on other websites. I hope this story can change this disproportion a bit and show readers that working with hardware can be fun.

Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.
