- Anomaly Detection using Sigma Rules (Part 1): Leveraging Spark SQL Streaming – Sigma rules are used to detect anomalies in cyber security logs. We use Spark structured streaming to evaluate Sigma rules at scale. (A minimal sketch follows this list.)
- Anomaly Detection using Sigma Rules (Part 2): Spark Stream-Stream Join – A class of Sigma rules detects temporal correlations. We evaluate the scalability of Spark's stateful symmetric stream-stream join to...
- Anomaly Detection using Sigma Rules (Part 3): Temporal Correlation Using Bloom Filters – Can a custom, tailor-made stateful mapping function based on Bloom filters outperform the generic Spark stream-stream join?
- Hands-On Introduction to Delta Lake with (py)Spark – Concepts, theory, and functionalities of this modern data storage framework
- Anomaly Detection using Sigma Rules (Part 4): Flux Capacitor Design – We implement a Spark structured streaming stateful mapping function to handle temporal proximity correlations in cyber security logs
- Creating a Data Pipeline with Spark, Google Cloud Storage and BigQuery – On-premises and cloud working together to deliver a data product
- Anomaly Detection using Sigma Rules (Part 5): Flux Capacitor Optimization – To boost performance, we implement a forgetful Bloom filter and a custom Spark state store provider
- Anomaly Detection Using Sigma Rules: Build Your Own Spark Streaming Detections – Easily deploy Sigma rules in Spark streaming pipelines: a future-proof solution supporting the upcoming Sigma 2 specification
- Optimizing Output File Size in Apache Spark – A Comprehensive Guide on Managing Partitions, Repartition, and Coalesce Operations (a minimal sketch follows this list)
- Parallelising Python on Spark: Options for concurrency with Pandas – In my previous role, I spent some time working on an internal project to predict future disk storage space usage for our Managed Services customers across thousands of disks. Each disk is subject to its own usage patterns...
- 1.5 Years of Spark Knowledge in 8 Tips – My learnings from Databricks customer engagements
- Unleashing the Power of SQL Analytical Window Functions: A Deep Dive into Fusing IPv4 Blocks – How to summarize a geolocation table by merging contiguous network IPv4 blocks
- End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker – Building a Practical Data Pipeline with Kafka, Spark, Airflow, Postgres, and Docker
- Performant IPv4 Range Spark Joins – A practical guide to optimizing non-equi joins in Spark
- 4 Examples to Take Your PySpark Skills to the Next Level – Get used to large-scale data processing with PySpark
- Building a Semantic Book Search: Scale an Embedding Pipeline with Apache Spark and AWS EMR… – Using OpenAI's CLIP model to support natural language search on a collection of 70k book covers
- Delta Lake – Type widening – What is type widening and why does it matter?
- Apache Hadoop and Apache Spark for Big Data Analysis – A complete guide to big data analysis using Apache Hadoop (HDFS) and the PySpark library in Python on game reviews on the Steam gaming...
- Feature Engineering for Time-Series Using PySpark on Databricks – Discover the potential of PySpark for time-series data: ingest, extract, and visualize data, accompanied by practical implementation code
- Orchestrating a Dynamic Time-series Pipeline in Azure – Explore how to build, trigger, and parameterize a time-series data pipeline with ADF and Databricks, accompanied by a step-by-step tutorial
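As a rough companion to the Part 1 teaser above, here is a minimal, hypothetical sketch of how a Sigma-style selection can be evaluated as a Spark structured streaming filter. The input path, the event schema, and the specific rule conditions are assumptions for illustration only, not taken from the article.

```python
# Minimal sketch (assumptions: input path, schema, and rule are hypothetical)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sigma_filter_sketch").getOrCreate()

# Assumed process-creation events arriving as JSON files
events = (spark.readStream
          .schema("Image STRING, CommandLine STRING, timestamp TIMESTAMP")
          .json("/tmp/security_logs"))  # hypothetical input path

# A Sigma selection such as
#   Image|endswith: '\powershell.exe'
#   CommandLine|contains: '-enc'
# maps to an ordinary boolean column expression:
hits = events.where(
    F.col("Image").endswith("\\powershell.exe")
    & F.col("CommandLine").contains("-enc")
)

# Stream matching events to the console sink
query = (hits.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```

The point is simply that a Sigma selection reduces to a boolean column expression, so it can be applied to a streaming DataFrame like any other filter.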
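Similarly, for the output-file-size article, a minimal sketch contrasting `repartition` and `coalesce` when writing output; the toy data, target partition count of 8, and output paths are illustrative assumptions.

```python
# Minimal sketch (assumptions: toy data, target file count, output paths)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output_file_size_sketch").getOrCreate()

df = spark.range(0, 10_000_000)  # stand-in for a real table

# repartition(n) performs a full shuffle and yields n roughly equal output files
(df.repartition(8)
   .write.mode("overwrite")
   .parquet("/tmp/out_repartitioned"))

# coalesce(n) only merges existing partitions (no shuffle),
# so it is cheaper but the resulting files can be skewed in size
(df.coalesce(8)
   .write.mode("overwrite")
   .parquet("/tmp/out_coalesced"))
```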
