Shaping Your Data with SQL

Author:Murphy | View: 25458 | Time: 2025-03-23 18:54:20

Why is Data Shaping?

There is no one-size-fits-all data. For different purposes and used cases, data is customized accordingly. The more you are aware of the intended use of data afterward, the more equipped you will be to correctly present your data to the end users.

For example, the data utilized to undertake in-depth analysis differ from the aggregated data provided to upper management.

Another example, the business development team is more concerned with the overall cost per area to convert new users than the marketing managers are with affiliate marketing costs, which they want to look at specifically by region.

That being said, transforming existing data structures into any alternative pivoted or unpivoted structure is an indispensable step in any data manipulation and analytical process.

In this article, I will introduce a few techniques to shape and slice your data in some specific cases. Usually, I will present my examples with PostgreSQL.

Now, let's start to see what we've got!

Data

I will use the data from World Happiness Report 2015–2021 in this article. This dataset provides the happiness levels of countries around the world based on different indicators: economic growth, social support, life expectancy at birth, etc. The data is available on Kaggle with the CC0: Public Domain license. As seen in the image below, I will just utilize a few fields for this post, which are:

Country name
Year: Report year (2005–2021)
Life ladder: The best potential life for each respondent is represented by a 10 on a ladder, while the worst possible life is represented by a 0. Each participant is then asked to rank their current life on that ladder.
Log GDP per capita: Log of the GDP per person in terms of purchasing power parity (PPP) adjusted dollars.
Social support: The perception of social support (being able to rely on others) in the nation.
Healthy life expectancy at birth: The average lifespan of a country's citizens at a certain period.

Using Window Functions

Rolling Calculation with PRECEDING AND CURRENT ROW

Example: Showing the **** rolling 3-year average life ladder index for each country

What is the three-year rolling average? Simply, it calculates the average life ladder score for the last two years plus the current year. If the current year is 2010, for instance, the three-year rolling average of the life ladder score of each country will be equal to the average scores of that country in 2008, 2009, and 2010. As you can see in the picture below, the ‘rolling_average' in 2010 of Afghanistan is 4.29, and this value is the mean of 3.72, 4.40, and 4.76.

To specify the range of rows to be considered in the calculation, window functions in SQL are used with PRECEDING and CURRENT ROW. To be specific, PRECEDING determines the number of rows that are before the CURRENT ROW.So, in the conjunction with PARTITION BY the country and ORDER BY year, the SQL command below returns the rolling 3-year average life ladder score for each country in the ‘rolling_average' column.

SELECT year, country_name, life_ladder,
  AVG(life_ladder) OVER (
    PARTITION BY country_name
    ORDER BY year
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS rolling_average
FROM public.happiness_index;

Over The Period Calculation with UNBOUNDED PRECEDING AND CURRENT ROW

Example: Showing the **** over-period average life ladder score for each country

For this calculation to happen, we use UNBOUNDED PRECEDING AND CURRENT ROW. The calculation window will include the current value and all the rows up to the current rows.

For instance, if the current year is 2009, a country's average score over the period will be equal to the average of its scores in 2008 and 2009. Likewise, in 2010 the average life ladder score will be determined by dividing the scores from 2008, 2009, and 2010 by 3. You can see the command below for more reference.

SELECT year, country_name, life_ladder,
AVG(life_ladder) OVER (
    PARTITION BY country_name
    ORDER BY year
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS rolling_average
FROM public.happiness_index

Percentage Calculation with UNBOUNDED PRECEDING AND CURRENT ROW

Example: Calculate the percentage change of life ladder point in comparison with the cumulative average.

Similar to the previous example, which used UNBOUNDED PRECEDING AND CURRENT ROW, but instead of computing the average value across time alone, this example is concerned with the percentage change between the current values and the period average. In this case, the results are stored in the fourth column as in the picture below. You can easily observe which year actually has a positive/negative change compared to the rolling average. Also, this indicator tells us about the magnitude of the change in the target value.

SELECT 
  year, 
  country_name, 
  life_ladder, 
  100 * (life_ladder - AVG(life_ladder) OVER (PARTITION BY country_name ORDER BY year ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) / AVG(life_ladder) OVER (PARTITION BY country_name ORDER BY year ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS percentage_change
FROM 
  public.happiness_index

Actually, in time series analysis, this method is one of the important approaches for identifying data trends or patterns across time. For example, it is most frequently used in sales analysis, where sales growth or market share are of primary concern.

Percentile Calculation with PERCENTILE_RANK()

Example: Identifying the percentile of the GDP per capita of each country in the region

In this case, I'm interested in knowing which countries have the better GDP per capita in their respective regions in 2008. The task is simply done with the support of PERCENTILE_RANK() the function.

SELECT country_name, regional_indicator, log_gdppercapita,
  round((PERCENT_RANK() OVER (
    PARTITION BY regional_indicator
    ORDER BY log_gdppercapita
  ))::numeric,2) AS percentile_rank
FROM public.happiness_index
where year = 2008;

So, as you may observe, by PARTITION BY the region and ORDER BY the GDP per capita, the command put the countries into different percentile categories based on their rank of GDP per capita. For instance, according to data, Latvia's GDP per capita is higher than 33% of countries in Central and Eastern Europe.

Using CASE WHEN in conjunction with aggregate function to transform data

Example: Displaying the average life ladder score for each year in each continent

In this case, I will categorize the countries into 4 big different regions such as Asia, Europe, Africa, and America based on the ‘regional_indicator' field. We can get the average value of the life ladder corresponding to each location by first identifying each area using CASE WHEN.

SELECT year, 
ROUND(AVG(CASE WHEN regional_indicator LIKE '%Asia%' THEN life_ladder 
  ELSE null END)::numeric,2) AS Asia, 
ROUND(AVG(CASE WHEN regional_indicator LIKE '%Europe%' THEN life_ladder 
  ELSE null END)::numeric,2) AS Europe, 
round(AVG(CASE WHEN regional_indicator LIKE '%Africa%' THEN life_ladder 
  ELSE null END)::numeric,2) AS Africa,
round(AVG(CASE WHEN regional_indicator LIKE '%America%' THEN life_ladder 
  ELSE null END)::numeric,2) AS America
FROM public.happiness_index
GROUP BY 1
ORDER BY 1

You can see that the life ladder scores of 4 different regions saved in rows have been transformed into 4 distinct columns. This data transformation makes it simple for analysts to monitor the variations in values across different locations within a given year. It's also useful to track how an area's data changes over time in a certain period.

Unpivoting data with UNION ALL

Example: unpivoting the table in the previous data set

How do we make the table in the previous example above into the table below?

Analytics requires flexible techniques of data transformation since it enables you to see your data in any dimension and gain more insightful information. As you see, the preceding example demonstrates how the result data table gives us insights both when we view the data vertically and when we view it horizontally.

In this case, I will present a simple transformation of how to unpivot data to its original state with UNION ALL. One thing to note when using UNION ALL is that the number of columns and the data types in each component used for the union must be compatible.

WITH tbl AS 
(SELECT year, 
ROUND(AVG(CASE WHEN regional_indicator LIKE '%Asia%' THEN life_ladder 
  ELSE null END)::numeric,2) AS Asia, 
ROUND(AVG(CASE WHEN regional_indicator LIKE '%Europe%' THEN life_ladder 
  ELSE null END)::numeric,2) AS Europe, 
round(AVG(CASE WHEN regional_indicator LIKE '%Africa%' THEN life_ladder 
  ELSE null END)::numeric,2) AS Africa,
round(AVG(CASE WHEN regional_indicator LIKE '%America%' THEN life_ladder 
  ELSE null END)::numeric,2) AS America
FROM public.happiness_index
GROUP BY 1
ORDER BY 1)

SELECT year, 
'Asia' AS region
, asia AS avg_life_ladder
FROM tbl
    UNION ALL
SELECT year, 
'Europe' AS region
, europe AS avg_life_ladder
FROM tbl
    UNION ALL
SELECT year, 
'Africa' AS region
,africa AS avg_life_ladder
FROM tbl
    UNION ALL
SELECT year, 
'America' AS region
, america AS avg_life_ladder
FROM tbl
;

Pivoting data with UNPIVOTING and PIVOTING functions

UNNEST function to unpivot data

Example: the **** same requirement as the previous case

Knowing more than one way of playing with Data to produce the same result is essential since it enables greater active control over the data. That being said, besides UNION ALL , UNNEST() is another function to unpivot data. The use of UNNEST() is help to convert an array column into different rows.

WITH tbl AS 
(SELECT year, 
ROUND(AVG(CASE WHEN regional_indicator LIKE '%Asia%' THEN life_ladder 
  ELSE null END)::numeric,2) AS Asia, 
ROUND(AVG(CASE WHEN regional_indicator LIKE '%Europe%' THEN life_ladder 
  ELSE null END)::numeric,2) AS Europe, 
round(AVG(CASE WHEN regional_indicator LIKE '%Africa%' THEN life_ladder 
  ELSE null END)::numeric,2) AS Africa,
round(AVG(CASE WHEN regional_indicator LIKE '%America%' THEN life_ladder 
  ELSE null END)::numeric,2) AS America
FROM public.happiness_index
GROUP BY 1
ORDER BY 1)

SELECT 
year, 
UNNEST(ARRAY['Asia', 'Europe', 'Africa', 'Ameria']) AS region,
UNNEST(ARRAY[asia, europe, africa, america]) AS life_ladder
FROM tbl
;

In our example, the array column of [‘Asia', ‘Europe', ‘Africa', ‘Ameria'] is now transformed back to the row values after unnesting.

CROSSTAB function to pivot data

Example: the same requirement with CASE WHEN example: Displaying the average life ladder score for each year in each continent

COSSTAB() is an intelligent way of pivoting data, transforming, and summarizing data in a matrix format. In this case, I will use this function to make the row values of Asia, Europe, Africa, America, and Commonwealth_of_Independent_States into different columns.

Sql">SELECT * FROM crosstab(
  'SELECT year, region, round(AVG(life_ladder)::NUMERIC,2)::FLOAT as life_ladder
   FROM (SELECT *, 
CASE WHEN regional_indicator LIKE ''%Asia%'' THEN ''Asia''  
     WHEN regional_indicator LIKE ''%Europe%'' THEN ''Europe''  
     WHEN regional_indicator LIKE ''%Africa%'' THEN ''Africa'' 
     WHEN regional_indicator LIKE ''%America%'' THEN ''America'' 
  ELSE ''Commonwealth_of_Independent_States'' END AS region
FROM public.happiness_index) a
  GROUP BY 1,2
   ORDER BY 1, 2') AS region_life_ladder (year int, Asia FLOAT, Europe FLOAT, Africa FLOAT, America FLOAT, Commonwealth_of_Independent_States FLOAT)