Aggregating and Grouping Data
SELECT
command allows you to choose the fields that you want returned from your databaseFROM
pinpoints the table that you want to pull fromTip
Don’t forget to end your query with semicolon. 🙃
WHERE
to only return certain criteriaSELECT desserts
FROM things_in_not_supposed_to_have
WHERE (calories <= 1000) AND (has_chocolate = TRUE);
ORDER BY
/*
This is how I query my table of desserts based on some my sophisticated criteria
which minimizes calories, prioritizes chocolate, and orders by regret after
eating.
*/
SELECT desserts
FROM things_in_not_supposed_to_have
WHERE (calories <= 1000) AND (has_chocolate = TRUE)
ORDER BY regret_factor ASC;
CSG Justice Center’s National Reports Database:
Let’s take a look:
Important
Tables with spaces in the name should be wrapped in [
and ]
.
COUNT()
: returns the number of rows that matches a specified criterion.AVG()
: returns the average value of a numeric column.SUM()
: returns the total sum of a numeric column.Vera - Prison Numbers by State-County - 1990 to current
table:Write a query that the population for the prison population for the state of California in 2015?
You got time ⏱️, but not too much.
GROUP BY
clause in SQL is used to arrange identical data into group with the help of aggregate functions (e.g., SUM()
).So what is happening under-the-hood? Split-apply-combine.
GROUP BY
splits data into groups based on criteria.SUM
) is applied to each group independently.Write a single or multiple query(ies) that return(s):
You got time ⏱️, but not too much.
As your queries get more complex, and your calculations more verbose, the returned variable names will likely get long and unwieldy. To help make things clearer, we can use aliases to assign new names to returned values in the queries.
To do so, use AS
:
Tip
Technically, AS
isn’t required. So you could do:
SELECT State, Year,
SUM(Prison_Population_Total) Yearly_Prison_Population_Sum,
ROUND(AVG(Prison_Population_Total), 2) Yearly_Prison_Population_Avg,
FROM [Vera - Prison Numbers by State-County - 1990 to current]
WHERE State == "California"
GROUP BY Year, State;
However, I ask that you include it to improve clarity.
HAVING
isn’t WHERE
Previously, we discussed WHERE
, which allows you to filter the results of a query based on some criteria.
SQL offers another tool for filtering data based on aggregate functions, through the HAVING
keyword. For example, suppose that you only want to return observations generated with inputs for all 58 counties in California:
SELECT State, Year,
SUM(Prison_Population_Total) AS Yearly_Prison_Population_Sum,
ROUND(AVG(Prison_Population_Total), 2) AS Yearly_Prison_Population_Avg,
COUNT(Prison_Population_Total) AS Yearly_State_County_N
FROM [Vera - Prison Numbers by State-County - 1990 to current]
WHERE State == "California"
GROUP BY Year, State
HAVING Yearly_State_County_N = 58;
Write a query that returns observations for the CA yearly prison population, where that yearly number is below 100,000 and that number was generated from observations from all 58 counties.
You got time ⏱️, but not too much.
My answer is:
SELECT State, Year,
SUM(Prison_Population_Total) AS Yearly_State_Population_Sum,
-- ROUND(AVG(Prison_Population_Total), 2) AS Yearly_State_Population_Avg,
COUNT(Prison_Population_Total) AS Yearly_State_County_N
FROM [Vera - Prison Numbers by State-County - 1990 to current]
WHERE State == "California"
GROUP BY Year, State
HAVING Yearly_State_Population_Sum < 100000 AND
Yearly_State_County_N = 58;
What is yours?
Now that you are an SQL ninja 🥷, it’s time that we tackle how to handle repetitive operations.
“With great power comes great responsibility” – Ben Parker
Your new goal is to live by the DRY principle, or “Don’t Repeat Yourself”. This principle is foundation in software development but can be useful in building queries in SQL. The idea here is to reduce the repetition of code patterns and ensuring maintainability (code is in less places), readability (more compact code), and reusability (modular data).
One way to minimize repetition is by creating and using views. Views are are a form of query that can be saved in the database, and can be used at a later time, filtered, or even updated.
Suppose that your project requires reporting on California periodically. That query would look like:
SELECT *
FROM [Vera - Prison Numbers by State-County - 1990 to current]
WHERE State == "California",
But if we don’t want to type that every time (think DRY 🥷), you can create a view:
GROUP BY
keyword to aggregate data.MIN
, MAX
, AVG
, SUM
, COUNT
, etc. operate on aggregated data.AS
keyword when creating aliases.HAVING
keyword to filter on aggregate properties.VIEW
to access the result of a query as though it was a new table.This presentation was largely adapted from the wonderful course “Data Management with SQL for Ecologist” by datacarpentry.org and Cathy Tanimura’s “SQL for Data Analysis” book.
For Internal Use Only