Combining Data With Joins
SELECT
command allows you to choose the fields that you want returned from your databaseFROM
pinpoints the table that you want to pull fromTip
Don’t forget to end your query with semicolon. 🙃
LIMIT
clause to minimize how much you query…Comments are you friends 👬
/*
This is how I query my table of desserts based on some my sophisticated criteria
which minimizes calories, prioritizes chocolate, and orders by regret after
eating.
*/
SELECT desserts
FROM things_i_am_not_supposed_to_have
WHERE (calories <= 1000) AND (has_chocolate = TRUE)
ORDER BY regret_factor ASC;
📣 Document, document, document! 📣
…please🙏, please🙏, please🙏 document.
SUM()
, AVG()
, COUNT()
)-- Returns the sum of all values in the calories variable
SELECT SUM(calories)
FROM things_i_am_not_supposed_to_have;
GROUP BY
, which allows us to break down data into smaller, more manageable groups, before aggregating these values using a function.Up to this point, we have worked with a single table.
Database tables are used to organize and group data by common characteristics or principles.
For example, in a corrections setting you may keep data on people (e.g., name, SSN, height, location) in one table and information on sentencing in another.
Often, we need to combine elements from separate tables into a single tables or queries for analysis and visualization.
A JOIN
is a means for combining columns from multiple tables by using values common to each.
You can find examples online, for example this cheat sheet.
Your tables are connected by keys 🔑.
There are 4 main types of joins…
…but we will only cover inner and left joins. 🎉
CSG Justice Center’s National Reports Database:
🎯 Our goal in this example is simple, calculate the violent crime rate per 100,000 for all states using the most up-to-date values we can get.
\[ \text{Violent Crime Rate per 100} = \frac{Number \ of \ Reported \ Violent \ Crimes}{Total \ Population} \times 100,000 \]
Easy enough 😌. Now let’s see the data 😨.
Let’s take our first look:
What do we got?
Yes, we already have a rate and the population 😂, but we will ignore these in this example.
Mini challenge⚡
(Yes, you may Google away)
YEAR
values using only SQL? Violent Crime
, YEAR
, and State
variables? You got time ⏱️, but not too much.
Let’s take our first look:
What do we got?
Mini challenge⚡
(Yes, you may Google away)
Year
values using only SQL? State
, Year
, and Total
variables. Gender
? You got time ⏱️, but not too much.
We got two queries, how do I use both of them?
Recall our friend VIEW
:
-- Create view
CREATE VIEW temp_pop AS
SELECT pop.State AS st, pop.Total AS pop_total, pop.Year AS pop_yr
FROM [Census - Population by Sex-Age-Race - 2005 to current] pop
WHERE ([Age Group] == 'Total all ages') AND (pop.Gender LIKE 'Total');
-- Select all values in view
SELECT *
FROM temp_pop;
-- Create view
CREATE VIEW temp_crime AS
SELECT State AS st, crime.[Violent Crime] AS crime, YEAR AS yr
FROM [UCR - Crime Volume & Rates - 2001 to current] crime
WHERE crime.YEAR = 2020.0;
-- Select all values in view
SELECT *
FROM temp_crime;
With our views at our commands we can more easily explore joins 🚀!
The JOIN
keyword combined with ON
is used to combine fields from separate tables.
A JOIN clause on its own will result in a cross product, where each row in the first table is paired with each row in the second table. Usually this is not what is desired when combining two tables with data that is related in some way.
For that, we need to tell the computer which columns provide the link between the two tables using the word ON. What we want is to join the data with the same species id.
What did we get? Does it look right?
JOIN
clause without any modifier defaults to INNER JOIN
, which means it will include only the rows where there is a match in both tables.The LEFT JOIN
clause tells the computer that we wish to keep all the records in the first table.
We use it to ensure that all records from the left table are included in the result set, regardless of whether there are matching records in the right table. Also, useful for finding rows in the left table that do not have corresponding rows in the right table.
Use the LEFT
keyword in addition to JOIN
:
Or:
Did it work?
Nop!😡 One to many matching.
You can add additional keys to match on using the AND
clause:
How does that look?
It worked! 💃 (but we got to do some cleaning)
First, SELECT
only the variables you want using the aliases and variable names:
Don’t forget, we are joining to compute Violent Crime Rate per 100.
Just a little more SQL:
-- Create view
CREATE VIEW temp_values AS
SELECT DISTINCT tc.st, tc.crime, tc.yr, tp.pop_total
FROM temp_crime tc
LEFT JOIN temp_pop tp
ON tp.st = tc.st AND tp.pop_yr = tc.yr;
-- Compute from view:
SELECT *, (crime / pop_total) * 100000 AS vc_rate
FROM temp_values;
Mini challenge⚡
(Yes, you may Google away)
You got time ⏱️, but not too much.
Your goal is to create a table containing the total crime and total arrest rates per 100,000 for all ages and genders in 2020. You will once again ignore the UCR state population estimates in favor of the Census estimates for 2020.
To be clear, you will need the following three tables and formulas:
UCR - Crime Volume & Rates - 2001 to current
UCR - Arrest Volume & Rates - 1999 to current
Census - Population by Sex-Age-Race - 2005 to current
\[ \text{Total Crime Rate per 100} = \frac{Number \ of \ Reported \ Violent \ Crimes}{Total \ Population} \times 100,000 \]
\[ \text{Total Arrests Rate per 100} = \frac{Number \ of \ Arrests (All \ Classes)}{Total \ Population} \times 100,000 \]
You got time ⏱️, but not too much. You may work in teams 👥. Good luck🍀!
-- Population view
CREATE VIEW POP AS
SELECT DISTINCT pop.State AS st, pop.Total AS pop_total, pop.Year AS pop_yr
FROM [Census - Population by Sex-Age-Race - 2005 to current] pop
WHERE ([Age Group] == 'Total all ages') AND (pop.Gender LIKE 'Total') AND (Year = 2020.0);
-- Crime view
CREATE VIEW CRIME AS
SELECT DISTINCT State AS st, crime.[Violent Crime] AS crime, YEAR AS yr
FROM [UCR - Crime Volume & Rates - 2001 to current] crime
WHERE crime.YEAR = 2020.0;
-- Arrests view
CREATE VIEW ARRESTS AS
SELECT DISTINCT arrests.State AS st, arrests.Year AS arr_yr, arrests.[Total all classes] AS arrests
FROM [UCR - Arrest Volume & Rates - 1999 to current] arrests
WHERE (arrests.Year = 2020.0) AND (arrests.[Age group] LIKE 'Total all ages');
-- Join and clean
SELECT tp.st AS state, tp.pop_yr AS year, tp.pop_total AS population, tc.crime, arr.arrests,
(tc.crime / tp.pop_total) * 100000 AS vc_rate,
(arr.arrests / tp.pop_total) * 100000 AS arr_rate
FROM POP tp
LEFT JOIN CRIME tc
ON tp.st = tc.st AND tp.pop_yr = tc.yr
LEFT JOIN ARRESTS arr
ON tp.st = arr.st AND tp.pop_yr = arr.arr_yr;
CREATE VIEW
and AS
to generate custom views of your data prior to joining them.JOIN
clause to combine data from two tables, the ON
keyword specifies which columns link the tables.JOIN
returns only matching rows. Other join clauses provide different behavior, e.g., LEFT JOIN
retains all rows of the table on the left side of the clause.This presentation was largely adapted from the wonderful course “Data Management with SQL for Ecologist” by datacarpentry.org and Cathy Tanimura’s “SQL for Data Analysis” book.
For Internal Use Only