Introduction to SQL

Combining Data With Joins

Chris Callaghan

📖 Rules of engagement

  • The content in this presentation is not for distribution as content has not been vetted by our benevolent overloads.
  • The purpose of this document is to guide discussion, elicit feedback, and show you new ideas. The more you lean in, the more you win!
  • As a follow up on the last point, you can and should ask questions. The content should be a conversation, not a monologue.
  • Learning is a continuous journey, no one person’s journey is linear. Be kind 🤗. Be patient 🧘. Be inquisitive 🤔.
  • This is an opinionated guide, there are multiple ways to fry an egg 🍳.

Objectives 🎯️

  • Remind you of prior concepts and get you coding.
  • Employ joins to combine data from two tables.
  • Employ aliases to assign new names to tables and columns in a query.

First, a bit of a reminder

(Making sure you didn’t forget!)

🧩 The Basics

  • Using the SELECT command allows you to choose the fields that you want returned from your database
  • FROM pinpoints the table that you want to pull from
SELECT desserts
FROM things_i_am_not_supposed_to_have; 

Tip

Don’t forget to end your query with semicolon. 🙃

  • Databases can also be filtered using WHERE to only return certain criteria
SELECT desserts
FROM things_i_am_not_supposed_to_have
WHERE (calories <= 1000) AND (has_chocolate = TRUE); --Look: Individual clauses!

Be careful ⚠️

  • When exploring your data, use the LIMIT clause to minimize how much you query…
SELECT desserts
FROM things_i_am_not_supposed_to_have
LIMIT 1000;

❗❗❗Above all…

Comments are you friends 👬

  • Single-line friends🧍:
-- Selecting a balanced meal
SELECT desserts
FROM things_i_am_not_supposed_to_have
WHERE (calories <= 1000) AND (has_chocolate = TRUE);
  • Multi-line fancy friends 🤵🤵:
/*
This is how I query my table of desserts based on some my sophisticated criteria 
which minimizes calories, prioritizes chocolate, and orders by regret after 
eating.
*/
SELECT desserts
FROM things_i_am_not_supposed_to_have
WHERE (calories <= 1000) AND (has_chocolate = TRUE)
ORDER BY regret_factor ASC;

📣 Document, document, document! 📣

…please🙏, please🙏, please🙏 document.

More than Basics 🧩🧩

  • We can group records using SQL functions (e.g., SUM(), AVG(), COUNT())
-- Returns the count of rows
SELECT COUNT(*)
FROM things_i_am_not_supposed_to_have;

-- Returns the sum of all values in the calories variable
SELECT SUM(calories)
FROM things_i_am_not_supposed_to_have;
  • We can level up our analysis using GROUP BY, which allows us to break down data into smaller, more manageable groups, before aggregating these values using a function.
-- Returns a sum of caloric values for things with and without chocolate
SELECT SUM(calories)
FROM things_i_am_not_supposed_to_have
GROUP BY has_chocolate;
  • Aliases will make your code prettier! Much, much prettier.💅
SELECT SUM(things.calories) AS caloric_value
FROM things_i_am_not_supposed_to_have things
GROUP BY things.has_chocolate;

Back to Regular Programming 📺

🌱 Data Joins 101

Up to this point, we have worked with a single table.

Database tables are used to organize and group data by common characteristics or principles.

For example, in a corrections setting you may keep data on people (e.g., name, SSN, height, location) in one table and information on sentencing in another.

Often, we need to combine elements from separate tables into a single tables or queries for analysis and visualization.

A JOIN is a means for combining columns from multiple tables by using values common to each.

You can find examples online, for example this cheat sheet.

Recall 🖼️

Your tables are connected by keys 🔑.

🔗 Types of Joins

There are 4 main types of joins…

…but we will only cover inner and left joins. 🎉

A More “Interesting” Example 😂

🏁 Start your Engines

Meet the Data 👋

CSG Justice Center’s National Reports Database:

  • Repository of data from commonly used data sets, including UCR, NIBRS, BJS, ACS, among others
  • Updated regularly as data become available
  • We will use data from the Uniform Crime Reporting Program and from the US Census; namely, tables on crime and population.

🎯 Our goal in this example is simple, calculate the violent crime rate per 100,000 for all states using the most up-to-date values we can get.

\[ \text{Violent Crime Rate per 100} = \frac{Number \ of \ Reported \ Violent \ Crimes}{Total \ Population} \times 100,000 \]

Easy enough 😌. Now let’s see the data 😨.

👮 Crime Data

Let’s take our first look:

SELECT * 
FROM [UCR - Crime Volume & Rates - 2001 to current]
LIMIT 100;

What do we got?

Yes, we already have a rate and the population 😂, but we will ignore these in this example.

Mini challenge

(Yes, you may Google away)

  • Can you get the min and max YEAR values using only SQL?
  • Can you isolate the records for the most recent year and only include in your return the Violent Crime, YEAR, and State variables?

You got time ⏱️, but not too much.

Population Data 👥

Let’s take our first look:

SELECT * 
FROM [Census - Population by Sex-Age-Race - 2005 to current]
LIMIT 100;

What do we got?

Mini challenge

(Yes, you may Google away)

  • Can you get the min and max Year values using only SQL?
  • Can you isolate the records for all years but only those representing the total for all ages? Also, only return the State, Year, and Total variables.
  • Can you refine the prior query to only include “Total” observations for Gender?

You got time ⏱️, but not too much.

🔪Join Ingredient Prep

We got two queries, how do I use both of them?

Recall our friend VIEW:

-- Create view
CREATE VIEW temp_pop AS
SELECT pop.State AS st, pop.Total AS pop_total, pop.Year AS pop_yr
FROM [Census - Population by Sex-Age-Race - 2005 to current] pop
WHERE ([Age Group] == 'Total all ages') AND (pop.Gender LIKE 'Total');

-- Select all values in view
SELECT * 
FROM temp_pop;

-- Create view
CREATE VIEW temp_crime AS
SELECT State AS st, crime.[Violent Crime] AS crime, YEAR AS yr
FROM [UCR - Crime Volume & Rates - 2001 to current] crime 
WHERE crime.YEAR = 2020.0;

-- Select all values in view
SELECT * 
FROM temp_crime;

With our views at our commands we can more easily explore joins 🚀!

Join Cooking Time 🍳

The JOIN keyword combined with ON is used to combine fields from separate tables.

A JOIN clause on its own will result in a cross product, where each row in the first table is paired with each row in the second table. Usually this is not what is desired when combining two tables with data that is related in some way.

For that, we need to tell the computer which columns provide the link between the two tables using the word ON. What we want is to join the data with the same species id.

SELECT *
FROM temp_pop tp
JOIN temp_crime tc
ON tp.st = tc.st;

What did we get? Does it look right?

  • It does “join” things, but incorrectly… Notice repeats and mismatching years.

🔗 Inner Join

  • The JOIN clause without any modifier defaults to INNER JOIN, which means it will include only the rows where there is a match in both tables.
  • Rows without matching values in the other table will not be included in the result set.
  • This is useful when you want to find intersections between two tables… Not here. 😭 But why? One to many matching.

⬅️ Left Join

The LEFT JOIN clause tells the computer that we wish to keep all the records in the first table.

We use it to ensure that all records from the left table are included in the result set, regardless of whether there are matching records in the right table. Also, useful for finding rows in the left table that do not have corresponding rows in the right table.

Left Join in Action 🔗

Use the LEFT keyword in addition to JOIN:

SELECT *
FROM temp_pop tp
LEFT JOIN temp_crime tc
ON tp.st = tc.st;

Or:

SELECT *
FROM temp_crime tc
LEFT JOIN temp_pop tp
ON tp.st = tc.st;

Did it work?

Nop!😡 One to many matching.

Match on Multiple Keys 🔑🔑🔑➕

You can add additional keys to match on using the AND clause:

SELECT *
FROM temp_pop tp
LEFT JOIN temp_crime tc
ON tp.st = tc.st AND tp.pop_yr = tc.yr;

How does that look?

It worked! 💃 (but we got to do some cleaning)

  • Why would we do this?
    • Composite keys: You got a primary key composed of two columns, so you need all parts of the key to ensure accurate matching.
    • Additional filtering criteria: Sometimes, a single key might match multiple rows, and additional keys help filter the results more precisely.

🧹 Clean Up

First, SELECT only the variables you want using the aliases and variable names:

SELECT tc.st, tc.crime, tc.yr, tp.pop_total
FROM temp_crime tc
LEFT JOIN temp_pop tp 
ON tp.st = tc.st AND tp.pop_yr = tc.yr;

Now, remove duplicates using DISTINCT:

SELECT DISTINCT tc.st, tc.crime, tc.yr, tp.pop_total
FROM temp_crime tc
LEFT JOIN temp_pop tp 
ON tp.st = tc.st AND tp.pop_yr = tc.yr;

Voila! 🎉🎉🎉

Compute the Rate 🔢

Don’t forget, we are joining to compute Violent Crime Rate per 100.

Just a little more SQL:

-- Create view
CREATE VIEW temp_values AS
SELECT DISTINCT tc.st, tc.crime, tc.yr, tp.pop_total
FROM temp_crime tc
LEFT JOIN temp_pop tp 
ON tp.st = tc.st AND tp.pop_yr = tc.yr;

-- Compute from view:
SELECT *, (crime / pop_total) * 100000 AS vc_rate
FROM temp_values;

Mini challenge

(Yes, you may Google away)

  • Using a SQL to order your violent crime rate calculations, what state had the highest rate in 2021?

You got time ⏱️, but not too much.

Challenge⚡

Your goal is to create a table containing the total crime and total arrest rates per 100,000 for all ages and genders in 2020. You will once again ignore the UCR state population estimates in favor of the Census estimates for 2020.

To be clear, you will need the following three tables and formulas:

  • UCR - Crime Volume & Rates - 2001 to current
  • UCR - Arrest Volume & Rates - 1999 to current
  • Census - Population by Sex-Age-Race - 2005 to current

\[ \text{Total Crime Rate per 100} = \frac{Number \ of \ Reported \ Violent \ Crimes}{Total \ Population} \times 100,000 \]

\[ \text{Total Arrests Rate per 100} = \frac{Number \ of \ Arrests (All \ Classes)}{Total \ Population} \times 100,000 \]

You got time ⏱️, but not too much. You may work in teams 👥. Good luck🍀!

🙋 My Answer

-- Population view
CREATE VIEW POP AS
SELECT DISTINCT pop.State AS st, pop.Total AS pop_total, pop.Year AS pop_yr
FROM [Census - Population by Sex-Age-Race - 2005 to current] pop
WHERE ([Age Group] == 'Total all ages') AND (pop.Gender LIKE 'Total') AND (Year = 2020.0);

-- Crime view
CREATE VIEW CRIME AS
SELECT DISTINCT State AS st, crime.[Violent Crime] AS crime, YEAR AS yr
FROM [UCR - Crime Volume & Rates - 2001 to current] crime 
WHERE crime.YEAR = 2020.0;

-- Arrests view
CREATE VIEW ARRESTS AS
SELECT DISTINCT arrests.State AS st, arrests.Year AS arr_yr, arrests.[Total all classes] AS arrests
FROM [UCR - Arrest Volume & Rates - 1999 to current] arrests
WHERE (arrests.Year = 2020.0) AND (arrests.[Age group] LIKE 'Total all ages');

-- Join and clean
SELECT tp.st AS state, tp.pop_yr AS year, tp.pop_total AS population, tc.crime, arr.arrests,
    (tc.crime / tp.pop_total) * 100000 AS vc_rate, 
    (arr.arrests / tp.pop_total) * 100000 AS arr_rate
FROM POP tp
LEFT JOIN CRIME tc
ON tp.st = tc.st AND tp.pop_yr = tc.yr
LEFT JOIN ARRESTS arr
ON tp.st = arr.st AND tp.pop_yr = arr.arr_yr;

That is a wrap folks! 🎉

🔑 Key Take Aways

  • Layout your “ingredients” before jumping into a join.
  • Remember that you can use CREATE VIEW and AS to generate custom views of your data prior to joining them.
  • Use a JOIN clause to combine data from two tables, the ON keyword specifies which columns link the tables.
  • Regular JOIN returns only matching rows. Other join clauses provide different behavior, e.g., LEFT JOIN retains all rows of the table on the left side of the clause.

❓Questions?

Resources 🔗

Credits 🙏



This presentation was largely adapted from the wonderful course “Data Management with SQL for Ecologist” by datacarpentry.org and Cathy Tanimura’s “SQL for Data Analysis” book.