Introduction to SQL

Accessing Data with Queries

Chris Callaghan

📖 Rules of engagement

The content in this presentation is not for distribution as content has not been vetted by our benevolent overloads.
The purpose of this document is to guide discussion, elicit feedback, and show you new ideas. The more you lean in, the more you win!
As a follow up on the last point, you can and should ask questions. The content should be a conversation, not a monologue.
You will need to have installed DB Browser for SQLite, link to download here.
Learning is a continuous journey, no one person’s journey is linear. Be kind 🤗. Be patient 🧘. Be inquisitive 🤔.

Objectives 🎯️

Write and build queries.
Document your workflow.
Filter data given criteria.
Sort the results of a query.

First, a bit of a reminder

(And some added concepts)

🧩 The Basics

What is SQL?

“SQL is the language used to communicate with databases…SQL without a database and data in tables is just a text file” (Tanimura, 2021).

“The middle man” (Wright, 2024) 😉

So if SQL is nothing without a database, what is an SQL database?

SQL databases, also known as relational databases, are systems that store collections of tables and organize structured sets of data in a tabular columns-and-rows format, similar to that of a spreadsheet.

Paint me a Picture 🎨

📜 Some Database Vocabulary

SQL is used to access, manipulate, and retrieve data from objects in a database.
Databases can have one or more schemas, which provide the organization and structure and contain other objects.
Within schema, the objects most commonly used in data analysis are tables, views, and functions.
Tables contain fields, which hold the data. Tables may have one or more indexes, which allow for data to be retrieved more effciently.
Views are stored queries that can be referenced easily.
Functions allow commonly used sets of calculations to be referenced easily in queries.

Back to Regular Programming 📺

Start your Engines 🏁

👋 Meet the Data

We will be using the U.S. Census Bureau’s Small Area Income and Poverty Estimates (SAIPE) dataset.

It provides annual estimates of income and poverty statistics for all school districts, counties, and states.

It’s been documented by our RCA in CA 🔥!

Yes, it is only one table… we will start small.

🛑 Load the data before moving forward.🛑

Writing my First Query 🌱

Let’s start by using the saipe table.

If we want to select all the columns in a given table, we use the SELECT and FROM clauses in conjunction with the * wildcard.

SELECT * 
FROM saipe;

Tip

SQL is case insensitive, but for readability and style we capitalize the keywords.
Also, some databases require a semicolon a the end of your SQL statement, so don’t forget it!

If we want a single column, we can type the column name instead of the wildcard.

SELECT state_fips_code 
FROM saipe;

If we want more columns, add more column names after SELECT.

SELECT state_fips_code, county_fips_code, name
FROM saipe;

✍️ Comment, comment, comment

One day, you will get up from your desk and (hopefully) go outside 🌄. When you come back, you may have forgotten your trail of thought…

This is how mistakes are made 😭.

We don’t like mistakes 😠.

Mistake are inevitable, but we can work to minimize them 🤞.

Commenting your code is a valuable practice, whether working alone or as part of a team, because it:

Provides clarity and understanding

Simplifies onboarding and collaboration

Is a best practice and standard in many organizations

Is one way to document your work

Helps you, help you in the future explain your past decisions

How do I Comment? 🤔

It’s super easy! In any SQL script, you can:

Add single comments, which start with --

-- Look ma, no hands!
SELECT state_fips_code, county_fips_code, name
FROM saipe; -- Such cool

Comment out lines of code:

-- SELECT * 
SELECT state_fips_code, county_fips_code, name
FROM saipe;

Add multiline comments using /* and */:

/* Such interesting 
multiline comments */
SELECT state_fips_code, county_fips_code, name
FROM saipe;

Mix and match 🤯!

Now that you know, let’s go back and comment our code.

🤏 Limiting Results

Sometimes you don’t want to see all the results. For example, if you are exploring a database and you just want to see the top n rows to acquaint yourself with a table. This is a great practice when working with large databases.

In such cases, use the LIMIT clause.

SELECT *
FROM saipe
LIMIT 10;

Note

Thanks to a common set of commands, SQL code written for any database will look familiar. However, variations in SQL from database to database exist and are often termed dialects. This means that some commands might be different in Oracle SQL, MSSQL, or other dialects. For example, in MSSQL we’d write the prior command as:

SELECT TOP 10 *
FROM saipe;

Unique Values 🦄

If we want only the unique values so that we can quickly see what states have been sampled we use DISTINCT.

SELECT DISTINCT state_fips_code
FROM saipe;

If we select more than one column, then the distinct pairs of values are returned:

SELECT DISTINCT state_fips_code, county_fips_code
FROM saipe;

➕ Calculated Values

We can also do calculations with the values in a query.

For example, suppose that we want calculate the estimate number of households in poverty without small children (ages 0-4):

SELECT poverty_estimate_all_ages, poverty_estimate_age_0_4, poverty_estimate_all_ages - poverty_estimate_age_0_4
FROM saipe;

When we run the query, the expression poverty_estimate_all_ages - poverty_estimate_age_0_4 is evaluated for each row and appended to that row, in a new column with our resulting values.

Expressions can use any fields, any arithmetic operators (+, -, *, and /) and a variety of built-in functions. For example:

SELECT poverty_percent_all_ages, ROUND(poverty_percent_all_ages, 0)
FROM saipe;

Challenge⚡

Write a query that returns the postal code, county FIPS, name, and difference between estimates for all ages and for people ages 0 to 17.

You got time ⏱️, but not too much.

My answer is:

SELECT postal_code, county_fips_code, name, poverty_estimate_all_ages - poverty_estimate_age_0_17
FROM saipe;

What is yours?

✂️ Filtering

Databases can also filter data – selecting only the data meeting certain criteria.

For example, let’s say we only want data from the state of California.

We can add the WHERE clause to our query:

SELECT *
FROM saipe
WHERE postal_code='CA';

We can also filter numbers. For example, assume that we only want observations with poverty estimates for all ages greater than 1000:

SELECT *
FROM saipe
WHERE poverty_estimate_all_ages >= 10000;

We can use more sophisticated conditions by combining tests with AND and OR.

SELECT *
FROM saipe
-- What does this say?
WHERE (poverty_estimate_all_ages >= 10000) AND (postal_code !='CA');

Challenge⚡

Write a query that returns the state, county FIPS, name, and the estimated poverty poverty for ages. Filter these data to only include observations for California counties (excluding other counties, states, and national observations).

My answer is:

-- Select the required variables
SELECT postal_code, county_fips_code, name, poverty_estimate_all_ages
FROM saipe
-- The state county fips codes are '000', so we combine filters
WHERE (postal_code = 'CA') AND (county_fips_code != '000');

What is yours?

⬆️ Sorting

We can also sort the results of our queries by using ORDER BY.

Go back to basic query and sort by poverty estimates for all ages:

SELECT *
FROM saipe
ORDER BY poverty_estimate_all_ages ASC;

The keyword ASC tells us to order it in ascending order.

We could alternately use DESC to get descending order.

SELECT *
FROM saipe
ORDER BY poverty_estimate_all_ages DESC;

We can also sort on several fields at once.

SELECT *
FROM saipe
ORDER BY name DESC, poverty_estimate_all_ages DESC;

Challenge⚡

Write a query that returns the state, county FIPS, name, the estimated poverty poverty for ages, and the estimated poverty for ages 0-17. Sort with the largest estimates at the top.

You got time ⏱️, but not too much.

My answer is:

-- Select the required variables
SELECT postal_code, county_fips_code, name, poverty_estimate_all_ages, poverty_estimate_age_0_17
FROM saipe
-- Order by descending values
ORDER BY poverty_estimate_all_ages DESC, poverty_estimate_age_0_17 DESC;

What is yours?

🪖 Order of Execution

Another note for ordering. We don’t actually have to display a column to sort by it.

For example, let’s say we want to sort counties in California by their estimate poverty for all ages, but we only want to see the county name.

SELECT name
FROM saipe
WHERE (postal_code = 'CA') AND (county_fips_code != '000')
ORDER BY poverty_estimate_all_ages DESC;

We can do this because sorting occurs earlier in the computational pipeline than field selection. The computer is proceeding as follows:

Filtering rows according to WHERE
Sorting results according to ORDER BY
Displaying requested columns or expressions.

🚨 Clauses are written in a fixed order: SELECT, FROM, WHERE, then ORDER BY 🚨.

That’s a wrap! 🎉

For today…

Key Points 🔑️

It is useful to apply conventions when writing SQL queries to aid readability.
Use logical connectors such as AND or OR to create more complex queries.
Calculations using mathematical symbols can also be performed on SQL queries.
Adding comments in SQL help keep complex queries understandable.

❓Questions?

Resources 🔗

w3school’s SQL tutorial
codecademy’s SQL tutorial
Data Management with SQL for Social Scientists
SQL Basics Cheat Sheet
Google (e.g., “how do i … in SQL with …”)

Credits 🙏

This presentation was largely adapted from the wonderful course “Data Management with SQL for Ecologist” by datacarpentry.org and Cathy Tanimura’s “SQL for Data Analysis” book.