If you’re an SEO, you’re probably running a ton of different audits and analyses.
There are content audits, technical audits, internal link analyses, on-page audits, keyword gap analyses, etc.
To build these audits and analyses, you have to use various SEO tools. At my agency, we’re using ~10 right now, and I bet that’s on the lower end of most.
Each tool has a specific use case (which is why you need so many).
You use Ahrefs or SEMRush for keyword and backlink data, Screaming Frog and Sitebulb for crawl data, Google Analytics for session and UX metrics, Google Search Console for query and click data, etc.
Out of the box, these types of tools are helpful, but the insights you can derive from them are limited.
Because these tools are built for large audiences with broad use cases, it’s simply not feasible for a company like SEMRush or Ahrefs to build every little analysis and reporting view into their tool.
The real value of these tools is their data. Luckily, most of these tool providers allow you to export their raw data in one way or another.
Once you have these raw datasets, if you know how to combine them, filter them down, aggregate them, and then create custom logic, the depth of analysis you can do is miles more advanced than if you’re using each tool’s UI.
That’s where SQL (Structured Query Language) comes in. Now before you get scared off by the thought of learning a programming language (if I can even call it that), let me ease that fear.
If you feel confident you can learn Excel, I’m confident you can learn SQL.
I’ve largely automated ~80% of the deliverables my agency creates writing SQL in BigQuery. It has saved me countless hours of work and has been a major contributing factor in growing my agency to over 6-figures while working full-time.
In this post, I’m going to show you why knowing basic SQL is such a powerful skill for an SEO to have and how to start learning it.
One of the issues with learning technical skills is that they are often WAY overcomplicated by the teacher. I want to avoid that and make this as simple as possible for you.
Before we talk about SQL, let’s define a database.
A database is simply a place where structured information (i.e. data) is stored.
Picture a folder on your computer with two Excel files in it. The folder is the database, and each Excel file is what is called a table.
You can also think of an Excel file as a database. Each sheet, in this case, would be a table.
SQL is an intuitive “programming” language that allows you to access and manipulate one or more databases and the data the lives within tables in those databases.
It’s very similar to using an Excel formula that references different cells in one or more sheets. It’s just much more dynamic and powerful.
As an SEO, SQL can help you automate manual tasks, speed up processes and workflows, and ultimately get more out of the data you are already using.
The website quality audit is a tool developed by The Blueprint Training that is essentially a content audit, technical audit, and overall website quality report all in one. It’s built by taking six different datasets, manipulating the data, and then joining it together (we’ll talk about “joins” later) on URL.
CIFL has an excellent service where you can get a single website quality audit report built for just $30. This is a fantastic option for many agencies.
For me, however, I wanted the ability to customize the report and build it into various workflows and processes within my agency.
When this tool was released, because I knew databases and SQL, I was able to quickly see how to rebuild the tool in BigQuery. I was able to tweak it to fit my needs.
For example, instead of using the SEMRush keyword dataset, I wanted to build it using the Ahrefs Keyword and GSC query dataset. I also wanted to automate a portion of the manual steps required to categorize pages and URL actions.
There were some additional columns I wanted to add to the tool like the date a blog post was published.
I also wanted to build in keyword research for each URL into the website quality audit and then add rank tracking data to make sure the right URLs were ranking.
To give you an idea of how fast you can build cool tools like the website quality audit once you know SQL, this took me around ~5 hours to put together. Now it’s a process I have someone on my team run that takes less than 30 minutes.
There are always going to be manual tasks in SEO (at least in the foreseeable future).
There isn’t a way to avoid that, at least for now. However, you can largely automate the majority of analysis and reporting you’re doing as an SEO using a little bit of SQL.
In addition to the website quality audit, we’ve built tools in BigQuery for the following:
SQL is more about learning how to think about data and databases. Once you have a solid understanding of how they work, it’ll be hard to stop the ideas from coming to you!
SQL allows you to stitch together data and build handy reports and tools. But it doesn’t just stop there.
You can also use SQL to do analysis and build various automations. Once I finish building a website quality audit, I push the final dataset into a new table in BigQuery that I can run analysis and automations on.
Below I show specific examples of SQL for automation and analysis. The goal is not to have you understand the actual SQL, but to see how powerful a short piece of code can be.
At the end of this article, I dumb down the basics of SQL and then provide you with some learning resources.
As part of the website quality audit, you have to manually tag pages by page category. For example, is the URL a blog post, landing page, blog category, etc.
You can easily automate much of this process using simple IF THEN login.
In SQL, IF statements are called CASE statements.
Here’s an example of a CASE statement that automates the categorization of blog posts, blog category pages, and blog tag pages.
CASE WHEN URL LIKE ‘%/blog/%’ Then ‘Blog Post’ WHEN URL LIKE ‘%/category/%’ Then ‘Blog Category’ WHEN URL LIKE ‘%/tag/%’ Then ‘Blog Tag’ ELSE ‘’ END
You can customize this same logic to match the URL structure of the site your working on and automate a vast portion of this process.
Another part of the website quality audit that’s manual is the URL action portion, which is essentially a content audit.
You determine what to do with a URL based on various data points. Instead of manually looking at every URL, let’s create some rules to automate this. We’ll develop rules within a CASE statement similar to the section above.
If a URL has significant traffic going to it, we’ll want to tag the URL with “Update On-Page” so that we can look at the page in more detail, make on-page changes, and increase traffic. Here’s how we’ll do that:
CASE WHEN Sessions > 100 THEN ‘Update On-Page’ ELSE ‘’ END
Now let’s add logic to the CASE statement that noindexes blog category and blog tag pages that aren’t already noindexed.
CASE WHEN Sessions > 100 THEN ‘Update On-Page’ WHEN PageCategory IN(‘Blog Category’,’Blog Tag’) AND Indexability <> ‘Nonindexable’ Then ‘NoIndex’ ELSE ‘’ END
Now let’s add logic that 301 redirects thin content pages, that are blog posts, and that have links:
CASE WHEN Sessions > 100 THEN ‘Update On Page’ WHEN PageCategory IN(‘Blog Category’,’Blog Tag’) AND Indexability <> ‘Nonindexable’ Then ‘NoIndex’ WHEN PageCategory = ‘Blog Post’ AND Backlinks > 0 AND WordCount < 400 THEN ‘301 Redirect’ ELSE ‘’ END
You can write this custom logic for each site you’re working on depending on how things are set up.
Duplicate pages can hurt your rankings. We can write a simple query that checks for duplicate pages.
Select PageTitle ,COUNT(PageTitle) as TotalPageTitles From WebsiteQualityAuditFinalDataset Where StatusCode = 200 And Indexability = ‘Indexable’ Group By PageTitle Order By 2 Desc
What this query is doing is taking the final website quality audit data set and counting the number of page titles that are the same where the status code is 200, and the pages are indexable.
It’s then ordering the TotalPageTitles column in descending order, so the pages with the most duplicated pages are shown at the top.
Buckets are simple groupings. For example, let’s say I wanted to group Domain Rating into groups of 10. I would create the following buckets:
These are useful when you are building reporting from your data.
For example, I wrote a post that shows how to automate a competitor backlink analysis using a Google Sheets template we built. One of the views I built into the report shows link counts by DR bucket and anchor text type:
Here’s how you would create buckets using SQL:
CASE WHEN DR BETWEEN 0 and 10 Then ‘0 to 10’ WHEN DR BETWEEN 11 and 20 THEN ‘11 to 20’ WHEN DR BETWEEN 21 and 30 THEN ‘21 to 30’ WHEN DR BETWEEN 31 and 40 THEN ‘31 to 40’ WHEN DR BETWEEN 41 and 50 THEN ‘41 to 50’ WHEN DR BETWEEN 51 and 60 THEN ‘51 to 60’ WHEN DR BETWEEN 61 and 70 THEN ‘61 to 70’ WHEN DR BETWEEN 71 and 80 THEN ‘71 to 80’ WHEN DR BETWEEN 81 and 90 THEN ‘81 to 90’ WHEN DR BETWEEN 91 and 1000 THEN ‘91 to 100’ ELSE NULL END
You can read the above snippet of code as, IF the DR is between 0 and 10, THEN create the bucket ‘0 to 10’, etc.
When performing analysis on SEO datasets (especially as it relates to link building), you often want to take ugly URLs and parse out the root domain or TLD. You can use a complex piece of REGEX, or you can use the following functions:
For example, here’s my URL: https://codingisforlosers.com/recipes/
If I write NET.REG_DOMAIN(”https://codingisforlosers.com/recipes/”), the output would be the following:
If I write NET.PUBLIC_SUFFIX(https://codingisforlosers.com/recipes/), the output would be the following
Simple but extremely useful.
As an SEO, you often want to tag what part of the marketing funnel a keyword is.
The logic that goes into this is simple.
First, import a table into BigQuery that looks like this:
Run this code:
WITH Keywords AS ( SELECT DISTINCT Keyword FROM `ciflexample.CIFL.tblFunnel` ), Funnel AS ( SELECT DISTINCT TOFU, MOFU, BOFU FROM `ciflexample.CIFL.tblFunnel` WHERE TOFU IS NOT NULL ), Classify AS ( SELECT DISTINCT Keyword, CASE WHEN REGEXP_CONTAINS(Keyword,TOFU) = TRUE THEN 1 ELSE 0 END AS IsTOFU, CASE WHEN REGEXP_CONTAINS(Keyword,MOFU) = TRUE THEN 1 ELSE 0 END AS IsMOFU, CASE WHEN REGEXP_CONTAINS(Keyword,BOFU) = TRUE THEN 1 ELSE 0 END AS IsBOFU FROM Keywords k CROSS JOIN Funnel f ), Aggregate AS ( SELECT DISTINCT Keyword, SUM(IsTofu) AS IsTofu, SUM(IsMofu) AS IsMofu, SUM(IsBofu) AS IsBofu FROM Classify GROUP BY Keyword ) Select *, CASE WHEN IsTofu + IsMofu + IsBofu > 1 THEN "DoubleCheck" When IsTOFU = 1 and IsMofu = 0 and IsBofu = 0 THEN 'TOFU' When IsTOFU = 0 and IsMofu = 1 and IsBofu = 0 THEN 'MOFU' When IsTOFU = 0 and IsMofu = 0 and IsBofu = 1 THEN 'BOFU' ELSE NULL END AS Funnel FROM Aggregate
All this is doing is saying, if the keyword contains any of the words in the TOFU, MOFU, OR BOFU column, classify it as TOFU, MOFU, or BOFU.
Here’s the output:
For keywords that flag as more than one part of the funnel, I create logic to flag the keyword a “DoubleCheck.” All it does is add the IsTofu, IsMofu, and IsBofu column together and then will flag as DoubleCheck if the sum is greater than one.
If you look at the “how to buy this” keyword, it gets flagged as “double check” because “how” is a TOFU keyword and “buy” is a BOFU keyword. I’d want someone to manual look at this one to determine its position in the funnel.
I’m not going to go too deep into the basics of SQL because David already has a post you can check out here.
However, I do want to try to show you how simple a lot of the foundational SQL concepts are.
SQL is a very intuitive language, meaning it reads like the English language. When I’m explaining it, I’m going to read the queries like they were English.
BigQuery is my go-to platform for writing SQL. It’s free unless you’re doing some serious data wrangling.
Once you’ve set it up, you can hook Google Sheets up directly to it.
Here’s a Screaming Frog crawl I ran on CIFL. I kept the following six columns:
Here’s how you can connect Google Sheets and BigQuery.
Once you have your Google Sheet connected, you can write the following foundational SQL statement to view your data.
SELECT * FROM `ciflexample.CIFL.tblsfdata`
The snippet above simply means “select (i.e. SELECT) all (i.e.) of the columns from (i.e. FROM) the Google Sheet we just connected that I named ciflexample.cifl.tblsfdata.
If we click RUN, the output will be the same data as we have in our Google Sheet.
Now, let’s say we want to audit CIFL’s 301 redirects. We’ll use the WHERE clause to create logic that says to only show us URLs with a status code equal to 301. Here’s what that look looks like:
Now let’s say we want to see how many of each status code, CIFL has. To do that, we’ll need to count each URL by status code. This is called “aggregating” your data. If you were to do a SUM, AVERAGE, etc. those would be aggregations as well.
In the query above, I’m counting Address (i.e. URL) and then grouping the count by (i.e. GROUP BY) status code. This shows me how many URLs fall into each status code category.
One of the last foundational concepts in SQL is joins. Let’s say we have our Screaming Frog data set, and we want to see how many unique referring domains each URL has.
Here’s a Google Sheet that contains the Link URL for each unique referring domains I imported from Ahrefs:
We’re going to use a couple of the concepts above to get the total link count by URL:
Now, all we need to do is get the backlink counts appended to the Screaming Frog dataset. It would look like this:
We do this through what’s called a “join” in SQL. Notice how our Screaming Frog dataset and our backlink dataset both have the URL column in common. A” join” will match each dataset on whatever column or columns you specify. Here’s the join:
In plain English, we are joining the AggregateLinkData (i.e. the backlink dataset) table to the SF (i.e. the Screaming Frog dataset) table on (i.e. on) the URL column.
Here’s the final set of code to do this:
WITH sf AS ( SELECT *, REPLACE(REPLACE(REPLACE(REPLACE(Address,'https://www',''),'https://',''),'http://www.',''),'http://','') AS URL FROM `ciflexample.CIFL.tblsfdata` sf WHERE StatusCode = '200'), LinkData AS ( SELECT REPLACE(REPLACE(REPLACE(REPLACE(LinkURL,'https://www',''),'https://',''),'http://www.',''),'http://','') AS URL FROM `ciflexample.CIFL.tblbacklinkdata` ), AggregateLinkData AS ( SELECT URL, COUNT(URL) AS LinkCount FROM LinkData GROUP BY URL ) SELECT s.*, ld.LinkCount FROM SF s LEFT JOIN AggregateLinkData ld ON s.URL = ld.URL ORDER BY ld.LinkCount DESC
I hope this article showed you that learning SQL is a very attainable skill. It can make you a more efficient and data-driven SEO.
To learn SQL, you have to have projects to apply it to. Don’t be the person who’s taken 100 hours of online programming classes, but still can’t build anything on their own. The best advice I can give you to learn SQL is:
If you follow the advice given above, you’ll learn SQL in no time AND build a lot of cool stuff in the process.
With SQL, you can build your own SEO tools, process automations, and workflows. You won’t have to rely on third-party tools anymore to provide direction. You can use their datasets to derive the insights you need to make decisions.
Daniel Cuttridge makes an excellent point in an article he wrote about his methodology for on-page optimization.
In it, he’s talking about some of the fundamental problems with SEO tools and says, “They contain many useless recommendations that are built in to compete with their competitors in an inflating game of ‘total items checked’.”
This is a bold, but valid statement.
Instead of confusing yourself trying to learn all the nuances of these tools (only to get useless or misleading information), export the raw data, throw it into a database like BigQuery, and then write some SQL to answer specific questions you have.