Start with 7 free days of training.

Gain instant access to our entire IT training library, free for your first week.
Train anytime on your desktop, tablet, or mobile devices.

The exam associated with this course was retired December 31, 2016. However, this course still retains value as a training resource.


This Google BigQuery course with Garth Schulte covers the exam objectives for the Google Qualified BigQuery Developer certification, and gets you up to speed on the Google Cloud Platform's blazing-fast Big Data analytics solution. CBT Nuggets is a Google Cloud Platform Training Partner.

Recommended skills:
  • Fundamental SQL skills
  • Fundamental Java and/or Python programming skills

Recommended equipment:
  • A machine running Windows, OS X, or Linux

Related certifications:
  • Google Qualified BigQuery Developer

Related job functions:
  • Big Data Analytics
  • SQL Developer
  • DevOps
  • Cloud computing
  • Data wizardry

Big Data is a hot topic. But getting answers from a Big Data pipeline has always been difficult and expensive, as it required skilled engineers familiar with complex Big Data ecosystems such as Hadoop. BigQuery changes all of that! Get on board with the hottest technology in Big Data analytics and let Google help support your Big Data pipeline with BigQuery, a Google Cloud Platform product that makes your Big Data easily accessible so anyone can get answers quickly and easily. The Google Qualified Developer program is a new breed of developer-centric certifications for Google Cloud Platform products.

This Google BigQuery course is adapted from Google's internal instructor-led training to prepare you for the exam. It's ideal for beginner or advanced IT professionals looking to add cloud-based Big Data analytical skills and Google certified credentials to their resume.
1. Introduction to Google BigQuery (22 min)
2. BigQuery Basics (22 min)
3. BigQuery Basics Demo (30 min)
4. Importing and Exporting Data (19 min)
5. Importing and Exporting Data Demo (25 min)
6. Querying Data Basics (24 min)
7. Querying Data Basics Demo (29 min)
8. Querying Data Advanced (27 min)
9. Querying Data Advanced Demo (22 min)
10. Managing Data (19 min)
11. Managing Data Demo (18 min)
12. BigQuery Internals (9 min)
13. Programmatic BigQuery (14 min)
14. Programmatic BigQuery Demo (Python) (11 min)
15. Programmatic BigQuery Demo (Java) (16 min)
16. BigQuery Integration and Visualization Tools (6 min)
17. Google Cloud Platform Qualified Developer (9 min)

Introduction to Google BigQuery


Introduction to Google BigQuery. Hey, everyone. Welcome to BigQuery. In this course, we're going to learn all about the future of big data analytics. BigQuery is an impressive technology, and I tell people that it's way ahead of its time, because big data is still in its infancy, and getting a big data infrastructure set up takes a lot of work and a lot of technologies.


So we're going to learn all about what BigQuery is, where it fits inside of your big data pipeline, where it came from, and why it's so awesome. So in this introductory Nugget, we're going to start here with the course introduction just to get you up to speed on what to expect out of the course-- what our objectives are, who this course is for-- and we'll talk about the Google-qualified BigQuery Developer certification.


Google has put out certifications for each product inside of the Google Cloud platform. So there's one for App Engine, there's one for Compute Engine, one for Cloud Storage, one for BigQuery here, and one for Cloud SQL. So we'll talk a little bit about our objective.


From there, we'll get into the good stuff. And we'll start at the beginning. And I mean the very beginning. Back in 1998, Google was just a machine sitting in a lab at Stanford. So they were really the first ones with the big data problem. In order for web search to be successful, they had to index the entire Internet.


That's a big data problem. And back then, the only way to scale that anybody knew of was to just buy these huge, expensive servers. Well, Google went about it differently to save money. And rather than buy one or two big, expensive servers, they said hey, why don't we sink all that money into hundreds or thousands of cheap commodity servers and then write our own custom distributed software to treat those machines as a single unit?


And these underlying technologies are impressive engineering feats. And the world knows it. Because when Google releases a white paper, industries spring up. Other big IT companies, like the Facebooks and Yahoos and eBays and Amazons of the world, follow suit and build their own implementations off of Google's white paper.


So we're going to take a really big look there at Google's big data stack, from the early days to the current days. From there, we'll get into BigQuery itself. We'll define what it is, why and when we should use it, and we'll look at a very simple end-to-end workflow.


And we'll finish this Nugget with a look at a BigQuery query demo just to show you how easy it is to work with BigQuery. We'll log in, we'll write a query, and we'll see how fast we can get those results on a large amount of data. All right. Let's get started and talk about what this course is going to consist of.


So once again, welcome to Google's BigQuery. We're going to have a lot of fun in this course. We're going to learn a lot. I'll keep the energy up, so if you're feeling kind of down right now, stand up. Do some jumping jacks. Get excited. Because this technology is awesome.


Here's how the format of this course is going to work. Generally, we're going to have a theory Nugget and follow it up with a live demonstration Nugget to show how that theory works in practice. There are a few exceptions to that rule. The first one here, our introduction, we're not going to have a demo associated with it.


Our last one, when we go over the Google-qualified BigQuery Developer certification, we won't have a demo associated with it. BigQuery internals, where we get into the inner workings of Dremel and BigQuery and multi-level execution trees to see exactly how BigQuery gets its answer so fast.


We won't have a demo associated with it. And then this one won't as well-- our BigQuery integration and visualization, where we go over third-party ETL tools that can talk directly to BigQuery along with some third-party visualization tools that you can use to create nice digital dashboards from your BigQuery data.


The rest of these are going to have at least two Nuggets associated with them. So all of these will have a theory and follow it up with a demo. Except programmatic BigQuery. We're going to have three Nuggets associated with that. One for theory, one for how to work with the BigQuery RESTful API and BigQuery client libraries in Java, and one for how to do the same thing in Python.


So all in all, 17 Nuggets that are going to span the core concepts and then some when it comes to BigQuery. Who is this for? Really anybody, because that's how BigQuery rolls. It's really easy to get into. If you have basic SQL knowledge, you're ready to query big data.


Pretty cool. So really, it's intended for big data professionals who are excited and interested to learn about the future here of cloud business intelligence and big data analytics. And finally, our objective here for this course is to prepare you to become a Google-qualified BigQuery developer.


So my responsibility is to get you ready to take that test and become certified. More on that when we get into our final Nugget here, when we talk about the certification itself. So let's start at the beginning. Google and big data. Synonymous with one another.


Because Google is really the reason that big data went mainstream and technologies like Hadoop and the entire ecosystem around it came to be. So, again, Google had this problem first. The big data problem first. Because back in the late '90s, when web search started blowing up, they said you know what, we need to index the entire Internet.


How do we do that? That's an incredibly large volume of data. And so they looked at their options, and they said here's how everybody else does it. They scale up. But you know what, scaling up is expensive. We have to buy a $1 million server and then a few years later upgrade to the $3 or $4 or $5 million server?


That's just not economical. On top of that, scaling up is error-prone. We have single points of failure. And how do we get away from that? We buy more big machines-- which leads back to the first problem. Now it's even more expensive than it was. Not to mention scaling up has limitations.


There's only so many CPUs and so much memory you can stuff into a machine. Plus they would now be reliant on other hardware and software vendors. So what it boiled down to here is the traditional methods of scaling up just didn't cut it. Google wanted to not only save money, but they wanted to be self-reliant.


And they wanted an infrastructure where failure was not only OK, but it was the norm. If you expect failure and you design around failure, then it's all good when failure happens, right? So that's why they went with scaling out rather than scaling up.


By buying a fleet of commodity machines and then writing distributed software to treat those machines as a single unit where, again, failure is the norm. Because as long as our software can handle it and we can replicate our data across the cluster multiple times, then if something happens, eh, no big deal.


Our data is safe. And if the time comes where we get bigger and bigger and we need to scale out even more, no problem-- we just add more servers into the cluster. So that's the hardware side of things. Lots of little servers. Scaling out. The software side of things is where everything starts to get really impressive and why Google is known as the masters and the pioneers of distributed architectures.


So it all started in the early 2000s in what's known as Google's big data stack, phase 1, or version 1. And it consists of three core technologies. The first one was the Google File System. Obviously we've got this army of machines. We're going to need a file system that spans all of these machines.


And there just wasn't anything out there in the market. So they built their own. And the core concept behind GFS is failure. Disks fail all the time. So if we have a file system that can handle failure by actively monitoring our cluster and ensuring that the data is always replicated multiple times across our cluster, then we're in good shape.


So that's GFS-- a distributed cluster-based file system. Next up came MapReduce, which is a programming framework for processing all the data spread across the cluster of machines in parallel. So if you're Google and you need to store the entire Internet across the GFS cluster and you've got crawlers out there every day storing that, how are you going to extract and aggregate the relevant information out there to create your search indices from?


So MapReduce was born. And Google engineers would write MapReduce jobs that ran every night to recompute their search indices. The last piece of Google's big data 1.0 stack is BigTable. BigTable sits on top of GFS and brings a little structure to that unstructured data sitting within.


Think of BigTable as this big, distributed hashmap-- a key value store, in other words-- spread across many machines. So when you put something in BigTable, you give it a key, you store the value, and then you can only look up that value by key. Now, again, imagine we have the Internet here, stored in this unstructured mess in GFS.


MapReduce transforms that data, makes sense out of it, and then stores it in BigTable by key, the key being the URL, the value being all that relevant information that MapReduce transformed. Now, all of a sudden, finding data is super-easy, not to mention extremely efficient.
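A toy in-memory stand-in makes that access pattern concrete: you can put and get strictly by key, and there is no querying by value (that's what the MapReduce transform upstream is for). Real BigTable is a distributed, sorted, sparse map spread across many tablet servers; this sketch only shows the programming model, and all names in it are illustrative.

```python
# Toy sketch of the BigTable access pattern described above: values are
# stored and retrieved strictly by key (here the URL). Real BigTable is a
# distributed, sorted, sparse map across many machines.

class ToyBigTable:
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        # Lookup is by key only -- there is no "query by value".
        return self._rows.get(key)

table = ToyBigTable()
table.put("http://example.com", {"title": "Example", "links": 12})
table.put("http://example.org", {"title": "Example Org", "links": 3})

print(table.get("http://example.com"))  # {'title': 'Example', 'links': 12}
```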


So that's BigTable-- structured storage that can scale to petabytes across thousands of machines. And that sits underneath a good majority of Google products out there today. And this first big data stack is really where the big data revolution as we know it began.


Because in late 2003, Google released the white paper to GFS. Not long after that, in 2004, they released the white paper to MapReduce. And in about 2006, they released the white paper to BigTable. Well, everybody knows what happens after they release the white paper to MapReduce, right?


A gentleman by the name of Doug Cutting, along with the open-source community, built everyone's favorite elephant, Hadoop. So GFS is where HDFS came from-- the Hadoop file system. Hadoop has its own version of MapReduce, which is based on Google's MapReduce, and then BigTable is really known as the grandfather of NoSQL, because that's really where CouchDB came from, MongoDB, Cassandra-- all these other NoSQL databases-- and also where HBase, which is a Hadoop subproject, came from.


Not to mention all these other big IT companies out there that took notice, followed suit, and built their own internal versions. So while that big data revolution is happening, Google-- as it always seems, they are ahead of the curve with everything-- went hard at work on version 2.0 of their big data stack.


They already knew the challenges that these three technologies posed-- for instance, GFS. While it's great for durability-- your data is always going to be safe, because it's replicated everywhere-- it suffers from reduced availability. And that's because it had a metadata server that was a single point of failure.


And if you're familiar with Hadoop, it's very much like the NameNode and secondary NameNode. If those go down, you're totally hosed. Your data is unavailable. So that's the problem with GFS-- single point of failure. MapReduce? Oh, boy, MapReduce. MapReduce, number one, is difficult.


You need to be a pretty seasoned programmer just to write even basic MapReduce jobs. But complex MapReduce jobs? Where there's multiple map and reduce phases coordinating state and doing joins? It's a nightmare. So MapReduce is difficult. And it's slow.
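For a sense of the model itself, here's a single-machine Python sketch of the canonical word-count job: map emits (word, 1) pairs, a shuffle groups the pairs by key, and reduce sums each group. The real framework's value (and difficulty) comes from running these phases across thousands of machines; the structure, though, is just this.

```python
from collections import defaultdict

# Single-machine sketch of the MapReduce model: the canonical word-count
# job. map_phase emits (word, 1) pairs, shuffle groups them by key, and
# reduce_phase sums each group.

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big query", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'query': 1}
```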


It's built and meant for batch processing, not real-time or near-real-time processing or analytics. BigTable, its problems come up in multi-data-center environments. Because replication between data centers is eventually consistent. And eventual consistency is another nightmare for programmers to deal with.


DNS is a great example of eventual consistency, because it's a globally distributed database. And if we were to make a change to a DNS record, it would take a while for it to propagate across the Internet. So needless to say, it's difficult to build applications around eventual consistency, and that was one of the big challenges with BigTable.
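To see why eventual consistency is painful to program against, here's a toy two-replica sketch (purely illustrative; real replication happens over a network, not between two dicts). An eventually consistent write acknowledges before all replicas have the data, so a read from another replica can be stale; a strongly consistent write commits everywhere before returning.

```python
# Toy illustration of the consistency gap described above. Two replicas:
# an eventually consistent write updates only the local replica and
# propagates "later", so a read from the other replica can be stale.
# A strongly consistent write commits to every replica before returning.

class Replicas:
    def __init__(self):
        self.nodes = [{}, {}]

    def write_eventual(self, key, value):
        self.nodes[0][key] = value  # propagation to nodes[1] happens later

    def propagate(self):
        self.nodes[1].update(self.nodes[0])

    def write_strong(self, key, value):
        for node in self.nodes:     # commit everywhere before acknowledging
            node[key] = value

    def read(self, key, node=1):
        return self.nodes[node].get(key)

db = Replicas()
db.write_eventual("dns:example.com", "1.2.3.4")
print(db.read("dns:example.com"))  # None -- stale read from replica 1

db.propagate()
print(db.read("dns:example.com"))  # 1.2.3.4 -- eventually consistent

db.write_strong("dns:example.com", "5.6.7.8")
print(db.read("dns:example.com"))  # 5.6.7.8 -- you can read your writes
```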


So Google refined these ideas and built new technologies on top of them to solve many of those challenges. And you can see that here with Megastore. Megastore is built on top of BigTable, and it brings strong consistency at the data center level, where BigTable left off with eventual consistency.


And it does so using a distributed commit log all agreed upon using a consensus algorithm known as Paxos. Strong consistency, by the way, just means that you can read your writes. So if you are editing data inside your web application and hit the Save button, then you expect the next time you read that and anybody else that reads that record are going to see the same results.


So that's Megastore-- strong consistency on top of BigTable. Just simply means if data goes through Megastore to get into BigTable, that data needs to be hardened-- committed across those BigTable tablet servers whether they're in one data center or multiple data centers.


And I've got another fun story. In fact, everything in the Google Cloud platform is really just a friendly interface for the public to use on top of all these technologies. But this is really evident when you look at the Cloud Datastore in relation to BigTable and Megastore.


If you're not familiar with the Cloud Datastore, it's really known as App Engine's NoSQL database, because it's really easy to work with. But it's also its own product that you can hit externally. Well, essentially, the Cloud Datastore is built on top of Megastore and BigTable.


So when you're designing your data models and your data access layer in, say, App Engine, you get to choose between eventual consistency and strong consistency. Which is really cool that you have the option. Because, sure, there are applications or parts of an application out there that can tolerate data latency and can benefit from eventual consistency.


Cool stuff. Now, let's keep moving on here. We're going to stay underneath. And the reason I'm staying underneath here is because these all happen to deal with the same thing, really-- distributed consistency. Distributed consistency is obviously a very difficult thing, and Google has sunk a lot of resources in over the years to solve the challenges.


And Spanner is another really cool technology. In fact, it's known as the next iteration of Megastore, because it does everything that Megastore does, only on a much larger scale. So it's just another distributed database, but it's a planetwide-scale distributed database.


So if you're designing a planetwide distributed database across your network, like Google, that spans the globe, what's your biggest challenge going to be if you still want to support global transactions and strong consistency at the planet level? Time, right?


Time's going to be the big factor. So what Google did to solve that through Spanner is install GPS receivers and atomic clocks in every data center. That is pretty hardcore. And that's how they solved the global time ordering problem, and that's also, now, how you can see why they called it Spanner.
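Spanner's published design calls this clock API TrueTime: instead of a single instant, it returns an uncertainty interval, and a commit waits out that uncertainty before becoming visible, so timestamps end up globally ordered. Here's a toy simulation of the idea; the FakeClock and the numbers are invented for illustration.

```python
# Sketch of the "commit wait" idea from Spanner's published design: the
# clock returns an uncertainty interval rather than an instant, and a
# commit waits until the interval's lower bound reaches its timestamp,
# so commit timestamps are globally ordered.

class FakeClock:
    """Clock with a known error bound, like GPS + atomic clocks provide."""
    def __init__(self, epsilon_ms):
        self.t = 0
        self.epsilon = epsilon_ms

    def now_interval(self):
        return (self.t - self.epsilon, self.t + self.epsilon)

    def advance(self, ms):
        self.t += ms

def commit(clock):
    """Pick a commit timestamp, then wait out the clock uncertainty."""
    _, latest = clock.now_interval()
    timestamp = latest
    while clock.now_interval()[0] < timestamp:  # wait until earliest >= ts
        clock.advance(1)
    return timestamp

clock = FakeClock(epsilon_ms=5)
ts = commit(clock)
print(ts)       # 5: timestamp chosen at the interval's upper bound
print(clock.t)  # 10: we waited about 2*epsilon before exposing the commit
```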


Next up we have Colossus. Not much is known about Colossus from an architectural standpoint, because Google has not yet made the research papers public. But what we do know is it is GFS version 2. So they took everything they learned over the years with GFS and improved upon and solved any limitations and just made a bigger, better version.


And it's all but replaced GFS at Google. We're going to go back in time a bit here, but we saved the tech behind BigQuery for last: Dremel. Dremel came about because Google engineers had a hard time getting quick answers using MapReduce. Again, MapReduce-- batch processing, it takes a while to run jobs to get answers.


So Google said you know, we need a way that we can get answers quick, we can do ad hoc analysis of our data, and we need a friendly interface, a SQL-like interface so anybody internally can write queries to get answers out of our data. So that's what they did.


They built a distributed SQL engine that could pull data from virtually any storage technology at Google, which made it extremely easy for anybody to write queries against anything. And, by the way, Apache Drill is the open-source implementation here of Google's Dremel.


So there's the history of big data at Google. If there's one thing to note here, it's that it really is a tale of two stacks. The first stack was all about batch processing. The second stack is all about real-time processing. And that's a great segue into BigQuery.


So what is BigQuery? It is the public implementation of Dremel. It is a fully managed data analysis service for our big data. You can almost go as far as to say it's big data as a service or analysis as a service. Because that's essentially what it is.


It's reliable, secure, scalable, friendly, and extremely fast. And it does so by using a multi-level execution tree to dispatch queries and aggregate results across thousands of machines-- something that we will dig into when we get into the internals. Why use BigQuery?


Well, because it fills a necessary gap in the big data ecosystem, and that is ad hoc, real-time analysis of our big data. There's just no good way to do it otherwise. Sure, in the Hadoop ecosystem you have Pig and you have Hive. Those were meant to be friendly interfaces on top of MapReduce, but they're not real-time analytics.


They're still spinning up MapReduce jobs, and it's still batch processing underneath the hood. And there are other techs out there, such as Cloudera's Impala. There's, as I mentioned, Apache Drill. Storm and Spark also aim to get close-to-real-time analysis.


But nothing does it quite like BigQuery. Because with those other techs, you're still going to need skilled people to get at the data. BigQuery totally redefines that, because anybody that's familiar with SQL can easily access the data stored inside of BigQuery.


And it's really easy to work with from all angles, not just the query perspective, but from importing and exporting, from shaping your data inside of BigQuery, to programmatically pulling it out via the API. As you'll see throughout this course, all of it is pretty straightforward to work with.
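As a small taste of that programmatic angle, here's a sketch of building a legacy-SQL query string against the public Shakespeare sample. Only the string-building below actually runs anywhere; the commented client call at the bottom is an assumption about the google-cloud-bigquery library's shape and would need valid credentials.

```python
# Sketch of pulling data out programmatically. The dataset and table are
# the public samples mentioned in this course; the client call at the
# bottom is shown for shape only and assumes credentials are configured.

def build_top_words_query(table="publicdata:samples.shakespeare", limit=10):
    """Build a legacy-SQL query string for the public Shakespeare sample."""
    return (
        "SELECT word, SUM(word_count) AS total "
        "FROM [{table}] "
        "GROUP BY word ORDER BY total DESC LIMIT {limit}"
    ).format(table=table, limit=limit)

query = build_top_words_query(limit=5)
print(query)

# With credentials configured, the call would look roughly like:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
#   rows = client.query(query, job_config=job_config).result()
```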


And one more time, I'm just going to talk a little bit about BigQuery and MapReduce, because people that get into BigQuery the first time that are familiar with Hadoop, they're like ah, cool-- a replacement for MapReduce! And that's not at all true. This is not an OR proposition, but it's an AND proposition.


BigQuery and MapReduce should be in the same sentence, because they're in the same pipeline. It's just that MapReduce is in the beginning of the pipeline for taking all the raw data and aggregating and transforming it into something that BigQuery can understand.


And then BigQuery's near the end of the pipeline, where you ingest the data into BigQuery and then take it out of BigQuery and put it inside of a digital dashboard or a visualization or a web app. So let's look at a really simple end-to-end workflow. The first step is always going to be ingestion.


And that is getting your data, wherever it lives, up into Google's cloud. It could live in a relational database. Could be sitting in log files on your on-premises systems. Could be sitting in another cloud-- say, Amazon's cloud or whatnot. So you take your raw data and you get it onto Google's network.


And you're going to hear the cloud storage product quite a bit. Because think of it as the Google Cloud platform's file system. Because that's essentially what it is. It's going to be used as a staging area for a lot of things in the Google Cloud platform, but especially when it comes to BigQuery, because BigQuery can easily import and export from Cloud Storage.


But once we get it into Cloud Storage in a raw format, now we could optionally spin up a Compute Engine instance or a network of Compute Engine instances containing Hadoop and MapReduce. We could transform that data into something that BigQuery understands, which is CSV or JSON-- more on that later-- and then we'll put it back into Cloud Storage.
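As a tiny illustration of that transform step, here's a Python sketch (the log format is invented for the example) that reshapes raw log lines into the two formats BigQuery can load: CSV and newline-delimited JSON, one object per line.

```python
import csv
import io
import json

# Minimal sketch of the transform step: raw log lines reshaped into the
# two formats BigQuery can load, CSV and newline-delimited JSON. The log
# format here is invented for illustration.

raw_logs = [
    "2014-03-01T10:00:00 GET /index.html 200",
    "2014-03-01T10:00:01 POST /api/upload 500",
]

def to_records(lines):
    for line in lines:
        ts, method, path, status = line.split()
        yield {"timestamp": ts, "method": method,
               "path": path, "status": int(status)}

# Newline-delimited JSON: one JSON object per line.
ndjson = "\n".join(json.dumps(r) for r in to_records(raw_logs))

# CSV with the same columns.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["timestamp", "method", "path", "status"])
writer.writeheader()
writer.writerows(to_records(raw_logs))

print(ndjson.splitlines()[0])
print(buf.getvalue().splitlines()[1])
```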


Once it's in cloud storage in a format that BigQuery understands, we can pull it into BigQuery. Once that data's up in BigQuery, we can do anything with it. We could interactively query using the nice web UI that's in the cloud developer's console. We could pull this data out from an App Engine app using the API.


Or we could use third-party tools to create some nice digital dashboards and visualizations. Let's move on and finish up with a quick demo. I'm going to show you how easy it is to work with BigQuery. So we fired up a browser and went to cloud.google.com.


I'm already logged in with my Google account, so we'll just go right to my console. This'll take us into the Google developer's console, where it'll have all your projects listed. So we've got a project here for our App Engine course and now one for our BigQuery course, so we're going to click on GBQ Nuggets here.


That'll take us right into our project and show us our dashboard here for working with all of the products and services here in the Google Cloud platform. We're interested in BigQuery here, so let's expand big data, head right into BigQuery. That's going to pop open a new tab here and take us right into the BigQuery management console.


From here we can hit Compose Query. You'll get a little help box that pops up here if you want to look at some sample queries or look at the query reference that'll take you right to the docs. But we can just start ripping off queries here. If we had our own data sets and tables in here, we could start hitting them right away.


Thankfully, Google supplies us with some nice public data samples here. So, for instance, you have gsod, some weather data here. If you click on Details, you can see how big it is-- 16 gigabytes, 114 million rows. Some Shakespeare data to work with. Some natality data to work with here-- 21.9 gigs.


And the Wikipedia one is certainly the biggest here-- 35.7 gigs, 313 million rows. So we've got plenty of sample data to work with here to test out BigQuery. So from here, if we wanted to write a query, we could just start typing in SQL. Or if you're not a fan of typing, you could just start clicking.


Watch this. Query Table. That will give us the structure, the skeleton, of our SQL. Now we can go over to the schema and just start clicking on our columns. Then we can add formatting, or aggregates, or whatever we want on our own, and boom-- we're done. We can execute the query.


You can also open up the validator here, which'll show you how much data your query is going to run over. Now, I've got a fun little query on the clipboard. There we go. Now, let's give ourselves a little more screen real estate to work with. And you'll see right off the bat we're going to hit 51 gigabytes of data.


So a good amount of data. You can see we have an outer select that's just pulling title and views. And then we have an inner select that's doing all the work here. This inner select is grabbing the title. It's aggregating the views here as views, from the BigQuery samples project.


Notice this. This project isn't in the public data. Well, something that's really cool about BigQuery is that it's a single namespace. So that means you can share your projects with anybody in the world. They can share their projects with you.


You can collaborate with other partners and vendors very, very easily. All right. And if you do want to get this project in your view, it's very easy to do. Just grab the project name here. Let's just copy that. Drop down the menu next to your project, hit Switch to Project, Display Project, paste the name of it in there, hit OK, and now we're going to get all the BigQuery samples that are in the documentation in our view as well.


And by the way, the table that we're going to hit here is the Wikimedia page views. If you want a serious data set, these Wikipedia page views-- watch this. If we go to the details, whoops. I've got to actually hit a table here. Let's just grab one. How about October 2009-- 200910.


Hit the details here. OK, 433 gigs. That's not too bad. Let's go down here to 201001 and look at the details. Here you go. 2.07 terabytes. That's 25 billion rows. Just be careful if you're going to run queries against this kind of data, because you'll run up quite the bill if you're going to process that amount of data.


But let's go back to our query here. So this query is going to run over the Wikimedia data, which isn't quite as big. And we're going to hit 51 gigabytes of data. But the really interesting thing here is look at this-- we're using regular expressions. Regular expressions are built into the language, which is extremely powerful.


And you would think that it would crush performance, but not at all. And this regular expression here, what we're doing is saying give us any page title that starts with a G, ends in an E, and has two o's in between. So that's it. This should give us a title and the number of views, ordered descending, and we're limiting it to 100 here of those most popular page views in Wikimedia here for 01 of 2010.
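The exact pattern isn't shown on screen, so this is a best-guess reconstruction from the spoken description (starts with a G, ends in an E, o's in between, which matches "Google"): something like `^G.*o.*o.*e$`, which in BigQuery's legacy SQL would be wrapped in `REGEXP_MATCH(title, r'...')`. You can sanity-check the pattern locally with Python's re module:

```python
import re

# Best-guess reconstruction of the title filter from the demo: starts
# with G, ends with e, two o's in between. In BigQuery legacy SQL this
# would be REGEXP_MATCH(title, r'^G.*o.*o.*e$').
pattern = re.compile(r"^G.*o.*o.*e$")

titles = ["Google", "Google_Chrome", "Growth_hormone", "GitHub", "Java"]
matches = [t for t in titles if pattern.match(t)]
print(matches)  # ['Google', 'Google_Chrome', 'Growth_hormone']
```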


So let's see how fast BigQuery can handle 51 gigabytes of data. Let's hit Run Query, and this should take under-- wow, two seconds. Well, you know what? I had it cached because I did it before. Let me uncheck Use Cached Results, to be fair here, and start from scratch.


So this is really going to run. Dremel is hitting Google servers right now and pulling that data out. And this should take-- there you go. Six seconds for 50 gigs of data. How impressive is that? But if we look down at the bottom here, we have the query results.


Five per page, or you can page through up to 100 results, since that's what we limited the query to. But check it out-- Google's obviously going to be the first one. 808,000 page views for January 2010 on Wikimedia. Google Chrome. Hey, the growth hormone.


Well, it fit our regular expression, right? Gastroesophageal reflux disease-- it's the best I can do with that one. And Google Wave. So there's your quick demo on how awesome and easy it is to work with BigQuery. Much more to come. In this CBT Nugget, we took an introduction to Google BigQuery.


We started off with a course introduction to get you familiar with what to expect from this course contentwise and also state our objective, which is the Google-qualified BigQuery Developer certification. From there, we took a look at Google's history as far as their big data stack goes, past and present.


We then got familiar with BigQuery. We defined what it is, why we should use it, we compared it to MapReduce and saw that it's more of a complement than a comparison. And then we took a look at a really high-level end-to-end workflow. At the end here, we just took a brief demo to show you how easy it is to get into BigQuery, write queries against your big data, and get those results fast.


I hope this has been informative for you. And I'd like to thank you for viewing.

BigQuery Basics

BigQuery Basics Demo

Importing and Exporting Data

Importing and Exporting Data Demo

Querying Data Basics

Querying Data Basics Demo

Querying Data Advanced

Querying Data Advanced Demo

Managing Data

Managing Data Demo

BigQuery Internals

Programmatic BigQuery

Programmatic BigQuery Demo (Python)

Programmatic BigQuery Demo (Java)

BigQuery Integration and Visualization Tools

Google Cloud Platform Qualified Developer


Intermediate 6 hrs 17 videos


Training Features

Practice Exams
These practice tests help you review your knowledge and prepare you for exams.

Virtual Lab
Use a virtual environment to reinforce what you are learning and get hands-on experience.

Offline Training
Our iOS and Android mobile apps offer the ability to download videos and train anytime, anywhere offline.

Accountability Coaching
Develop and maintain a study plan with one-to-one assistance from coaches.

Supplemental Files
Files/materials that supplement the video training.

Speed Control
Play videos at a faster or slower pace.

Included in this course
Pick up where you left off watching a video.

Included in this course
Jot down information to refer back to at a later time.

Closed Captions
Follow what the trainers are saying with ease.
Garth Schulte
Nugget trainer since 2002