Prof. Frenzel
8 min read · Sep 30, 2024
#KB Cloud — Part I: Google Colab for Data Projects

Dear Data Scientists,

Has your code ever run flawlessly on your system, only to cause issues when a teammate tries to run it? This common ‘it works on my machine’ scenario can slow down collaboration, particularly in data science projects where consistency is key. Moving from local setups like Jupyter Notebook or RStudio to a cloud-based environment like Google Colab can eliminate these frustrations. Let’s take a closer look at how Colab can simplify this process and why it’s a great option for your projects.

What is the Cloud?

Cloud computing refers to the delivery of computing services — including servers, storage, databases, networking, software, analytics, and intelligence — over the Internet (“the cloud”). It allows users to access computing resources on-demand without the need to invest in and maintain physical infrastructure. With this approach, you can easily adapt to the changing computational and storage requirements of data projects, while keeping costs manageable.

Understanding Cloud Service Models

Cloud computing is structured around several service models that define the level of control and management you have over the computing resources:

  • Software as a Service (SaaS): In this model, software applications are delivered over the internet on a subscription basis. Users access the software via a web browser, and the service provider manages the infrastructure, platform, and software. Examples include Google Workspace and Salesforce.
  • Infrastructure as a Service (IaaS): This model provides the basic building blocks for cloud IT. It offers services such as virtual machines, storage, and networks, allowing users to rent infrastructure on a pay-as-you-go basis. Users have control over operating systems and deployed applications but not the underlying cloud infrastructure.
  • Platform as a Service (PaaS): PaaS delivers hardware and software tools over the internet, typically those needed for application development. It allows developers to focus on writing code without worrying about managing the underlying infrastructure. Services include development tools, database management, and business analytics.

Google Colab is considered a blend of PaaS and SaaS. It provides a platform for writing and executing Python code through a web-based interface, without the need to install software locally. Colab handles the setup and maintenance of the underlying infrastructure and environment, enabling you to focus on developing your data science projects. As you explore these cloud service models, you’ll notice how tools like Google Colab align with the larger cloud structure and can improve your workflow.

Why Google Colab?

Google Colab offers a browser-based environment that supports Python development and eliminates many of the common issues with local setups. Whether you’re working on a small assignment or a computationally intensive machine learning project, Colab provides a consistent, sharable, and collaborative environment for you and your team.

Key Advantages of Google Colab

Here are some compelling reasons to consider using Google Colab for your data projects:

  • Consistent Environment: A common issue in team coding projects is getting everyone on the same page with their environment. Library or OS version mismatches can cause bugs that are tough to debug. Colab makes things easier by providing a cloud-based environment where the code and libraries behave the same for all users. Sharing the notebook lets your teammates work in the same conditions, helping avoid dependency headaches.
  • Collaboration Features: Google Colab integrates with Google Drive and allows real-time collaboration. Similar to Google Docs, multiple users can work on the same notebook and leave comments or suggestions. This helps on team projects, where dividing tasks and reviewing code are essential. The built-in revision history makes it simple to track changes and revert to previous versions if needed.
  • Access to Free GPU Resources: If you’re working on data-heavy or computationally demanding tasks like deep learning, Google Colab offers free access to GPUs and TPUs, subject to usage limits and availability. Using these accelerators for tasks like training machine learning models allows for faster execution than standard CPUs, without the need for expensive hardware.
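
When a teammate reports different behavior, a quick environment report can help confirm whether versions really match. The following is a minimal sketch using only the Python standard library; the function name `environment_report` and the default package list are illustrative assumptions, and finding `nvidia-smi` on the PATH is only a rough hint that a GPU runtime is attached.

```python
import platform
import shutil
import sys
from importlib import metadata


def environment_report(packages=("numpy", "pandas", "scikit-learn")):
    """Collect interpreter, OS, and package versions to compare across machines."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # Rough hint only: nvidia-smi on PATH suggests an NVIDIA GPU runtime.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report
```

Running `environment_report()` in the first cell of a shared notebook and comparing the output with a local run makes version mismatches visible before they turn into bugs.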

Comparison with Other Cloud-Based Notebooks

While Google Colab is a powerful tool, it’s not the only cloud-based notebook environment available. Other platforms include:

  • Azure Notebooks: Offered by Microsoft, Azure Notebooks provides a free hosted Jupyter notebook service. It supports multiple programming languages and integrates with Azure services. It’s suitable if you’re working within the Azure ecosystem.
  • Amazon SageMaker Notebooks: Part of AWS, SageMaker offers managed Jupyter notebooks with the ability to scale compute resources. It’s designed for building and deploying machine learning models at scale.
  • Kaggle Notebooks (formerly Kaggle Kernels): Kaggle provides free access to Jupyter notebooks with up to 20 GB of disk space. It’s a great option if you’re participating in Kaggle competitions or want access to a community of data scientists.

Each platform has its advantages and fits different needs. I often prefer Google Colab for its simple interface, smooth integration with Google Drive, and the bonus of free GPU access. It’s an easy option when you want to get started quickly, without going through complex cloud setups, and it works well if you’re already familiar with Google’s ecosystem.

Getting Started with Google Colab

Creating and Sharing Notebooks

To create your first Colab notebook, visit Google Colab (colab.research.google.com) and sign in with your Google account. From there, you can start a new notebook, which is automatically saved to your Google Drive. Sharing it with others is just as easy: click the “Share” button in the top right corner, set the appropriate permissions, and share the link.

Using Code and Text Cells

Colab lets you document your code clearly with text cells written in Markdown, a lightweight formatting language that helps you add structure to your notes. This way, you can explain your code’s logic, display formulas, or add helpful notes, making the notebook easier for others to follow. By mixing code cells with text, your notebook becomes both a functional tool and a comprehensive report.

Google Drive Integration for Larger Datasets

Handling large datasets? Colab provides easy integration with Google Drive, allowing you to mount your Drive and directly access files stored there. This eliminates the need to download large datasets to your local machine and ensures you can access them in a cloud-friendly manner.
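
Once a large file is reachable from the notebook, it often helps to process it in pieces rather than loading everything into memory. Here is a minimal sketch using only the standard library; the function names `iter_csv_chunks` and `count_rows` and the default chunk size are illustrative assumptions, not part of Colab.

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size=10_000):
    """Yield lists of row dicts (at most chunk_size rows each) without loading the whole file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def count_rows(path, chunk_size=10_000):
    """Count data rows by streaming through the file chunk by chunk."""
    return sum(len(chunk) for chunk in iter_csv_chunks(path, chunk_size))
```

The same pattern applies to any per-chunk work, such as filtering rows or accumulating summary statistics, and keeps memory usage bounded by the chunk size.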

How to Change Runtime

While Google Colab is primarily designed for Python, you can run R code by installing the IRkernel. Here’s how:

1. Install R and the IRkernel in your Colab notebook:

# Install R
!apt-get install -y r-base

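# Install IRkernel and register it for all users
# (repos is set explicitly so the non-interactive call does not stop at a CRAN mirror prompt)
!R -e "install.packages('IRkernel', repos = 'https://cloud.r-project.org'); IRkernel::installspec(user = FALSE)"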

2. Change the runtime to R:

Go to Runtime > Change runtime type, and select R from the Runtime type dropdown menu. Now you can write and execute R code cells in your notebook.

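Accessing Google Drive with Python:

import pandas as pd
from google.colab import drive

# Mount Drive into the Colab file system (this triggers an authorization prompt)
drive.mount('/content/drive')

# Access data stored in Google Drive
data_path = '/content/drive/My Drive/datasets/my_data.csv'

# Read the data into a DataFrame
data = pd.read_csv(data_path)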
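
If the same notebook should also run on a local Jupyter setup, the Drive mount can be guarded so it only happens inside Colab. This is a minimal sketch: the helper names `in_colab` and `data_root` and the default folder names are illustrative assumptions, while `google.colab.drive.mount` is the same call used above.

```python
from pathlib import Path


def in_colab() -> bool:
    """Return True when the google.colab package is importable (i.e. we are inside Colab)."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
        return True
    except ImportError:
        return False


def data_root(local_root="data", drive_root="/content/drive/My Drive/datasets"):
    """Pick the dataset folder: mounted Drive in Colab, a local folder elsewhere."""
    if in_colab():
        from google.colab import drive
        drive.mount("/content/drive")
        return Path(drive_root)
    return Path(local_root)
```

A call such as `data_root() / "my_data.csv"` then resolves to the right location in both environments, so the rest of the notebook does not need to change.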

Accessing Google Drive with R:

# Install the googledrive package
install.packages('googledrive')
library(googledrive)
# Authenticate and list files
drive_auth()
drive_find(n_max = 10)

# Download a file from Google Drive
drive_download('my_data.csv', path = 'my_data.csv')

# Read the data into R
data <- read.csv('my_data.csv')

Google Authentication

When you’re working with Google services, especially when accessing Google Drive, you’ll encounter a process called OAuth 2.0 authentication. This is a security protocol that allows your application to interact with your Google account safely, without storing your actual Google password. The messages you see during this process are part of the authentication flow.

The first prompt may ask if you want to save the authentication token locally, which can save you from having to re-authenticate every time you run your script. The second part, where you’re directed to a webpage, is Google’s way of ensuring you’re granting permission to the right application.

After you’ve granted permission and received the authorization code, your application creates a token that it can use to access your Google Drive on your behalf. This token acts like a special key that lets your application access your Google Drive without knowing your actual password.
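
To make the "directed to a webpage" step more concrete, here is a minimal sketch of how an application might build the address of Google's consent page. In practice `drive.mount` and the googledrive package handle this for you; the function name `build_authorization_url`, the client ID, and the redirect URI shown in the usage note are hypothetical placeholders, and the sketch covers only the first step of the flow, not the exchange of the authorization code for a token.

```python
from urllib.parse import urlencode

# Google's OAuth 2.0 authorization endpoint
AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"


def build_authorization_url(client_id, redirect_uri, scopes, state):
    """Build the consent-page URL that the user is sent to in the browser."""
    params = {
        "client_id": client_id,        # identifies the application, not the user
        "redirect_uri": redirect_uri,  # where Google sends the authorization code
        "response_type": "code",       # ask for an authorization code
        "scope": " ".join(scopes),     # permissions being requested
        "state": state,                # opaque value echoed back to detect tampering
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"
```

For example, `build_authorization_url("demo-client-id", "http://localhost:8080/", ["https://www.googleapis.com/auth/drive.readonly"], "xyz")` requests read-only Drive access. Notice that the password never appears anywhere in the request, which is the point of the protocol.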

Best Practices in Google Colab

Setting up effective collaborative workflows in Colab can greatly enhance team productivity. Here are some guidelines:

  • Structuring Notebooks: Organize your notebook with clear sections and headings. Use Markdown cells to outline the purpose of each code block, making it easier for team members to follow along. For example, you might divide your notebook into sections like “Data Loading,” “Data Preprocessing,” “Model Training,” and “Evaluation.” Within each section, include explanations of what the code is doing and why. This approach helps team members understand the workflow and contribute more effectively.
  • Establish Coding Standards: Agree on coding conventions within your team, such as variable naming, code formatting, and documentation styles. Consistency improves readability and maintainability. For instance, decide whether to use snake_case or camelCase for variable names, and stick to one style throughout the project. Using consistent indentation and commenting practices also makes the codebase easier to navigate.
  • Utilize Commenting and Sharing Features: Take advantage of Colab’s commenting system to discuss code snippets directly within the notebook. This facilitates asynchronous communication and code reviews. Team members can highlight specific lines of code and leave comments or suggestions, which is especially helpful when collaborating across different time zones.
  • Collaborate Effectively: Use Colab’s sharing features to work with your team in real time. Multiple users can edit the notebook simultaneously, which is great for pair programming or brainstorming sessions. Real-time editing allows team members to see changes as they happen, enabling immediate feedback and collaboration. The commenting feature can be used to leave feedback or ask questions without altering the main content.
  • Integration with Version Control Systems: While Colab has basic version history, integrating with version control systems like GitHub or GitLab provides more robust tracking and collaboration features. You can save your Colab notebooks directly to GitHub repositories and synchronize changes. Managing versions through Git allows you to handle different branches and merge changes, which is important for larger projects with multiple contributors.
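
For notebooks that live in a public GitHub repository, Colab can open them directly from a URL that mirrors the GitHub path. The helper below is a small sketch of that URL pattern; the function name `colab_url_for_github` and the example repository in the usage note are illustrative assumptions.

```python
from urllib.parse import quote


def colab_url_for_github(owner, repo, path, branch="main"):
    """Build a link that opens a GitHub-hosted notebook in Colab."""
    # quote() keeps "/" intact and escapes characters such as spaces.
    notebook_path = quote(path.lstrip("/"))
    return (
        "https://colab.research.google.com/github/"
        f"{owner}/{repo}/blob/{branch}/{notebook_path}"
    )
```

For instance, `colab_url_for_github("octocat", "demo", "notebooks/eda.ipynb")` produces a link that teammates can click from a README to open the notebook in their own Colab session.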

🚧Limitations of Google Colab

  • Session Timeouts: Colab sessions tend to time out if you leave them idle for too long. Personally, I prefer to save my notebooks manually, even though notebooks stored in Google Drive are auto-saved. When working on something important, I also download a backup locally.
  • Storage Constraints: The file system in Colab is temporary and resets when the session ends. For persistent storage, integrate with Google Drive or use the file upload feature. Be aware that working with large datasets may require additional storage solutions.
  • Resource Limits: The free version of Colab has usage limits on CPUs, GPUs, and RAM. These resources are allocated dynamically and can vary based on the current demand. If you need more consistent or powerful hardware, consider upgrading to Colab Pro.
  • Session Expiry: Colab’s sessions have a limited duration (typically 12 hours), after which they expire, and any unsaved data in memory is lost. Be sure to save your work regularly and upload any critical files to persistent storage like Google Drive.
  • Internet Dependency: Colab requires a stable internet connection. For offline work, consider setting up a local Jupyter environment as a backup.
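
One way to soften the timeout and expiry limits is to write intermediate results to persistent storage as you go. The following is a minimal sketch, assuming your state fits in JSON and that `directory` points somewhere persistent, such as a folder on a mounted Drive; the function names `save_checkpoint` and `load_checkpoint` are illustrative.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state, directory, name="checkpoint.json"):
    """Write state as JSON to a temp file, then rename it into place.

    On a local file system the rename avoids leaving a half-written checkpoint;
    network-backed mounts may not give the same guarantee.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / name
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_name, target)
    return target


def load_checkpoint(directory, name="checkpoint.json", default=None):
    """Return the saved state, or default when no checkpoint exists yet."""
    target = Path(directory) / name
    if not target.exists():
        return default
    with open(target, encoding="utf-8") as handle:
        return json.load(handle)
```

A long-running loop can call `load_checkpoint` once at the start to resume where it left off, and `save_checkpoint` every few iterations so that an expired session costs minutes rather than hours.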

Google Colab’s cloud environment removes the hassle of setting up locally, supports real-time teamwork, and gives you access to robust computing resources — all for free. Whether you’re cleaning data, training machine learning models, or sharing your work, Colab makes it easier to get things done. Making the switch to Colab solves more than just technical setup challenges — it opens up new ways to work together on your data science tasks. Give it a try — it might change how you approach your next task.

Prof. Frenzel

Data Scientist | Engineer - Professor | Entrepreneur - Investor | Finance - World Traveler