Prof. Frenzel
9 min read · Feb 18, 2023

Dear friends!

As a data or business analyst, you may be struggling to find the right projects to showcase your skills and expertise. Maybe the projects you have been working on don’t align with your career goals, or you are wondering where to find high-quality datasets. In this second article of the “Data Sourcing” series, I will give an in-depth overview of open-source data analytics projects: the different types of projects available, what to look for in a project that matches your personal interests and career goals, and a comprehensive list of resources to help you find the right one. Are you ready? Let’s go!

Open Source Projects (OSP) …what? …why?

Building a compelling portfolio that showcases your skills and expertise is a must for anyone pursuing a career in this industry, PERIOD. Not only will this collection of your own projects (see my ➡️ AP articles) demonstrate your abilities to potential employers, but it will also let you explore and learn new tools and techniques that are essential in today’s data-driven world.

Open source has become an increasingly popular choice for developers, data analysts, and data scientists alike. One of the main reasons is that the source code or dataset is publicly available: anyone can access, contribute to, and modify it. This creates a collaborative environment that encourages innovation and the sharing of ideas. Additionally, open-source projects are often maintained by a community of developers and data professionals who volunteer their time to improve them, resulting in high-quality, well-supported products. For the data analytics industry this is particularly important, because it provides easy access to high-quality datasets and the latest tools and technologies, making it easier for us to stay on top of trends and advancements in the field.

What types of OSPs exist?

When it comes to open-source data analytics projects, there are many options to choose from. While the difficulty of these project types is subjective and varies with the specific project and your level of experience and expertise, I have ranked them by the technical knowledge and skills required, the complexity of the datasets and analysis methods involved, and the likelihood that the project involves cutting-edge technologies.

  1. Data Visualization: Projects that focus on data visualization are a great place to start for those who are new to data analysis. These projects often include datasets that are easy to understand and offer a variety of visual representations of that data.
  2. Data Analytics Projects: Projects that involve exploring and analyzing data to uncover insights, often using statistical and machine learning techniques. These projects require skills in data visualization, statistical analysis, and programming.
  3. Machine Learning: Projects that focus on machine learning can be more complex, but they offer a great opportunity to learn about cutting-edge technologies and apply them to real-world datasets. Many machine learning projects utilize popular 🖥️ Python libraries such as Scikit-learn and TensorFlow, making them great for learning new skills (see the first sketch after this list).
  4. Natural Language Processing: Natural language processing (NLP) is a growing field in data analytics (🖥️ChatGPT… duh), and open-source projects in this area can offer exciting opportunities to work with large and complex datasets. NLP projects involve working with text data and may include tasks such as sentiment analysis, topic modeling, and entity recognition.
  5. Time Series Analysis: Time series analysis projects focus on working with datasets that change over time. These projects often require a strong understanding of statistical methods and can be challenging but rewarding for those interested in forecasting and prediction (see the second sketch after this list). Don’t remember your econometrics course? That’s too bad; you should have taken it with me!
  6. Data Engineering Projects: Projects that focus on building and optimizing data pipelines, data warehousing, and data architecture. These projects often require skills in 🖥️ SQL, ETL tools, and data modeling.
  7. Big Data Projects: Projects that involve handling and processing very large datasets, often with distributed computing technologies like 🖥️Apache Hadoop and Apache Spark. These projects require skills in data storage, distributed computing, and programming.
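
To make the machine learning category concrete, here is a minimal Scikit-learn sketch of the workflow a first portfolio project might follow. It uses the Iris dataset that ships with Scikit-learn purely for illustration; any open dataset would work the same way.

```python
# Minimal Scikit-learn workflow: split an open dataset, train a model,
# and measure how well it generalizes to unseen data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# Hold out 30% of the data so the evaluation is honest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```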
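
And for the time series category, a small forecasting sketch. Since no particular library is prescribed here, I am assuming 🖥️ statsmodels and a synthetic monthly series purely for illustration; swap in any open time series dataset you find.

```python
# Holt-Winters forecasting sketch with statsmodels. The monthly series
# below is synthetic (trend + seasonality + noise) and stands in for
# any open time series dataset you might source.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(42)
idx = pd.date_range("2018-01-01", periods=60, freq="MS")  # 5 years, monthly
trend = np.linspace(100, 160, 60)
season = 10 * np.sin(2 * np.pi * idx.month.to_numpy() / 12)
y = pd.Series(trend + season + rng.normal(0, 3, 60), index=idx)

# Additive trend and seasonality with a 12-month cycle
model = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit()
print(fit.forecast(12))  # point forecasts for the next 12 months
```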

Data visualization and analytics projects, in my opinion, fall into the easier category because they mainly focus on the exploration and analysis of datasets. Machine learning projects are more complex, but they are still accessible to beginners, especially those who are familiar with machine learning platforms. Natural language processing projects are more advanced than traditional machine learning projects, as they require a deep understanding of text data and more advanced techniques like sentiment analysis and entity recognition. Time series analysis projects can be just as challenging, as they involve working with datasets that change over time and require a strong understanding of statistical methods. Data engineering projects, big data projects, and other projects that involve distributed computing and data processing are probably the most difficult, because they require specialized knowledge and skills in areas like data storage, distributed computing, and programming.

When looking for open-source data projects, it’s important to consider a few key questions: Does the project align with your personal interests? Does it support your career goals? And does its difficulty match your current skill level? In my experience, staying dedicated and disciplined is easier when you consider these three questions early on.

What are the Challenges?

While OSPs provide a vast amount of data, sourcing the right data for your project can be a challenge. Here are some common challenges you may encounter while sourcing data for your projects:

  1. Quality and Quantity: Not all data available on the internet is of high quality. You may need to spend a significant amount of time cleaning, processing, and structuring data to make it usable for your project (see the sketch after this list). Moreover, certain datasets may lack sufficient data to offer valuable insights, or may come without a data key explaining what the fields mean.
  2. Compatibility and Integration: If you are working with various data sources, you may face compatibility issues, such as different data formats or different data structures. These challenges may require you to create a custom data integration tool or integrate a commercial solution such as 🖥️Talend, Fivetran, or Informatica.
  3. Data Privacy and Access: Open-source data may be freely available, but some data sources may not allow you to access their data due to privacy policies, licensing restrictions, or terms of service.
  4. Data Bias and Misrepresentation: When working with open-source data, it’s essential to understand that data may be biased or misrepresented. This can result in inaccurate results, leading to erroneous insights or decisions.
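
As promised above, here is a small 🖥️ pandas sketch of what that cleanup step typically looks like. The file name and column names are hypothetical placeholders; the pattern of normalizing names, coercing types, and discarding unusable rows is what carries over to real datasets.

```python
# Typical open-data cleanup with pandas. "raw_open_data.csv" and the
# column names are hypothetical placeholders for your own dataset.
import pandas as pd

df = pd.read_csv("raw_open_data.csv")

df.columns = df.columns.str.strip().str.lower()  # normalize column names first
df = df.drop_duplicates()                        # remove duplicate rows

# Coerce types; unparseable values become NaN instead of raising errors
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["value"] = pd.to_numeric(df["value"], errors="coerce")

df = df.dropna(subset=["date", "value"])         # drop rows missing key fields
df.to_csv("clean_open_data.csv", index=False)
print(f"{len(df)} usable rows after cleaning")
```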

To overcome these challenges, you need to be persistent and patient. You may have to go through several sources to find the right data, and you may need to invest significant time in cleaning and processing the data. Data sourcing is just the first step, and proper data management, cleaning, and processing are critical to creating meaningful insights.

Popular Resources

  1. GitHub: GitHub is a great resource for finding open-source projects. Through its topics and search, you can find and contribute to data analytics projects in a variety of languages, and you can filter projects by popularity indicators such as the number of stars, forks, or contributors.
  2. Kaggle: Kaggle is a popular platform for data science competitions, but it also hosts a large collection of datasets that can be used for data analytics projects. Users can upload their own datasets and analyze them in Kaggle’s hosted notebooks, or they can participate in data science competitions to hone their skills.
  3. Data.gov: Data.gov is a repository of over 200,000 datasets from various federal agencies in the United States. These datasets can be used for data analytics projects, and they cover a wide range of topics, including climate, education, energy, finance, and health.
  4. UCI Machine Learning Repository: The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The datasets cover a wide range of topics, including finance, healthcare, and engineering.
  5. OpenML: OpenML is an open-source platform for machine learning and data analysis. It provides access to over 20,000 datasets and algorithms, and it allows users to upload their own datasets for analysis. OpenML also provides tools for data preprocessing, visualization, and evaluation (see the sketch after this list).
  6. DataCamp: DataCamp is an online learning platform that offers courses on data analytics, data science, and machine learning. In addition to the courses, DataCamp also offers projects that users can work on to apply their knowledge. The projects cover a wide range of topics, including data cleaning, data manipulation, and machine learning.
  7. Data.world: Data.world is an online community where users can find and collaborate on a variety of datasets, including open data, enterprise data, and personal data. Users can search for datasets, explore data visualizations, and connect with other data professionals to share knowledge and insights.
  8. Google Dataset Search: Google Dataset Search is a search engine for datasets. It allows users to search for datasets based on keywords, topics, and file types. Google Dataset Search provides links to the sources of the datasets, which could be useful for finding open-source data analytics projects.
  9. AWS Open Data: Amazon Web Services (AWS) also provides access to a wide variety of public datasets through its Registry of Open Data, which lists all of the datasets that are available for you to browse. To access them, you will need an AWS account; however, Amazon’s free tier for new accounts enables you to explore the data without being charged.
  10. [what am I missing here? Please comment!]
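
Several of these sources can be queried programmatically. As one example, Scikit-learn’s fetch_openml can pull an OpenML dataset straight into pandas by name; “titanic” below is just one well-known dataset on OpenML, not a special case.

```python
# Load an OpenML dataset by name via Scikit-learn. "titanic" is just one
# well-known example; any OpenML dataset name (and version) works.
from sklearn.datasets import fetch_openml

titanic = fetch_openml("titanic", version=1, as_frame=True)
df = titanic.frame  # features and target together as a pandas DataFrame

print(df.shape)                        # (rows, columns)
print(df["survived"].value_counts())   # quick look at the target variable
```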

Alternatives to OSPs? Welcome to Web Scraping!

Although open-source projects offer a wealth of data, there may be times when you need data that is not available in a dataset or API. In such cases, you can use web scraping techniques to extract data from websites, for example to collect data from movie databases, job portals, or e-commerce sites.

  • Manual web scraping involves manually inspecting a website’s HTML, and copying and pasting the desired data into a CSV file. While this approach may seem simple, it is generally time-consuming and not ideal for handling large datasets or scaling.
  • Automated web scraping involves using tools or software to automatically extract data from websites. This method is faster and more suitable for large datasets. Popular web scraping tools that require coding skills include 🖥️Beautiful Soup, Scrapy, and Selenium (see the first sketch after this list). Alternatively, no-code solutions like 🖥️Apify and Octoparse offer easy-to-use web scraping tools that come with free plans. Apify provides hundreds of ready-to-use tools, while Octoparse offers a cloud service to store extracted data and IP rotation to prevent IPs from getting blocked.
  • API scraping involves using an API to extract data from a website. APIs deliver data in a structured format, making it easier to work with. However, not all websites offer APIs, and some APIs limit the amount of data you can extract. For example, social media platforms like Twitter, Facebook, and Instagram offer APIs that allow developers to extract data such as user profiles, posts, comments, and follower counts. With the Twitter API, you can retrieve tweets that match a specific keyword, location, or user account, and extract metrics like retweets, likes, and replies to better understand how a particular tweet is performing. Keep in mind that these APIs often have rate limits and may require authentication to access certain data (see the second sketch after this list).
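
Here is the first sketch: a minimal requests + Beautiful Soup scraper. The URL and the table structure it assumes are hypothetical, so adapt the selectors to the page you actually target, and check that page’s terms of service first.

```python
# Minimal automated scraping with requests + Beautiful Soup. The URL and
# the assumed table layout are hypothetical; adapt the selectors to your
# target page, and respect its terms of service and robots.txt.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/movies"  # hypothetical target page
resp = requests.get(url, headers={"User-Agent": "portfolio-project/0.1"}, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
rows = []
for tr in soup.select("table tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if cells:
        rows.append(cells)

with open("movies.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

print(f"Scraped {len(rows)} rows from {url}")
```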
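
And the second sketch: API scraping usually boils down to authenticated HTTP requests that return JSON. The endpoint, parameters, response shape, and token below are hypothetical; real APIs such as Twitter’s define their own URLs, query syntax, authentication, and rate limits, so always consult the provider’s documentation.

```python
# Generic API-scraping sketch. The endpoint, parameters, response shape,
# and token are hypothetical placeholders; check the real API's docs.
import time

import requests

API_URL = "https://api.example.com/v1/posts"  # hypothetical endpoint
TOKEN = "YOUR_API_TOKEN"                      # placeholder credential

def fetch_posts(keyword: str, pages: int = 3) -> list:
    """Collect paginated results for a keyword, backing off when rate limited."""
    results = []
    page = 1
    while page <= pages:
        resp = requests.get(
            API_URL,
            params={"q": keyword, "page": page},
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=10,
        )
        if resp.status_code == 429:  # rate limited: wait, then retry this page
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        resp.raise_for_status()
        results.extend(resp.json().get("items", []))
        page += 1
    return results

posts = fetch_posts("data analytics")
print(f"Fetched {len(posts)} posts")
```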

It’s important to note that web scraping can be a legal grey area, and you should always check the website’s terms of service before scraping any data. Additionally, some websites may have anti-scraping measures in place, such as captchas or IP blocking, so you should be mindful of these when scraping data.

As a final piece of advice, remember that building a compelling portfolio takes time and dedication. Don’t be discouraged if you encounter challenges along the way, especially when working with open-source projects. Finding high-quality datasets that align with your interests and career goals can be difficult, and even when you do find the right project, cleaning and processing the data can be time-consuming. However, staying persistent and disciplined and remaining open to learning new skills and techniques will help you overcome these challenges. Do not hesitate to explore other avenues such as web scraping to source the data you need, or to join a community such as Kaggle, which provides valuable networking opportunities as well as access to a vast array of resources and datasets.

Prof. Frenzel
