Best Sources of Datasets for Machine Learning


January 18, 2022

"Top sources of structured and unstructured datasets that help in machine learning and data science experiments"

This blog post covers the prominent and essential dataset sources that anyone can find and use free of charge under open or limited access. We list the frequently used, well-established and reliable ones. Please read the terms and conditions of use of every dataset you download before using it.

  • Kaggle
  • Google Dataset Search
  • University dataset sources
  • Government dataset sources
  • Registry of Open Data on AWS (RODA)
  • FiveThirtyEight
  • Open Corporates

Kaggle:

Kaggle was established as a platform for competitions and eventually became one of the world's prominent sources of cleaned structured and unstructured datasets. Currently, it holds more than 50,000 datasets and 400,000 public notebooks built on those datasets. It is an excellent starting point for anyone who wants to understand the data science domain.

Google Dataset Search:

Google Dataset Search is a search engine dedicated to datasets. It lets us search millions of datasets published across the web by keyword. Google also offers the Google Public Data Explorer, which includes high-quality datasets.
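Since the search is keyword based, a query can be turned into a shareable link. The snippet below is a minimal sketch, not an official API: the base URL and the `query` parameter are assumptions taken from how the public search page appears in a browser, and `dataset_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: address of the public search page; it is not given in this post.
DATASET_SEARCH_URL = "https://datasetsearch.research.google.com/search"


def dataset_search_url(*keywords: str) -> str:
    """Return a search-page URL for the given keywords (hypothetical helper)."""
    terms = [k.strip() for k in keywords if k.strip()]
    if not terms:
        raise ValueError("at least one non-empty keyword is required")
    # urlencode escapes spaces and punctuation so the query is URL-safe.
    query_string = urlencode({"query": " ".join(terms)})
    return DATASET_SEARCH_URL + "?" + query_string


if __name__ == "__main__":
    print(dataset_search_url("human protein", "classification"))
```

Opening the printed link in a browser should show the matching datasets. The helper only builds the URL; it does not download or parse any results.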

University dataset sources:

As part of their research, many renowned universities across the globe regularly make datasets public and allow anyone to access them on their websites. Some of these sources are:

More and more universities have recently adopted the same approach of making their datasets public.

Government dataset sources:

Registry of Open Data on AWS (RODA):

As one of the major data analytics service providers, AWS also maintains a registry of open datasets available to its users. The datasets can be downloaded free of charge from its website.
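Many datasets in the registry are hosted in Amazon S3 buckets. As a minimal sketch, assuming a bucket that allows anonymous reads and has no dots in its name, the following hypothetical helper builds the HTTPS address of one object. The bucket and key in the example are placeholders, not real registry entries.

```python
from urllib.parse import quote


def public_s3_url(bucket: str, key: str) -> str:
    """Build the HTTPS URL of an object in a publicly readable S3 bucket.

    Assumption: the bucket permits anonymous reads; check the dataset's
    registry page for the real bucket name and its terms of use.
    """
    if not bucket or not key:
        raise ValueError("bucket and key must be non-empty")
    # Keep "/" unescaped so the key's folder-like structure is preserved.
    encoded_key = quote(key.lstrip("/"), safe="/")
    return "https://" + bucket + ".s3.amazonaws.com/" + encoded_key


if __name__ == "__main__":
    # Placeholder names for illustration only.
    print(public_s3_url("example-open-data", "2022/sample file.csv"))
```

The resulting URL can be passed to any HTTP client or, for CSV files, directly to a data-frame reader. Buckets that require requester-pays or signed access will not work this way.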

FiveThirtyEight:

This site maintains datasets covering various sectors such as politics, sports and economics.

Open Corporates:

Just as universities and government agencies publish their data, Open Corporates publishes company data. It is the largest such source, covering more than 200 million companies.

This is not the end of the list. Many other dataset sources exist, and they are just as rich as the ones mentioned above. Some of them are listed below:


BuzzFeed News on GitHub

Please feel free to mention any other interesting dataset sources that are missing from the list above.

Dr Santhoshkumar S PhD,
Researcher | Senior Technology Journalist
