15 Places to Find Free Datasets for Your Data Science Projects

If you’ve ever worked on a personal data science project, you’ve probably spent a lot of time scouring the internet for interesting datasets to analyze.

It can be fun to sift through dozens of datasets to find the best fit, but it can also be frustrating to download and import multiple CSV files, only to find that the data is incomplete or not very interesting. Fortunately, there are online repositories that curate datasets and (mostly) remove the uninteresting ones.

The Essential Data Cleansing Checklist

Data quality issues, such as missing, duplicate, inaccurate, invalid, and inconsistent values, cause headaches in finding and using datasets. A suitable data cleansing procedure deals with this bad data and makes the result usable by other people and systems.

A helpful data cleansing process standardizes data, fixes or removes erroneous values, and formats records to be readable. You get these results when you know your data’s original purpose and can picture the good data you need to meet new goals. To achieve your objectives, build a solid foundation and work through the essential data cleansing checklist in this article.
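The three steps named above (standardize, fix or remove erroneous values, format records) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `name` and `email`, the rule that a valid email contains `@`, and the helper `clean_records` are all hypothetical and do not come from the checklist itself.

```python
def clean_records(records):
    """Standardize, drop invalid rows, and de-duplicate a list of dicts.

    The field names and validity rules here are illustrative assumptions.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Standardize: trim whitespace and normalize letter case.
        name = (rec.get("name") or "").strip().title()
        email = (rec.get("email") or "").strip().lower()
        # Remove erroneous values: a missing name or an email without "@".
        if not name or "@" not in email:
            continue
        # Remove duplicates, keyed on the normalized email.
        if email in seen:
            continue
        seen.add(email)
        # Format: emit records with a consistent shape.
        cleaned.append({"name": name, "email": email})
    return cleaned


raw = [
    {"name": "  ada lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},
    {"name": "", "email": "nobody@example.com"},
    {"name": "Alan Turing", "email": "not-an-email"},
    {"name": "grace hopper", "email": "grace@example.com"},
]
result = clean_records(raw)
```

Standardizing before de-duplicating matters here: the first two rows only compare equal once case and whitespace have been normalized.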

Splitting Lines and Numbering the Pieces

As I mentioned in my computational survivalist post, I’m working on a project where I have a dedicated computer with little more than basic Unix tools, ported to Windows. It’s given me a new appreciation for how the standard Unix tools fit together; I’ve had to rely on them for tasks I’d usually do a different way.

I’d seen the nl command before for numbering lines, but I thought, “Why would you ever want to do that? If you want to see line numbers, use your editor.” That way of thinking looks at the tools one at a time, asking what each can do, rather than thinking about how they might work together.
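To make the idea in the title concrete, here is a rough Python sketch of splitting a line into pieces and numbering them. It is not the Unix pipeline itself and does not reproduce the exact output format of nl; the comma delimiter and the helper name `number_pieces` are assumptions made for illustration.

```python
def number_pieces(line, sep=","):
    """Split a line on a delimiter and number the resulting pieces from 1."""
    pieces = [piece.strip() for piece in line.split(sep)]
    # Pair each piece with its position, separated by a tab.
    return [f"{i}\t{piece}" for i, piece in enumerate(pieces, start=1)]


numbered = number_pieces("alpha, beta, gamma")
for row in numbered:
    print(row)
```

The numbering step is trivial on its own, which is the point: it becomes useful only once something upstream has broken the line into pieces.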

Practical Strategies to Handle Missing Values

One of the major challenges in most BI projects is getting clean data. By common estimates, 60 to 80 percent of total project time is spent cleaning the data before you can make any meaningful sense of it. This is true for both BI and predictive analytics projects. To make data cleaning more effective, the current trend is to move from manual cleaning to more intelligent, machine learning-based processes.

Identify the Type of Missing Values We Are Dealing With

Before we dig into how to handle missing values, it's critical to understand their nature. There are three possible types, depending on whether a relationship exists between the missing data and the other data in the dataset.
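One rough way to start looking for such a relationship is to count the missing values per column and then compare how often a column is missing across the groups of another column. The sketch below assumes rows are dictionaries of strings, that the markers in `MISSING` are how gaps are encoded, and that the columns are called `age_group` and `income`; all of these are hypothetical. A large gap between groups hints that the missingness is related to other data, but it does not establish the type on its own.

```python
from collections import Counter

# Assumed encodings of a missing value; adjust for the dataset at hand.
MISSING = {"", "NA", "N/A", "null", None}


def missing_counts(rows):
    """Count missing values per column."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if value in MISSING:
                counts[col] += 1
    return dict(counts)


def missing_rate_by_group(rows, target, group):
    """Share of rows where `target` is missing, for each value of `group`."""
    totals, missing = Counter(), Counter()
    for row in rows:
        key = row[group]
        totals[key] += 1
        if row[target] in MISSING:
            missing[key] += 1
    return {key: missing[key] / totals[key] for key in totals}


rows = [
    {"age_group": "18-30", "income": "42000"},
    {"age_group": "18-30", "income": ""},
    {"age_group": "18-30", "income": "NA"},
    {"age_group": "31-50", "income": "58000"},
    {"age_group": "31-50", "income": "61000"},
    {"age_group": "31-50", "income": ""},
]
counts = missing_counts(rows)
rates = missing_rate_by_group(rows, target="income", group="age_group")
```

In this toy data, income is missing twice as often in the younger group, which is the kind of pattern worth checking before choosing a strategy.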