Scraping data to create a custom dataset

Finding datasets

There are tons of places to find free and publicly available datasets that you can use for work or personal projects. Some of the sources we often turn to include Kaggle, Google Dataset Search, Statista, Our World in Data, and Open Data Sources. But what do you do when you can't find the information you need, or can only find part of it? We ran into this issue when exploring countries with the most national parks. We couldn't find a dataset with the information we were seeking: the number of national parks in each country, the names of the national parks within each country, and the GPS coordinates of each park so we could plot them on a map. We had access to an International Union for Conservation of Nature (IUCN) dataset, but it was not clear whether the parks in it were national parks or another category of protected area, such as a reserve. Luckily, Wikipedia has all this information; the challenge was that the data was spread across different pages. We turned to our good friend Andrew Dang, a data engineer, to help us build this dataset. He scraped Wikipedia and created the spreadsheet we needed. Read on to find out how he did it.

Scraping data

While doing some research, we realized that Wikipedia had the information we needed, but it was scattered across various pages and organized in different formats. We looked for an existing dataset that contained this information but could not find one, so the solution we landed on was to scrape the data from Wikipedia using Python. The web scraping code and data files can be found on GitHub.

Methodology

The data was scraped from Wikipedia using Python. The BeautifulSoup library was used to parse the HTML of each webpage. 
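A minimal sketch of this step is shown below. The `requests` library and the exact page URL are illustrative assumptions, not necessarily what the actual scraper used.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL of the main page listing national parks by country.
MAIN_URL = "https://en.wikipedia.org/wiki/List_of_national_parks"

response = requests.get(MAIN_URL, headers={"User-Agent": "national-parks-scraper"})
response.raise_for_status()

# Parse the raw HTML so it can be searched by tag, attribute, and text.
soup = BeautifulSoup(response.text, "html.parser")
```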

The scraping process began with a Wikipedia page (we will call this the main page from here on) containing several tables that list the number of national parks in each country. Each country in these tables links to that country's own Wikipedia page. By scraping the tables on the main page, we aimed to get the name of each country, the number of national parks in that country, and the URL of that country's page.
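Continuing from the previous sketch, these tables could be read along the following lines. The `wikitable` class and the column positions are assumptions here; the real tables may need extra handling.

```python
# Collect (country name, park count, country URL) from each row of the tables.
countries = {}
for table in soup.find_all("table", class_="wikitable"):
    for row in table.find_all("tr")[1:]:              # skip the header row
        cells = row.find_all(["th", "td"])
        if len(cells) < 2:
            continue
        link = cells[0].find("a", href=True)
        if link is None:                              # country without a URL
            continue
        count_text = cells[1].get_text(strip=True).replace(",", "")
        countries[link.get_text(strip=True)] = {
            "url": "https://en.wikipedia.org" + link["href"],
            "parks_listed": int(count_text) if count_text.isdigit() else None,
        }
```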

Once we had the URL for each country, we scraped its contents to get the name and URL of each national park found in that country. The country pages did not have a consistent structure, and many if-else statements were required to account for these differences when looking for the data we were interested in. In most cases, however, the names and URLs of the national parks were organized in one or more tables or unordered lists, and a table or unordered list could often be found directly after an HTML heading containing the text "National Park". By using BeautifulSoup to look for specific HTML tags, attributes, and elements on each country's webpage, we were able to get the names and URLs of the national parks for most countries. Additional code was written to collect national park names and URLs for countries that did not conform to this structure.
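A sketch of that general case follows. The heading and list/table layout assumed here will not match every country page, which is exactly why the real scraper needed many extra branches.

```python
def park_links(country_soup):
    """Return {park name: park URL} for one country page.

    Looks for a heading whose text contains "national park" and reads the
    table or unordered list that follows it.
    """
    parks = {}
    heading = None
    for h in country_soup.find_all(["h2", "h3"]):
        if "national park" in h.get_text(strip=True).lower():
            heading = h
            break
    if heading is None:
        return parks
    section = heading.find_next(["table", "ul"])
    if section is None:
        return parks
    for link in section.find_all("a", href=True):
        href = link["href"]
        if href.startswith("/wiki/") and ":" not in href:   # skip File:, Category:, ...
            parks[link.get_text(strip=True)] = "https://en.wikipedia.org" + href
    return parks
```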

After obtaining the URL of each national park, we scraped its contents to get the latitude and longitude of the park. The webpages list these coordinates in degrees-minutes-seconds, which we converted to decimal degrees using the `dms2dec` library. The geographic coordinates of each national park appear in both degrees-minutes-seconds and decimal degrees in the `national_parks.csv` file.
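A sketch of the coordinate step, assuming the park pages carry Wikipedia's usual `latitude`/`longitude` span markup and that `dms2dec` exposes a `dms2dec` function accepting a DMS string:

```python
from dms2dec.dms_convert import dms2dec   # DMS string -> decimal degrees

def park_coordinates(park_soup):
    """Return the coordinates of one park page, or None if they are absent.

    Assumes spans with the "latitude" and "longitude" classes hold
    degrees-minutes-seconds text such as 36°44′49″N.
    """
    lat_span = park_soup.find("span", class_="latitude")
    lon_span = park_soup.find("span", class_="longitude")
    if lat_span is None or lon_span is None:
        return None
    lat_dms = lat_span.get_text(strip=True)
    lon_dms = lon_span.get_text(strip=True)
    return {
        "lat_dms": lat_dms,
        "lon_dms": lon_dms,
        "lat_dec": dms2dec(lat_dms),
        "lon_dec": dms2dec(lon_dms),
    }
```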

The scraped data was organized in a nested dictionary keyed by country name. The value for each key was another dictionary that stored the URL of the country page, the number of parks listed on the main Wikipedia page for that country, the number of parks we were able to find coordinates for, and, finally, yet another dictionary storing the name and URL of each national park found in that country.
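The shape of that nested dictionary, with purely illustrative values and key names:

```python
scraped = {
    "Canada": {
        "url": "https://en.wikipedia.org/wiki/National_Parks_of_Canada",
        "parks_listed": 38,                # count from the main Wikipedia page
        "parks_with_coordinates": 37,      # parks we found coordinates for
        "parks": {
            "Banff National Park": {
                "url": "https://en.wikipedia.org/wiki/Banff_National_Park",
                "lat_dms": "51°30′N", "lon_dms": "116°00′W",
                "lat_dec": 51.5, "lon_dec": -116.0,
            },
            # ...one entry per park
        },
    },
    # ...one entry per country
}
```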

The scraped results were ultimately organized into a Pandas DataFrame and exported as a CSV file (saved as `national_parks.csv`). Each record contains a country name, a national park name, a national park URL, and the latitude and longitude in both degrees-minutes-seconds and decimal degrees. The records were built by looping through each national park in each country within the scraped results dictionary and appending the relevant data to a list with each iteration. Many of the park names and country names had additional text, which was removed to clean the dataset. A similar process was used to create the `missing_coordinates.csv` and `summary_table.csv` files.
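A sketch of the flattening step, assuming the nested dictionary from the previous sketch; the column names here are illustrative rather than the exact headers in `national_parks.csv`.

```python
import pandas as pd

# One record per national park, built by looping through the nested dictionary.
records = []
for country, info in scraped.items():
    for park, details in info["parks"].items():
        records.append({
            "country": country,
            "national_park": park,
            "park_url": details["url"],
            "lat_dms": details["lat_dms"],
            "lon_dms": details["lon_dms"],
            "lat_dec": details["lat_dec"],
            "lon_dec": details["lon_dec"],
        })

national_parks = pd.DataFrame(records)
national_parks.to_csv("national_parks.csv", index=False)
```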

Limitations

The main Wikipedia page lists a total of 3,257 national parks worldwide. We were only able to find coordinates for 2,836 national parks and have added them to this dataset. There are 90 national parks that are missing coordinates due to missing country URLs, 333 national parks are missing coordinates due to missing national park URLs, and another 37 national parks are missing coordinates due to the coordinates not being present in the national park URL. 

If a country did not have a URL on the main Wikipedia page, then it was not possible to get the names and coordinates of that country's national parks. Likewise, if a national park itself did not have a URL, its coordinates were usually not scraped. Occasionally, the coordinates were present in a table on the country page, in which case they were scraped anyway, so the absence of a national park URL does not always mean the absence of geographic coordinates. In most cases, however, not having a national park URL meant that we scraped fewer national parks for a country than were listed on the main Wikipedia page.

Upon investigation, it appeared that some country webpages listed parks designated as national parks under their own national definition rather than the IUCN definition. Other webpages listed decommissioned national parks. The scraper did not account for these scenarios. There were also country webpages that listed other protected areas, such as conservation areas, that did not have the national park designation. While an attempt was made to filter out the non-national parks, it was not always successful, and a handful of records in the dataset do not fall under the IUCN definition of a national park. The scenarios described above resulted in some countries having more national parks scraped than were listed on the main Wikipedia page.

Validating the data

We checked our dataset against the figures available on Wikipedia. We weren't able to find all the information we were looking for using this method. Why is it important to validate your data? When presenting data in a table, graph, map, or whatever your choice of data visualization may be, it is important that the data is accurate and complete. Without checking that the data is accurate and complete, we run the risk of coming to incorrect conclusions about the data and potentially misleading the audience.

Andrew Dang

My name is Andrew Dang. I love creating and automating data pipelines. What excites me the most is the process of figuring out what transformations need to be applied to the raw data to map it to a form that enables stakeholders to answer their specific business questions. Every new data source is a new puzzle, and I love nothing more than a challenging problem.

https://www.andrewdang.ca