Getting started with the Pandas library

Pandas is a powerful Python library that is commonly used by data scientists and data analysts to explore, clean, and process their data. Typically, when working with Pandas, you will read your data into a DataFrame, which provides a table-like structure for you to work with. In this article, we'll discuss some common methods that you can expect to use when working with Pandas.

head()

This method returns the first 5 rows of your data by default. You can also pass an integer as an argument to display a different number of rows. It lets you take a quick glance at the data, which can help you spot anything glaringly wrong that should be investigated before you move on to further processing or modeling.
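As a minimal sketch of the above, the snippet below builds a small DataFrame and looks at its first rows. The fruit and price columns and their values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names and values are assumptions.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "cherry", "apple", "banana", "apple", "cherry"],
    "price": [1.20, 0.50, 3.00, 1.10, 0.55, 1.25, 2.90],
})

first_five = df.head()   # no argument: the first 5 rows
first_two = df.head(2)   # an integer argument: that many rows

print(first_five)
print(first_two)
```

In a notebook you would usually just write df.head() as the last line of a cell and let the notebook render the result.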

info()

This method is extremely useful when performing exploratory data analysis (EDA). EDA helps you catch obvious errors and reveal patterns or outliers before any statistical modeling is done. It can answer questions you already had about the dataset, or lead you to ask new ones before you apply any further analytical techniques.

The info() method prints out information about your DataFrame, including the data type of each column and the number of non-null values in it. This can help you quickly identify whether certain columns need to be cast to a different data type, and whether there might be any missing values in your data. Perhaps a date column is actually just a string, or a column containing a monetary value is a string when it should be a float. Scanning the printout of the info() method can help you quickly identify issues like these.
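Here is a small sketch of what that looks like in practice. The order_date, amount, and quantity columns are hypothetical, and the data is deliberately messy: dates and amounts are stored as strings, and one date is missing.

```python
import pandas as pd

# Hypothetical, deliberately messy data (column names are assumptions).
df = pd.DataFrame({
    "order_date": ["2023-01-05", "2023-01-09", None],  # dates stored as strings
    "amount": ["19.99", "5.00", "42.50"],               # money stored as strings
    "quantity": [1, 2, 3],
})

# Prints each column's name, non-null count, and data type.
df.info()

# The same non-null counts as a Series, handy for checking programmatically.
non_null = df.notna().sum()
print(non_null)
```

In the printout, order_date should show fewer non-null values than the other columns, and neither order_date nor amount should show a datetime or float type, which is the kind of issue described above.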

to_datetime()

Perhaps after using the info() method, you noticed that one of your date columns was actually just a string. You can pass that column to the to_datetime() function, which is called on the pandas module itself rather than on the DataFrame, to convert it to a datetime data type. Converting these date columns lets you perform datetime calculations, such as computing the difference between two dates in days, which would not be possible if the columns remained strings.
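The following sketch converts two string columns and computes the difference in days. The ordered and delivered column names and the dates are invented for the example.

```python
import pandas as pd

# Hypothetical data: both date columns start out as plain strings.
df = pd.DataFrame({
    "ordered": ["2023-01-05", "2023-01-09"],
    "delivered": ["2023-01-08", "2023-01-20"],
})

# to_datetime is called on the pandas module and returns a converted column.
df["ordered"] = pd.to_datetime(df["ordered"])
df["delivered"] = pd.to_datetime(df["delivered"])

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
df["days_to_deliver"] = (df["delivered"] - df["ordered"]).dt.days

print(df)
```

If your dates are not in an unambiguous format, to_datetime() also accepts a format argument (for example format="%d/%m/%Y") so you can state the layout explicitly.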

value_counts()

This method returns the number of times each unique value occurs in a column. By selecting a column of your DataFrame and calling value_counts() on it, you can see the different possible values in that column and how often each one occurs, which helps you judge whether the values are evenly distributed. For example, if your DataFrame has a fruit column, value_counts() will output how many rows each unique fruit appears in.
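To make the fruit example concrete, here is a minimal sketch with made-up data. The normalize=True variant is an extra not mentioned above; it returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical fruit column; the values are made up for illustration.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "cherry", "apple", "banana"],
})

# Count of rows per unique fruit, sorted from most to least frequent.
counts = df["fruit"].value_counts()
print(counts)

# Same information as a share of all rows.
shares = df["fruit"].value_counts(normalize=True)
print(shares)
```

Here apple appears in half of the rows, so this column is clearly not evenly distributed.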

nunique()

This method returns the number of unique values in each column of your DataFrame. If you call it on a single column, it returns the number of unique values in that column. Perhaps you have a very large dataset with lots of unique values. It would be difficult to tell how many there are from the output of value_counts(), and nunique() gives you that number directly.
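A short sketch of both uses, again on invented data with hypothetical fruit and store columns:

```python
import pandas as pd

# Hypothetical data; column names and values are assumptions.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "cherry"],
    "store": ["north", "north", "south", "south"],
})

# On the whole DataFrame: one count of unique values per column.
per_column = df.nunique()
print(per_column)

# On a single column: a single integer.
fruit_only = df["fruit"].nunique()
print(fruit_only)
```

By default nunique() ignores missing values; passing dropna=False counts them as a value of their own.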


Here is a great tutorial on the Pandas library from W3Schools.

 
 

Andrew Dang

My name is Andrew Dang. I love creating and automating data pipelines. What excites me the most is the process of figuring out what transformations need to be applied to the raw data to map it to a form that enables stakeholders to answer their specific business questions. Every new data source is a new puzzle, and I love nothing more than a challenging problem.

https://www.andrewdang.ca