What is Pandas in Python?
Pandas is an open-source Python library for working with datasets. It provides functions for analyzing, cleaning, exploring, and manipulating data, and it helps us find information and answer questions using statistical analysis.
We can use Pandas to select and merge data and to clean messy datasets so they are easier to read and work with. Pandas is commonly used in data science, which is a branch of computer science that uses algorithms and processes to obtain knowledge and insights from data. The knowledge and insights obtained from data can be used to make decisions and implement solutions to problems.
Pandas programs can be written in any text editor or online IDE, such as Replit, or in interactive coding notebooks such as Jupyter Notebook or Google Colab. Notebooks let us execute code one cell at a time instead of running an entire program file, and they provide an easy way to visualize datasets.
In this tutorial, we will use Google Colab to explore Pandas basics. You can make a copy of the sample notebook to follow along with the tutorial.
How to Use the Pandas Python Library
Before we can use the Pandas library in our programs, we first need to import it. We import Pandas in the same way we import other libraries in Python programs.
In the statement import pandas as pd, pd is an alias for the Pandas library. We don't have to import Pandas using an alias, but it lets us write less code every time we call one of its functions.
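As a minimal sketch, the import with the conventional pd alias looks like this. Printing the version is only an illustrative way to confirm that the library loaded.

```python
import pandas as pd  # pd is the conventional alias for the library

# Anything in Pandas is now reachable through the short alias.
print(pd.__version__)
```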
Data is organized as Series and DataFrames in Pandas
A Series is a one-dimensional array that holds data, like a single column or row of data in a spreadsheet.
In this example, we create a Series called my_series that contains the values stored in the numbers list. When we run this code segment, we see two columns of values displayed. The first column represents the index position of the value in the Series, and the second column contains the value. Additionally, we see the data type of the values displayed.
By default, each value is labeled with the index position in the Series. The first value is at index 0.
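A small sketch of the Series described above. The names numbers and my_series come from the text, while the values in the list are assumed for illustration.

```python
import pandas as pd

numbers = [10, 20, 30]  # illustrative values, not from the original notebook
my_series = pd.Series(numbers)

# Printing shows the index positions, the values, and the data type.
print(my_series)

# With the default labels, the first value is at index 0.
print(my_series[0])
```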
We can label each row with a name instead of using the index position as the label. We do this by adding the index argument when we create the Series.
When we create labels, we can access a value in the Series by referring to the label.
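Here is one way this could look, assuming the same illustrative values and the made-up labels "a", "b", and "c".

```python
import pandas as pd

numbers = [10, 20, 30]  # illustrative values

# The index argument replaces the default 0, 1, 2 labels.
my_series = pd.Series(numbers, index=["a", "b", "c"])

# A value can now be looked up by its label.
print(my_series["b"])  # 20
```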
A DataFrame is a two-dimensional data structure that consists of rows and columns. While a Series is like a column in a table, a DataFrame is the entire table.
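As a sketch, a DataFrame can be built from lists by passing a dictionary that maps column names to lists of values. The column names and values below are assumptions chosen for illustration.

```python
import pandas as pd

# Each key becomes a column; each list holds that column's values.
data = {
    "name": ["Mercury", "Venus", "Earth"],
    "moons": [0, 0, 1],
}
planets = pd.DataFrame(data)

print(planets)
print(planets.shape)  # (3, 2): three rows, two columns

# A single column of the table is a Series.
print(type(planets["moons"]).__name__)  # Series
```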
Loading and Saving Data with Pandas
Data scientists often use Pandas for data analysis by converting lists to a Pandas DataFrame as we saw in the previous example or by loading data from a file.
Let's try reading data from a file called planets.csv into a DataFrame. The read_csv() function takes a filename as its argument and returns a DataFrame containing the data from the file. In Google Colab, we can add datasets by uploading them to the Files panel on the left side of the notebook. In this example notebook, we have a planets.csv file located in our content folder. When we print this dataset, we see each column and its corresponding values displayed.
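The sketch below shows the idea without needing the actual file. The CSV text is a stand-in for planets.csv, and its columns and rows are assumptions; in Colab the call would simply be pd.read_csv("planets.csv").

```python
import io

import pandas as pd

# Stand-in for the contents of planets.csv (columns and rows are assumed).
csv_text = """id,name,moons,diameter_km
1,Mercury,0,4879
2,Venus,0,12104
3,Earth,1,12756
4,Mars,2,6792
"""

# read_csv accepts a filename or a file-like object such as StringIO.
planets = pd.read_csv(io.StringIO(csv_text))
print(planets)
```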
Notice that the way this is displayed is not very easy to follow. We can use the to_string() method, which renders the whole DataFrame as text, to print our dataset as a table.
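A minimal sketch of to_string(), using a small made-up DataFrame in place of the planets dataset.

```python
import pandas as pd

# Illustrative stand-in for the planets dataset.
planets = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth"],
    "moons": [0, 0, 1],
})

# to_string() returns the full table as a single string.
table = planets.to_string()
print(table)
```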
We can also save the DataFrame we are working with to a new file. Let's save this DataFrame to a new file called newplanets.csv. We will then see the new file in our content folder.
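Saving could look like the sketch below. The data is made up, and a temporary folder stands in for Colab's content folder so the example can run anywhere. Passing index=False is a choice made here to leave the row labels out of the file.

```python
import os
import tempfile

import pandas as pd

# Illustrative stand-in for the planets dataset.
planets = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth"],
    "moons": [0, 0, 1],
})

# A temporary folder stands in for the content folder in Colab.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "newplanets.csv")
    planets.to_csv(path, index=False)  # write the DataFrame to a new file
    reloaded = pd.read_csv(path)       # read it back to check the result

print(reloaded)
```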
We can use the loc attribute in Pandas to obtain a Series from the DataFrame. Let's get the information for the first planet in our dataset.
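As a sketch with made-up data, selecting a single row label with loc gives back a Series whose labels are the column names.

```python
import pandas as pd

# Illustrative stand-in for the planets dataset.
planets = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth"],
    "moons": [0, 0, 1],
})

# The first row has the default label 0.
first_planet = planets.loc[0]
print(type(first_planet).__name__)  # Series
print(first_planet["name"])         # Mercury
```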
We can also obtain a new DataFrame that contains multiple rows from the original DataFrame. Let's get the information for the first two planets from the planets DataFrame.
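A sketch of selecting several rows at once, again with made-up data. Passing a list of labels to loc returns a DataFrame rather than a Series.

```python
import pandas as pd

# Illustrative stand-in for the planets dataset.
planets = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth"],
    "moons": [0, 0, 1],
})

# A list of labels inside loc selects multiple rows.
first_two = planets.loc[[0, 1]]
print(first_two)
```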
Just like we can label each value in a Series, we can label each row in a DataFrame using the set_index() method. The inplace=True argument updates the existing DataFrame. If inplace is False, which is the default, the existing DataFrame is not updated and a new DataFrame is returned instead.
Now we can use the named index in the loc attribute to obtain the data for a specific planet.
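Here is a sketch of both steps together. It assumes the dataset has a name column to use as the index; the data itself is made up.

```python
import pandas as pd

# Illustrative stand-in for the planets dataset.
planets = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Mercury", "Venus", "Earth"],
    "moons": [0, 0, 1],
})

# Use the name column as the row labels, updating planets itself.
planets.set_index("name", inplace=True)

# Rows can now be selected by planet name.
earth = planets.loc["Earth"]
print(earth["moons"])  # 1
```

Without inplace=True, the same result could be kept by assigning the returned DataFrame to a variable, as in indexed = planets.set_index("name").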
Notice that this dataset has an id column that we don't need. Pandas has a drop() method that allows us to remove columns from our dataset.
Let's drop the id column using this function. We will use the axis argument to specify whether to drop the label from the index (0) or the columns (1). We will also use the inplace=True argument to update the existing DataFrame.
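A sketch of the drop, using made-up data that includes an id column.

```python
import pandas as pd

# Illustrative stand-in for the planets dataset.
planets = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Mercury", "Venus", "Earth"],
    "moons": [0, 0, 1],
})

# axis=1 means "id" is a column label; inplace=True updates planets itself.
planets.drop("id", axis=1, inplace=True)
print(planets.columns.tolist())  # ['name', 'moons']
```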
Viewing and Inspecting Data
Let's take a closer look at our data. Pandas has several functions available for obtaining information about a dataset.
For example, the head() function returns rows from the top of the dataset. We can pass the number of rows we want; if the number of rows is not specified, head() returns the first five rows.
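A short sketch of head() with and without a row count, using a made-up eight-row DataFrame.

```python
import pandas as pd

# Illustrative stand-in for the planets dataset.
planets = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["Mercury", "Venus", "Earth", "Mars",
             "Jupiter", "Saturn", "Uranus", "Neptune"],
})

first_five = planets.head()   # no argument: the first five rows
first_two = planets.head(2)   # only the first two rows

print(first_five)
print(first_two)
```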
The info() function gives us information about the dataset, such as the number of rows and columns, the data type of each column, and the number of non-null values in each column.
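As a sketch, calling info() on a small made-up DataFrame with one missing value shows how the non-null counts can differ between columns.

```python
import pandas as pd

# Illustrative data; None marks a missing value in the moons column.
planets = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth", "Mars"],
    "moons": [0, None, 1, 2],
})

# Prints the row and column counts, each column's data type,
# and how many non-null values each column holds.
planets.info()
```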
Empty, or null, values can cause problems when we are analyzing data. There are techniques and functions we can use in the Pandas library to clean the data by removing or replacing null values.
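The text does not name specific cleaning functions, so the sketch below is only one possible approach: dropna() removes rows that contain null values, and fillna() replaces them. The data and the replacement value 0 are assumptions for illustration.

```python
import pandas as pd

# Illustrative data; None marks a missing value in the moons column.
planets = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth", "Mars"],
    "moons": [0, None, 1, 2],
})

# Option 1: remove every row that has a null value.
without_nulls = planets.dropna()

# Option 2: keep the rows and replace nulls with a chosen value.
filled = planets.fillna(0)

print(len(without_nulls))            # 3
print(filled["moons"].isna().sum())  # 0
```

Both calls return new DataFrames and leave planets unchanged.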
Another useful function in Pandas is the describe() function, which gives us summary statistics for numerical columns in our dataset.
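A sketch of describe() on made-up data. By default the text column is left out of the summary.

```python
import pandas as pd

# Illustrative stand-in for the planets dataset.
planets = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth", "Mars"],
    "moons": [0, 0, 1, 2],
})

# One row per statistic: count, mean, std, min, quartiles, and max.
summary = planets.describe()
print(summary)
print(summary.loc["mean", "moons"])  # 0.75
```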
In addition, Pandas has the following functions to obtain information about our dataset.
planets.mean() - returns the mean of each numeric column (in recent Pandas versions, pass numeric_only=True if the dataset also has text columns)
planets.count() - returns the number of non-null values in each column
planets.max() - returns the highest value in each column
planets.min() - returns the lowest value in each column
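The functions in this list can be sketched as follows on made-up data. Passing numeric_only=True is a choice made here so that the text column is skipped when computing the mean, highest, and lowest values.

```python
import pandas as pd

# Illustrative stand-in for the planets dataset.
planets = pd.DataFrame({
    "name": ["Mercury", "Venus", "Earth", "Mars"],
    "moons": [0, 0, 1, 2],
    "diameter_km": [4879, 12104, 12756, 6792],
})

means = planets.mean(numeric_only=True)    # mean of each numeric column
counts = planets.count()                   # non-null values in every column
highest = planets.max(numeric_only=True)   # highest value per numeric column
lowest = planets.min(numeric_only=True)    # lowest value per numeric column

print(means)
print(counts)
print(highest)
print(lowest)
```

Each call returns a Series labeled by column name, so a single result can be read with, for example, means["moons"].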
Use the Pandas Python Library
This introduction to Pandas is only the beginning of what you can do with a dataset! You can combine your knowledge of Python conditionals and loops to perform more complex analysis of the data, and there are more functions in the Pandas library you can use to find information and perform calculations. As you work with larger datasets, you will also encounter scenarios where you will need to perform data cleaning, which is an important part of data science. Additionally, data scientists often create visualizations of data to obtain a better understanding of the information the data is portraying. You can learn more about how you can use Pandas for data analytics in our Data Analytics Python class!
Written by Jamila Cocchiola who has always been fascinated with technology and its impact on the world. The technologies that emerged while she was in high school showed her all the ways software could be used to connect people, so she learned how to code so she could make her own! She went on to make a career out of developing software and apps before deciding to become a teacher to help students see the importance, benefits, and fun of computer science.