A Fascinating Overview of Python's Pandas Library

At the foundation of any data scientist's workflow is the ability to analyze, clean, and store large data sets efficiently. Python's Pandas library provides those features and much more. In this blog post, I will give a brief overview of Pandas and its capabilities.

Introduction
Basic Usage
Data Indexing and Slicing
Basic Time Series Analysis
Conclusion

Introduction

Python's Pandas library is a powerful tool for data scientists because of its ease of use and the data manipulation methods it offers. Pandas is essentially a high-level layer built on top of NumPy arrays, and its objects can be constructed directly from Python's built-in data structures such as lists and dictionaries. On top of that foundation, Pandas provides intuitive data manipulation techniques and works with a wider ecosystem of data analysis and visualization tools.

Basic Usage

The Pandas library provides two main data structures, the DataFrame and the Series. Loosely speaking, a DataFrame can be thought of as a two-dimensional, dictionary-like structure (values addressed by row and column), and a Series as a one-dimensional one (values addressed by index label).
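To make the dictionary analogy concrete, here is a minimal sketch. The variable names (usage, frame, and so on) are chosen only for illustration, and the values are borrowed from the example used later in this post.

```python
import pandas as pd

# A Series behaves like a one-dimensional, labelled mapping: index label -> value.
usage = pd.Series({"python": 90.3, "java": 59.3})
python_usage = usage["python"]  # label lookup, much like a dict key

# A DataFrame behaves like a mapping of column name -> Series.
frame = pd.DataFrame({
    "language": ["python", "java"],
    "usage": [90.3, 59.3],
})
usage_column = frame["usage"]  # selecting one column returns a Series

print(python_usage)
print(type(usage_column).__name__)
```

The analogy is only approximate: unlike a plain dictionary, both structures keep their entries in order and support vectorized arithmetic.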

To demonstrate the basic usage of the Pandas Library, let's take a simple example of a DataFrame containing information about a few programming languages.

import pandas as pd

data_frame = pd.DataFrame({
    "language": ["python", "java", "c++", "haskell"],
    "usage": [90.3, 59.3, 48.2, 0.3]
})

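The above code snippet produces a DataFrame object called data_frame that looks like this (the usage values are stored as plain floats representing percentages, and Pandas adds a default integer index):

	language	usage
0	python	90.3
1	java	59.3
2	c++	48.2
3	haskell	0.3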

Once the DataFrame has been created, we can use Pandas' methods to manipulate and analyze the data. For example, we can run basic calculations on a column, such as summing all of its values:

data_frame['usage'].sum()

This returns 198.1 (up to floating-point rounding), the sum of all the usage percentages.
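Other aggregations follow the same pattern as sum(). The sketch below rebuilds the same DataFrame so that it runs on its own; the variable names highest, most_used, and summary are illustrative only.

```python
import pandas as pd

data_frame = pd.DataFrame({
    "language": ["python", "java", "c++", "haskell"],
    "usage": [90.3, 59.3, 48.2, 0.3],
})

usage = data_frame["usage"]

highest = usage.max()  # largest value in the column
lowest = usage.min()   # smallest value in the column

# idxmax() returns the index label of the largest value, which we can
# use to look up the matching language name.
most_used = data_frame.loc[usage.idxmax(), "language"]

# describe() bundles count, mean, std, min, quartiles and max in one Series.
summary = usage.describe()

print(most_used, highest, lowest)
print(summary)
```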

Data Indexing and Slicing

With Pandas' powerful indexing capabilities, it is easy to look up data by label or by condition. For example, we can look up a row in data_frame by language by passing a boolean condition to the loc indexer:

data_frame.loc[data_frame['language'] == 'java']

This code snippet will return a one-row DataFrame containing the information about the Java language.
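The lookup above filters rows with a boolean condition. If lookups by language are frequent, one option is to make the language column the index so that loc can select by label directly. This is a sketch of that alternative, not a required step; by_language is a name introduced here for illustration.

```python
import pandas as pd

data_frame = pd.DataFrame({
    "language": ["python", "java", "c++", "haskell"],
    "usage": [90.3, 59.3, 48.2, 0.3],
})

# Use the language names as row labels instead of the default 0..3 index.
by_language = data_frame.set_index("language")

# Label-based lookup of a whole row (returned as a Series of its columns).
java_row = by_language.loc["java"]

# Label-based lookup of a single cell: row label, then column label.
java_usage = by_language.loc["java", "usage"]

print(java_row)
print(java_usage)
```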

There are also many ways to slice and dice the data using Pandas' methods. For example, to get the usage percentage for all languages except Python, we could use the following expression:

data_frame.loc[data_frame['language'] != 'python', 'usage']

This will return a Series containing the usage percentages for every language except Python.
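Two other common ways to slice are by position with iloc and by combining several conditions. The following is a minimal sketch; the 40 percent threshold is an arbitrary value picked for the example.

```python
import pandas as pd

data_frame = pd.DataFrame({
    "language": ["python", "java", "c++", "haskell"],
    "usage": [90.3, 59.3, 48.2, 0.3],
})

# Positional slicing: the first two rows, regardless of their labels.
first_two = data_frame.iloc[:2]

# Combine conditions with & (and) or | (or); each condition needs its
# own parentheses because of operator precedence.
mask = (data_frame["language"] != "python") & (data_frame["usage"] > 40)
popular_not_python = data_frame.loc[mask, "usage"]

print(first_two)
print(popular_not_python)
```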

Basic Time Series Analysis

In addition to providing powerful data manipulation capabilities, Pandas also offers a rich set of tools for basic time series analysis. With a few lines of code, we can calculate summary statistics such as the mean, median, and standard deviation of time series data.

Consider the following example of a pandas Series containing monthly usage statistics of a popular website:

import pandas as pd

usage_series = pd.Series({
    "january": 97.3, 
    "february": 78.2, 
    "march": 83.5, 
    "april": 92.1, 
    "may": 68.4
})

Now, using Pandas we can compute various statistics related to this Series:

mean = usage_series.mean()
median = usage_series.median()
std = usage_series.std()

This gives a mean usage of approximately 83.9, a median usage of 83.5, and a standard deviation of approximately 11.40. Note that std() computes the sample standard deviation (ddof=1) by default.
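The Series above is labelled with month names, so Pandas treats it like any other labelled data. Time-aware operations become available once the index holds real dates. The sketch below reuses the same five values with a DatetimeIndex; the year 2023 is an assumption made purely for illustration, and the names dated_usage, monthly_change, and rolling_mean are hypothetical.

```python
import pandas as pd

# Same five values as the Series above, indexed by month-start dates.
# The year 2023 is an assumption made purely for illustration.
dates = pd.date_range("2023-01-01", periods=5, freq="MS")
dated_usage = pd.Series([97.3, 78.2, 83.5, 92.1, 68.4], index=dates)

# Select a date range by label; both endpoints are included.
feb_to_apr = dated_usage.loc["2023-02":"2023-04"]

# Change from one month to the next (the first entry is NaN).
monthly_change = dated_usage.diff()

# Three-month rolling mean (the first two entries are NaN).
rolling_mean = dated_usage.rolling(window=3).mean()

print(feb_to_apr)
print(monthly_change)
print(rolling_mean)
```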

Conclusion