約維安計畫: Pandas

Week 35

Dec 14, 2023


Source: https://pandas.pydata.org/about/citing.html

What Pandas has to do with pandas

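Before the Pandas module appeared, Python lacked a data structure comparable to an Excel spreadsheet, an R data.frame, a SAS dataset, or a MATLAB table. Wes McKinney, the author of Pandas, saw tabular data (labeled two-dimensional arrays) as the last missing piece for Python in data science. He combined characteristics of Python's built-in tuple and set to create the Index class, then paired the Index class with the ndarray class from the NumPy module to create the Series and DataFrame classes. A Series is a one-dimensional array with one set of labels; a DataFrame is a two-dimensional array with two sets of labels.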

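(A not very useful piece of trivia.) Earlier versions of Pandas also had a Panel class, a three-dimensional array with three sets of labels. Take the leading letters of Panel, DataFrame, and Series, put them together, and you happen to get Pandas. One story has it that this is where the module's name comes from, in which case Pandas has nothing whatsoever to do with the animal. The Panel class was deprecated in version 0.20 and removed in version 0.25, so today the three main Pandas classes are Index, Series, and DataFrame.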

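Installing and importing Pandas

Pandas can be installed with either pip or conda. If your Python was downloaded from Python.org, install with pip. If you are a member of 約維安計畫, your Python is probably the Miniconda distribution, in which case installing with conda is recommended.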

$ pip install pandas==MAJOR.MINOR.PATCH
$ conda install pandas==MAJOR.MINOR.PATCH

To upgrade an existing Pandas installation, add the -U option (for pip) or the --update-deps option (for conda) after the install command.

$ pip install -U pandas==MAJOR.MINOR.PATCH
$ conda install -c conda-forge --update-deps pandas==MAJOR.MINOR.PATCH

The official documentation imports the entire module under the alias pd.

import pandas as pd

Two built-in module attributes show the installed Pandas version and where it is installed.

print(pd.__version__)
print(pd.__file__)
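The version string can also be checked in code, which is handy when an example depends on a minimum release. The following is a minimal sketch; the threshold of 1.0 is an arbitrary value chosen for illustration, not a requirement stated in this article.

```python
import pandas as pd

# Take the first two dot-separated components, e.g. "2.1.4" gives (2, 1).
major, minor = (int(part) for part in pd.__version__.split(".")[:2])
print(major, minor)

# The (1, 0) threshold is a hypothetical example value.
if (major, minor) < (1, 0):
    print("This pandas is older than 1.0; some examples may behave differently.")
```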

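How to learn Pandas well

The key to learning Pandas well is understanding its three main classes together with ndarray, the class they are built on. Once the relationships among these four classes are clear, both introductory and advanced Pandas techniques become much easier to pick up.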

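NumPy's ndarray class is a multidimensional array that supports vectorized operations.
Pandas's Index class is a one-dimensional data structure that combines characteristics of Python's built-in tuple and set.
Pandas's Series class is a one-dimensional ndarray with one Index attached.
Pandas's DataFrame class is a two-dimensional ndarray with two Index objects attached; it can also be thought of as several Series sharing the same Index.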
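The relationships among the four classes can be sketched in a few lines. This is only an illustration; the labels "a" to "d" and the column names "prime" and "square" are made up for the example.

```python
import numpy as np
import pandas as pd

# A one-dimensional ndarray holds the raw values.
values = np.array([2, 3, 5, 7])

# An Index holds the labels (hypothetical labels for this sketch).
labels = pd.Index(["a", "b", "c", "d"])

# A Series pairs a one-dimensional array with one Index.
s = pd.Series(values, index=labels)
print(type(s.to_numpy()))
print(s.index.equals(labels))

# A DataFrame has a row Index and a column Index.
df = pd.DataFrame({"prime": s, "square": s**2})
print(type(df.index), type(df.columns))

# Each column is a Series, and all columns share the DataFrame's row Index.
print(df["square"].index.equals(df.index))
```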

How to create an Index

Pass a one-dimensional data structure to the Index class to create an Index.

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
prime_indices = pd.Index(primes)
print(prime_indices)
print(type(prime_indices))

The Index class has characteristics of both Python's built-in tuple and set. We first look at its tuple-like trait: it is immutable.

try:
    prime_indices[-1] = 31
except TypeError as e:
    print(e)
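Since an Index cannot be changed in place, "editing" one means building a new Index. A small sketch using the append, insert, and delete methods, each of which returns a new object and leaves the original untouched:

```python
import pandas as pd

prime_indices = pd.Index([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])

# append() returns a new Index; the original keeps its ten elements.
extended = prime_indices.append(pd.Index([31]))
print(len(prime_indices), len(extended))

# insert() and delete() work by position and also return new objects.
with_one = prime_indices.insert(0, 1)
without_last = prime_indices.delete(-1)
print(with_one[:3].tolist())
print(without_last[-1])
```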

Next we look at its set-like trait: set operations.

odds = range(1, 30, 2)
odd_indices = pd.Index(odds)
print(prime_indices.intersection(odd_indices))
print(prime_indices.union(odd_indices))
print(prime_indices.symmetric_difference(odd_indices))
print(prime_indices.difference(odd_indices))
print(odd_indices.difference(prime_indices))
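The comparison with set only goes so far: an Index keeps its elements in order and may contain duplicates. The sketch below, with made-up labels, shows both points alongside the set operations.

```python
import pandas as pd

# Unlike a set, an Index keeps insertion order and may hold duplicates.
labels = pd.Index(["b", "a", "b", "c"])
print(labels.tolist())
print(labels.is_unique)

# unique() drops the repeats while preserving first-seen order.
print(labels.unique().tolist())

# Set operations still apply to any pair of Index objects.
primes = pd.Index([2, 3, 5, 7])
odds = pd.Index([7, 5, 3, 1])
print(sorted(primes.intersection(odds)))
print(primes.difference(odds).tolist())
```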

How to create a Series

Pass a one-dimensional data structure to the Series class to create a Series.

prime_series = pd.Series(primes)
print(prime_series)
print(type(prime_series))

We first confirm that a Series is a one-dimensional ndarray with one Index attached.

print(type(prime_series.index))
print(type(prime_series.values))

When creating a Series, we can supply a custom Index; by default it uses a sequence of integers starting from zero.

prime_series = pd.Series(primes, index=range(1, 11))
print(prime_series)
print(prime_series.index)

Because a Series contains an ndarray, it supports the familiar ndarray operations, including vectorized arithmetic, fancy indexing, and Boolean indexing.

print(prime_series**2) # Vectorization
print(prime_series.iloc[[0, 1, 9]]) # Fancy indexing by position (.iloc), since the labels now run from 1 to 10
print(prime_series[prime_series >= 10]) # Boolean indexing
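With the custom index defined above, labels run from 1 to 10 while positions still run from 0 to 9, so it helps to be explicit about which one is meant. A short sketch contrasting label-based selection with .loc and position-based selection with .iloc:

```python
import pandas as pd

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
prime_series = pd.Series(primes, index=range(1, 11))

# .loc selects by label: labels here run from 1 to 10.
print(prime_series.loc[[1, 2, 10]].tolist())

# .iloc selects by position: positions always run from 0 to 9.
print(prime_series.iloc[[0, 1, 9]].tolist())

# Label 0 does not exist in this index, so a label lookup raises KeyError.
try:
    prime_series.loc[0]
except KeyError as e:
    print("KeyError:", e)
```

Both selections return the same three primes, because label 1 sits at position 0 and label 10 at position 9.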

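How to create a DataFrame

Pass a two-dimensional data structure to the DataFrame class to create a DataFrame. Because a DataFrame is a two-dimensional ndarray with two Index objects attached, we can choose between two ways of supplying the data: column-wise or row-wise.

First, build a DataFrame by supplying the data column-wise.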

movie_df = pd.DataFrame()
movie_df["title"] = ["The Shawshank Redemption", "The Dark Knight", "Schindler's List", "Forrest Gump", "Inception"]
movie_df["imdb_rating"] = [9.3, 9.0, 8.9, 8.8, 8.7]
print(type(movie_df))
movie_df
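One detail of the column-by-column approach is worth a quick look: the first column assigned to an empty DataFrame fixes the number of rows, and every later list must have the same length. A minimal sketch with a shortened movie list; the three-element rating list is a deliberate mistake for illustration.

```python
import pandas as pd

movie_df = pd.DataFrame()
# The first column assigned determines the row Index.
movie_df["title"] = ["The Dark Knight", "Inception"]
print(movie_df.index)

# A later column must match the number of rows already present.
try:
    movie_df["imdb_rating"] = [9.0, 8.7, 8.8]  # deliberately one value too many
except ValueError as e:
    print("ValueError:", e)

movie_df["imdb_rating"] = [9.0, 8.7]
print(movie_df.shape)
```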

We can also build a DataFrame by supplying the data row-wise, using a Python list that collects several dicts sharing the same keys.

movie_list = [
    {"title": "The Shawshank Redemption", "imdb_rating": 9.3},
    {"title": "The Dark Knight", "imdb_rating": 9.0},
    {"title": "Schindler's List", "imdb_rating": 8.9},
    {"title": "Forrest Gump", "imdb_rating": 8.8},
    {"title": "Inception", "imdb_rating": 8.7}
]
movie_df = pd.DataFrame(movie_list)
print(type(movie_df))
movie_df
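The two input styles lead to the same result. As a sketch with a shortened movie list, the row-wise list of dicts and a column-wise dict of lists (a variant of the column-wise approach, passed directly to the constructor) can be compared with equals, and to_dict(orient="records") turns a DataFrame back into row-wise dicts.

```python
import pandas as pd

# Row-wise input: one dict per row.
rows = [
    {"title": "The Dark Knight", "imdb_rating": 9.0},
    {"title": "Inception", "imdb_rating": 8.7},
]
from_rows = pd.DataFrame(rows)

# Column-wise input: one list per column.
from_columns = pd.DataFrame(
    {"title": ["The Dark Knight", "Inception"], "imdb_rating": [9.0, 8.7]}
)

# Both routes produce the same labels and values.
print(from_rows.equals(from_columns))

# to_dict(orient="records") converts a DataFrame back to row-wise dicts.
print(from_rows.to_dict(orient="records") == rows)
```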

We can confirm that a DataFrame is a two-dimensional ndarray with two Index objects attached.

print(type(movie_df.values))
print(type(movie_df.index))
print(type(movie_df.columns))
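The two Index objects show up again whenever a DataFrame is sliced. In the sketch below, the row labels "m1" and "m2" are hypothetical, added only to make the row Index easy to tell apart from the column Index.

```python
import pandas as pd

movie_df = pd.DataFrame(
    {"title": ["The Dark Knight", "Inception"], "imdb_rating": [9.0, 8.7]},
    index=["m1", "m2"],  # hypothetical row labels
)

# Selecting a column returns a Series labelled by the row Index.
ratings = movie_df["imdb_rating"]
print(type(ratings))
print(ratings.index.equals(movie_df.index))

# Selecting a row returns a Series labelled by the column Index.
first_row = movie_df.loc["m1"]
print(first_row.index.equals(movie_df.columns))

# shape reports (number of rows, number of columns).
print(movie_df.shape)
```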

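Now that we have been introduced to Pandas, the next newsletter can cover basic DataFrame manipulation techniques. That brings Week 35 of 約維安計畫: Pandas to a close, and I hope you are looking forward to the next article as much as I am. What did you think of this one? Like 😻, comment 🙋‍♂️, or share 🙌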

數聚點