約維安計畫：建立基本資料單位「向量」

第二十五週

Feb 26, 2022

∙ Paid

基本資料單位

R 語言的基本資料單位（Building block）稱作向量（Vector），這一點與大多數廣泛用途程式語言（General-purposed programming languages）例如：C/C++ 語言、Java 語言或 Python 有明顯的差異。簡單來說，在 R 語言實踐「哈囉世界」print() 函數其中的 "Hello, World!" 對 R 語言來說就不是以純量（Scalar）來看待她，而是將其視作長度 1 的文字向量。

hello_world <- "Hello, World!"
print(hello_world)
print(class(hello_world))     # Check the class of hello_world
print(is.vector(hello_world)) # Check if hello_world is a vector or not
print(length(hello_world))    # Check the length of hello_world

以向量作為基本資料單位的設定體現了 R 語言作為一個以資料分析為主軸的程式語言理念，對她而言，任何運算都是元素操作（elementwise）。R 語言中的向量有著不同的類型，包含有：數值向量（numeric）、整數向量（integer）、文字向量（character）、邏輯向量（logical）、日期向量（Date）、日期時間向量（POSIXct POSIXt）與未定義值（NA/NULL/NaN/-Inf/Inf）。

如何建立向量

建立長度為 1 的向量非常簡單，只需要用 <- 將不同類型的資料指派給物件名稱即可，使用者不需要自己去猜測向量類型，而是使用函數 class() 或者 typeof() 讓 R 語言告訴我們該物件是什麼類型的向量，並且使用 length() 函數檢視向量長度。

hello_world <- "Hello, World!"
print(hello_world)
print(class(hello_world))  # Check the class of hello_world
print(typeof(hello_world)) # Check the type of hello_world
print(length(hello_world)) # Check the length of hello_world

建立數值向量

如果想要建立長度大於 1 的向量，最常使用的方法是利用 c() 函數（combine 的簡稱），在 c() 函數中可以將多筆資料以逗號分隔以一個物件名稱參照。

favorite_numbers <- c(7, 24, 25)
print(favorite_numbers)
print(class(favorite_numbers))  # Check the class of favorite_numbers
print(typeof(favorite_numbers)) # Check the type of favorite_numbers
print(length(favorite_numbers)) # Check the length of favorite_numbers

rep() 函數（replicate 的簡稱）可以生成包含重複資料的向量，其中 times 參數可以指定向量中要有幾個重複值。

lucky_sevens <- rep(7, times = 3)
print(lucky_sevens)
print(length(lucky_sevens)) # Check the length of lucky_sevens

建立長度大於 1 的數值向量除了如前述一般使用 c() 函數或者 rep() 函數，亦可以呼叫 seq() 函數（sequence 的簡稱）或 : 符號建立具有數列特性的數值向量。 seq() 函數可以生成等差數列，其中 from 參數指定數列的起始值，to 參數指定數列的終止值，by 參數指定數值的間距。

print(seq(from = 7, to = 63, by = 7))

如果單純是要生成數值間距為 1 的數值向量，用 : 符號更快捷。

print(1:20)

不論輸入數字帶有或不帶有小數點，R 語言預設都以數值向量（numeric）儲存。

favorite_numbers <- c(2.718, 3.14, 7, 24, 25)
print(favorite_numbers)
print(class(favorite_numbers))  # Check the class of favorite_numbers
print(typeof(favorite_numbers)) # Check the type of favorite_numbers

建立整數向量

輸入整數並加入大寫英文字母 L 作註記，R 語言就會改儲存為整數向量（integer）。

favorite_integers <- c(7L, 24L, 25L)
print(favorite_integers)
print(class(favorite_integers))  # Check the class of favorite_integers
print(typeof(favorite_integers)) # Check the type of favorite_integers

若在帶有必要小數位數的數字後加上大寫英文字母 L 註記 R 語言將忽略整數向量的註記改以數值向量儲存。

favorite_numbers <- c(2.718L, 3.14L)
print(favorite_numbers)
print(class(favorite_numbers))  # Check the class of favorite_numbers
print(typeof(favorite_numbers)) # Check the type of favorite_numbers

建立文字向量

在 R 語言中我們可以使用成雙的單引號 '' 或成對的雙引號 "" 來建立文字向量（character），多數的時候使用單引號或者雙引號不會有任何分別。

hellos <- c("Hello, World!", 'Hello, R!')
print(hellos)
print(class(hellos))  # Check the class of hellos
print(typeof(hellos)) # Check the type of hellos

但是在部分情境中，使用成雙的單引號或成對的雙引號標註文字向量是有差別的，像是在英文句子中經常出現的非成雙單引號（Apostrophe）以及用作嘲諷強調的空氣雙引號（air-quotes），當文字向量中的內容有出現這些元件時如果沒有特別關注，宣告的當下就會產生錯誤。

# Error: unexpected symbol
mcdonalds <- 'I'm lovin it'
shaq <- 'Shaquille O'Neal'
ross_geller_said <- "Let's put aside the fact that you "accidentally" pick up my grandmother's ring."

這時候我們可以使用跳脫字元反斜線 \ 來完成宣告或者使用不同樣式的引號。

mcdonalds <- 'I\'m lovin it'
shaq <- 'Shaquille O\'Neal'
ross_geller_said <- "Let's put aside the fact that you \"accidentally\" pick up my grandmother's ring."

建立邏輯向量

當我們進行判斷條件或者資料篩選的時候會需要仰賴邏輯向量（logical），邏輯向量只有 FALSE 與 TRUE 這兩個值，亦可以簡寫為 F 與 T

print(class(FALSE))
print(class(TRUE))
print(class(F))
print(class(T))

這裡特別提醒一個觀念，R 語言對於英文的大小寫是敏感的（case-sensitive），像是 TRUE 與 T 會被識別為邏輯向量，但是 True、true 或者 t 則會被視作物件以及函數名稱。

建立日期向量

在 R 語言中日期向量（Date）的外觀看起來跟文字向量並沒有什麼差別，但是當我們一但將它們放入 class() 函數中檢驗，就會發現並不是文字向量，接下來使用的 Sys.Date() 是不需要參數就會輸出電腦系統日期向量的函數。

sys_date <- Sys.Date() # system date vector
print(sys_date)        # looks like a character
print(class(sys_date))

R 語言預設以西元 1970 年 1 月 1 日作為 0，在這一天以後的每天都 +1 來記錄，而這一天以前的每天都 -1 來記錄。文字向量與日期向量兩者最大的分野，在於日期向量可以進行數值運算，而文字向量不行。

print(sys_date - 1)
print(sys_date + 1)

建立日期時間向量

在 R 語言中日期時間向量（POSIXct POSIXt）的外觀看起來跟文字向量同樣也沒有什麼差別，但是我們一但將它們放入 class() 函數中檢驗，就會發現並不是文字向量，接下來使用的 Sys.time() 函數是不需要參數就會輸出電腦系統日期時間向量的函數。

sys_time <- Sys.time() # system time vector
print(sys_time)        # looks like a character
print(class(sys_time))
print(typeof(sys_time))

R 語言預設以西元 1970 年 1 月 1 日格林尼治標準時間（Greenwich Mean Time，GMT）00 時 00 分 00 秒作為 0，在這個時間點以後的每秒都 +1 來記錄，這個時間點以前的每秒都 -1 來記錄。文字向量與日期時間向量兩者最大的分野，在於日期時間向量可以進行數值運算，而文字向量不行。

print(sys_time - 1)
print(sys_time + 1)

建立未定義值向量

R 語言有豐富的未定義值向量，用來表示一些特殊情境所需要的表示，有遺漏值向量 NA、空值向量 NULL、非數值向量 NaN 以及無限大數值向量 Inf。建立這些未定義值向量就像建立邏輯向量一般，直接輸入保留字即可。

print(NA)
print(NULL)
print(NaN)
print(Inf)
print(-Inf)

在初步暸解如何在 R 語言建立基本資料單位「向量」之後，約維安計畫：建立基本資料單位「向量」來到尾聲，希望您也和我一樣期待下一篇文章。

延伸閱讀

對於這篇文章有什麼想法呢？喜歡😻、留言🙋‍♂️或者分享🙌

數聚點