模仿# dplyr package
dplyr所有的功能都可以自己用R實現,但用dplyr有好處:使程式碼易讀,效能上通常也比自己寫的快 (dplyr在許多需要運算的部分用C++寫)。
dplyr的函式通則是這樣子的:Data_frame Function(Data_frame, arg1, arg2...)
,函式接受一個data frame作為第一個參數而其他參數接在後面,執行結束之後會回傳結果data frame,可以把回傳的data.frame指定給其他變數 (但是下面的範例因為之後都不會用到回傳的data.frame,所以都直接head()
而不指定給變數,這裡新手容易confuse所以特別說明一下...),下面列出較常用的。
Function | Description |
---|---|
select() | Select by:Specify、Slice、Matched pattern、Omit |
filter() | Select by:Given Condition |
arrange() | Sort by:Ascent/Decent |
rename() | Rename some column |
mutate()/transmute() | Create new variable from expression |
group_by() | create a separate data frame that splits the original data frame by specified variable. |
summarize() /summarise() | |
%>% | Pipeline operator |
下面都是用dplyr內的dataset範例做練習,請先install.packages("nycflights13")
,然後library("nycflights13")
之後會自動產生一個叫做flight的data.frame,才能執行下面函式裡面那些範例。
#引入data set
> library("nycflights13")
#先看一下這個data set的欄位名稱
> names("ncyflights13")
NULL
> names(flights)
[1] "year" "month" "day" "dep_time" "sched_dep_time" "dep_delay"
[7] "arr_time" "sched_arr_time" "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance" "hour" "minute"
[19] "time_hour"
#先查看一下資料長什麼樣子
> head(flights)
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
1 2013 1 1 517 515 2 830 819 11 UA 1545
2 2013 1 1 533 529 4 850 830 20 UA 1714
3 2013 1 1 542 540 2 923 850 33 AA 1141
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725
5 2013 1 1 554 600 -6 812 837 -25 DL 461
6 2013 1 1 554 558 -4 740 728 12 UA 1696
# ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Select
select()
選定所要的欄位,用法:
- Specify (
select(dataframe, cloumnNameA)
) - Slice (
select(dateframe, cloumnNameA:cloumnNameB)
) - Omit (
select(dataframe, -(Specify/Slice/pattern matched/...))
) pattern matched (
select(dataframe, selectHelpers(pattern))
)常見的Select helpers有:
starts_with(string)
,選取欄位名稱以string開頭的欄位。ends_with(string)
,選取欄位名稱以string結尾的欄位。contains(string)
,選取欄位名稱包含某string的欄位。matches(regex)
,跟contains()相似,但是可以用regex。one_of(stringVector)
選出名稱可在string vector內找到的欄位。everything()
,代表除了已經指定的欄位之外其他所有欄位,用來將特定欄位提取到最前方。
#Specify,可以指定多個column
> head(select(flights, year, day))
# A tibble: 6 × 2
year day
<int> <int>
1 2013 1
2 2013 1
3 2013 1
4 2013 1
5 2013 1
6 2013 1
#Slice,可以用column name或column index取
> head(select(flights, year:day))
# A tibble: 6 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
#Omit
> head(select(flights, -(year:distance)))
# A tibble: 6 × 3
hour minute time_hour
<dbl> <dbl> <dttm>
1 5 15 2013-01-01 05:00:00
2 5 29 2013-01-01 05:00:00
3 5 40 2013-01-01 05:00:00
4 5 45 2013-01-01 05:00:00
5 6 0 2013-01-01 06:00:00
6 5 58 2013-01-01 05:00:00
#pattern matched
#starts_with()與ends_with()
#只拿一個來示範,另一個可以自己做練習
#選取名稱以"a"開頭的欄位
> head(select(flights, starts_with("a")))
# A tibble: 6 × 3
arr_time arr_delay air_time
<int> <dbl> <dbl>
1 830 11 227
2 850 20 227
3 923 33 160
4 1004 -18 183
5 812 -25 116
6 740 12 150
#contains()與matches()
#先拿contains來示範,matches()看完regex可依樣做練習
#選取名稱包含"time"的欄位
> head(select(flights, contains("time")))
# A tibble: 6 × 6
dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
<int> <int> <int> <int> <dbl> <dttm>
1 517 515 830 819 227 2013-01-01 05:00:00
2 533 529 850 830 227 2013-01-01 05:00:00
3 542 540 923 850 160 2013-01-01 05:00:00
4 544 545 1004 1022 183 2013-01-01 05:00:00
5 554 600 812 837 116 2013-01-01 06:00:00
6 554 558 740 728 150 2013-01-01 05:00:00
#one_of()的演示
> head(select(flights, one_of(c("month", "dest", "abc"))))
# A tibble: 6 × 2
month dest
<int> <chr>
1 1 IAH
2 1 IAH
3 1 MIA
4 1 BQN
5 1 ATL
6 1 ORD
Warning message:
In one_of(c("month", "dest", "abc")) : Unknown variables: `abc`
#everything()代表除了已經指定的欄位之外其他所有欄位。
#下面的例子是把dest、air_time拿到最前面,然後其他的接在後面。
> head(select(flights, dest, air_time, everything()), 3)
# A tibble: 3 × 19
dest air_time year month day dep_time sched_dep_time dep_delay arr_time
<chr> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
1 IAH 227 2013 1 1 517 515 2 830
2 IAH 227 2013 1 1 533 529 4 850
3 MIA 160 2013 1 1 542 540 2 923
# ... with 10 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Filter
filter()
用法是filter(data_frame, condition)
,功能與R內建的subset相似但較快。
#example:取出arr_delay 10 以上的資料,可以用&(AND)跟|(OR)串聯條件。
> x <- filter(flights, arr_delay > 10 & carrier == "AA")
#可看出篩選過的欄位統計結果分布明顯不一樣
> summary(flights$arr_delay)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-86.000 -17.000 -5.000 6.895 14.000 1272.000 9430
> summary(x$arr_delay)
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.00 19.00 33.00 54.63 68.00 1007.00
#看看篩選過的x,arr_delay都 > 10且carrier都是AA
> head(x)
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
1 2013 1 1 542 540 2 923 850 33 AA 1141
2 2013 1 1 559 600 -1 941 910 31 AA 707
3 2013 1 1 629 630 -1 824 810 14 AA 303
4 2013 1 1 635 635 0 1028 940 48 AA 711
5 2013 1 1 724 730 -6 1111 1040 31 AA 715
6 2013 1 1 725 730 -5 1052 1040 12 AA 2083
# ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Arrange
arrange()
根據某一欄排序,用法:
arrange(data_frame, column_name)
預設就是用ascent排序
arrange(data_frame, desc(column_name))
如果要改descent排序可以加入desc():
#預設為升冪排序
> head(arrange(flights, sched_dep_time))
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
1 2013 7 27 NA 106 NA NA 245 NA US 1632
2 2013 1 2 458 500 -2 703 650 13 US 1030
3 2013 1 3 458 500 -2 650 650 0 US 1030
4 2013 1 4 456 500 -4 631 650 -19 US 1030
5 2013 1 5 458 500 -2 640 650 -10 US 1030
6 2013 1 6 458 500 -2 718 650 28 US 1030
# ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
#降冪排序
> head(arrange(flights, desc(sched_dep_time)))
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
1 2013 1 1 2353 2359 -6 425 445 -20 B6 739
2 2013 1 1 2353 2359 -6 418 442 -24 B6 707
3 2013 1 1 2356 2359 -3 425 437 -12 B6 727
4 2013 1 2 42 2359 43 518 442 36 B6 707
5 2013 1 2 2351 2359 -8 427 445 -18 B6 739
6 2013 1 2 2354 2359 -5 413 437 -24 B6 727
# ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Rename
把某些column重新命名,用法是:
rename(dataSet, newName1 = oldName1, newName2 = oldName2, ...)
#先查看一下原本的column name
> names(flights)
[1] "year" "month" "day" "dep_time" "sched_dep_time" "dep_delay"
[7] "arr_time" "sched_arr_time" "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance" "hour" "minute"
[19] "time_hour"
#記住新名子跟舊名子不要搞錯邊,不然會像我一樣發生下面的錯誤。
> names(rename(flights, year = aaa, month = bbb))
Error: Unknown variables: aaa, bbb.
#用names查看,column name果然改了
> names(rename(flights, aaa = year, bbb = month))
[1] "aaa" "bbb" "day" "dep_time" "sched_dep_time" "dep_delay"
[7] "arr_time" "sched_arr_time" "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance" "hour" "minute"
[19] "time_hour"
Mutate/Transmute
根據表達式產生的值新增欄位,兩者的不同是:
Mutate
回傳包含新欄位的dataset
Transmute
回傳只有新欄位的dataset
#Mutate的操作方法
> x <- mutate(flights, newColumn1 = month + day, newColumn2 = month - day)
> names(x)
[1] "year" "month" "day" "dep_time" "sched_dep_time"
[6] "dep_delay" "arr_time" "sched_arr_time" "arr_delay" "carrier"
[11] "flight" "tailnum" "origin" "dest" "air_time"
[16] "distance" "hour" "minute" "time_hour" "newColumn1"
[21] "newColumn2"
> head(select(x, starts_with("new")))
# A tibble: 6 × 2
newColumn1 newColumn2
<int> <int>
1 2 0
2 2 0
3 2 0
4 2 0
5 2 0
6 2 0
#可以看到改用transmute會捨棄其他的表格
> x <- transmute(flights, newColumn1 = month + day, newColumn2 = month - day)
> names(x)
[1] "newColumn1" "newColumn2"
Group_by、Summarize/Summarise
group_by()
把dataset依照某column的值分階層(stratum),之在Summarize的時候就可以自動輸出不同group的統計結果,而事實上
group_by()
後面通常都接Summarize()
做使用。Summarize
對1個多個欄位資料做總結描述,每種物件型態不同,物件所定義的描述方式也不同。
#依照月份分group
> x <- group_by(flights, month)
#來看看哪個月的平均飛行距離最遠吧~
summarize(x, dist = mean(distance, na.rm = TRUE))
# A tibble: 12 × 2
month dist
<int> <dbl>
1 1 1006.844
2 2 1000.982
3 3 1011.987
4 4 1038.733
5 5 1040.913
6 6 1057.125
7 7 1058.596
8 8 1062.138
9 9 1041.250
10 10 1038.876
11 11 1050.305
12 12 1064.656
%>%Pipeline operator
其實就是跟Linux裡面的|
做的語法糖,用來把函式之間的層層嵌套以串流形式來表達。
#下面的情形是B的回傳值會直接作為A的引數。
A(B())
#可以改成這樣,表示B的回傳值pass給A作為引數。
B() %>% A()