2016
O pacote tidyr
é utilizado para preparar data frames que se encontram muitas vezes num formato inapropriado para análise.
library(readr)
library(tidyr)
TB <- read_csv("data/tb.csv")
head(TB,6)
Source: local data frame [6 x 23]
iso2 year new_sp new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524
(chr) (int) (int) (int) (int) (int) (int)
1 AD 1989 NA NA NA NA NA
2 AD 1990 NA NA NA NA NA
3 AD 1991 NA NA NA NA NA
4 AD 1992 NA NA NA NA NA
5 AD 1993 15 NA NA NA NA
6 AD 1994 24 NA NA NA NA
Variables not shown: new_sp_m2534 (int), new_sp_m3544 (int), new_sp_m4554
(int), new_sp_m5564 (int), new_sp_m65 (int), new_sp_mu (int), new_sp_f04
(int), new_sp_f514 (int), new_sp_f014 (int), new_sp_f1524 (int),
new_sp_f2534 (int), new_sp_f3544 (int), new_sp_f4554 (int), new_sp_f5564
(int), new_sp_f65 (int), new_sp_fu (int)
Neste caso nota-se que cada coluna é uma combinação de dados que podem ser separados de forma mais conveniente.
Este pacote permite que manipulemos este género de tabelas para criar uma estrutura mais organizada:
TB %>%
gather("column", "cases", 3:23) %>%
separate("column", c("new", "var", "sexage")) %>%
separate("sexage", c("sex", "age"), 1) -> TB.tidy
TB.tidy[30500:30510,]
# A tibble: 11 x 7
iso2 year new var sex age cases
<chr> <int> <chr> <chr> <chr> <chr> <int>
1 FI 1981 new sp m 2534 NA
2 FI 1982 new sp m 2534 NA
3 FI 1983 new sp m 2534 NA
4 FI 1984 new sp m 2534 NA
5 FI 1985 new sp m 2534 NA
6 FI 1986 new sp m 2534 NA
7 FI 1987 new sp m 2534 NA
8 FI 1988 new sp m 2534 NA
9 FI 1989 new sp m 2534 NA
10 FI 1990 new sp m 2534 NA
11 FI 1991 new sp m 2534 NA
Vamos explorar como funcionam estas funções.
Uma tabela considera-se 'arrumada' (tidy) se cada variável é colocada numa coluna, cada observação numa linha, e cada valor numa célula.
Estes são exemplos da tabelas não arrumadas:
O pacote usa o conceito de chave – valor. A chave dá o contexto ao valor.
Por exemplo, o par “mês: 11” diz-nos que o valor 11 refere-se ao mês do ano, i.e., Novembro. Sem o contexto, o valor 11 seria ambíguo.
Cada célula de uma tabela arrumada contém o valor, enquanto o nome da respectiva coluna identifica a chave:
A função spread()
transforma pares de colunas em colunas arrumadas.
# install.packages("devtools")
# devtools::install_github("garrettgman/DSR")
library(DSR)
head(table2, 3)
# A tibble: 3 x 4
country year key value
<fctr> <int> <fctr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
head(spread(table2,key,value), 3)
# A tibble: 3 x 4
country year cases population
<fctr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
Possui três argumentos: convert=FALSE
que converte strings para tipos mais apropriados; drop=TRUE
que preenche valores que não aparecem na tabela original com o valor dado no 3º argumento fill=NA
.
Um exemplo:
df <- data.frame(n=c("ana","ana","ana","rui","rui"), x=c("a","b","c","a","b"), y=1:5)
df
n x y
1 ana a 1
2 ana b 2
3 ana c 3
4 rui a 4
5 rui b 5
spread(df, x, y, fill=-1)
n a b c
1 ana 1 2 3
2 rui 4 5 -1
A função gather()
faz o inverso de spread()
: transforma um conjunto de colunas numa coluna de chaves, colocando os valores das células numa coluna de valores.
table4
Source: local data frame [3 x 3]
country 1999 2000
(fctr) (int) (int)
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
gather(table4, "year", "cases", 2:3)
Source: local data frame [6 x 3]
country year cases
(fctr) (chr) (int)
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
A função separate()
parte uma coluna em diversas colunas a partir de um separador:
table3
Source: local data frame [6 x 3]
country year rate
(fctr) (int) (chr)
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
separate(table3, "rate", into=c("cases", "population"), sep="/", convert=TRUE)
Source: local data frame [6 x 4]
country year cases population
(fctr) (int) (int) (int)
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
A função pode separar numa qualquer posição:
table3
Source: local data frame [6 x 3]
country year rate
(fctr) (int) (chr)
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
separate(table3, "rate", into=c("col1", "col2", "col3"), sep=c(2,5))
Source: local data frame [6 x 5]
country year col1 col2 col3
(fctr) (int) (chr) (chr) (chr)
1 Afghanistan 1999 74 5/1 9987071
2 Afghanistan 2000 26 66/ 20595360
3 Brazil 1999 37 737 /172006362
4 Brazil 2000 80 488 /174504898
5 China 1999 21 225 8/1272915272
6 China 2000 21 376 6/1280428583
A função unite()
faz o contrário, funde colunas numa só:
table6
Source: local data frame [6 x 4]
country century year rate
(fctr) (chr) (chr) (chr)
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
unite(table6, "new.year", century, year, sep="")
Source: local data frame [6 x 3]
country new.year rate
(fctr) (chr) (chr)
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
TB <- read_csv("data/tb.csv")
TB
Source: local data frame [5,769 x 23]
iso2 year new_sp new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524
(chr) (int) (int) (int) (int) (int) (int)
1 AD 1989 NA NA NA NA NA
2 AD 1990 NA NA NA NA NA
3 AD 1991 NA NA NA NA NA
4 AD 1992 NA NA NA NA NA
5 AD 1993 15 NA NA NA NA
6 AD 1994 24 NA NA NA NA
7 AD 1996 8 NA NA 0 0
8 AD 1997 17 NA NA 0 0
9 AD 1998 1 NA NA 0 0
10 AD 1999 4 NA NA 0 0
.. ... ... ... ... ... ... ...
Variables not shown: new_sp_m2534 (int), new_sp_m3544 (int), new_sp_m4554
(int), new_sp_m5564 (int), new_sp_m65 (int), new_sp_mu (int), new_sp_f04
(int), new_sp_f514 (int), new_sp_f014 (int), new_sp_f1524 (int),
new_sp_f2534 (int), new_sp_f3544 (int), new_sp_f4554 (int), new_sp_f5564
(int), new_sp_f65 (int), new_sp_fu (int)
new
significa se é um novo caso de tuberculose, as duas letras seguintes representam o tipo de caso de TB, a 6ª letra é o sexo do paciente, e os restantes números dizem-nos a faixa etária.
A tabela original não está arrumada. Podemos arrumá-la desta forma:
TB %>%
gather("column", "cases", 3:23) %>%
separate("column", c("new", "var", "sexage")) %>%
separate("sexage", c("sex", "age"), 1)
Source: local data frame [121,149 x 7]
iso2 year new var sex age cases
(chr) (int) (chr) (chr) (chr) (chr) (int)
1 AD 1989 new sp NA NA NA
2 AD 1990 new sp NA NA NA
3 AD 1991 new sp NA NA NA
4 AD 1992 new sp NA NA NA
5 AD 1993 new sp NA NA 15
6 AD 1994 new sp NA NA 24
7 AD 1996 new sp NA NA 8
8 AD 1997 new sp NA NA 17
9 AD 1998 new sp NA NA 1
10 AD 1999 new sp NA NA 4
.. ... ... ... ... ... ... ...
pew <- read_csv("data/pew.csv")
pew
Source: local data frame [18 x 11]
religion <10k 10-20k 20-30k 30-40k 40-50k 50-75k
(chr) (int) (int) (int) (int) (int) (int)
1 Agnostic 27 34 60 81 76 137
2 Atheist 12 27 37 52 35 70
3 Buddhist 27 21 30 34 33 58
4 Catholic 418 617 732 670 638 1116
5 Don<U+0092>t know/refused 15 14 15 11 10 35
6 Evangelical Prot 575 869 1064 982 881 1486
7 Hindu 1 9 7 9 11 34
8 Historically Black Prot 228 244 236 238 197 223
9 Jehovah's Witness 20 27 24 24 21 30
10 Jewish 19 19 25 25 30 95
11 Mainline Prot 289 495 619 655 651 1107
12 Mormon 29 40 48 51 56 112
13 Muslim 6 7 9 10 9 23
14 Orthodox 13 17 23 32 32 47
15 Other Christian 9 7 11 13 13 14
16 Other Faiths 20 33 40 46 49 63
17 Other World Religions 5 2 3 4 2 7
18 Unaffiliated 217 299 374 365 341 528
Variables not shown: 75-100k (int), 100-150k (int), >150k (int), Don't
know/refused (int)
Para arrumá-la:
pew %>%
gather("income", "frequency", -religion)
Source: local data frame [180 x 3]
religion income frequency
(chr) (chr) (int)
1 Agnostic <10k 27
2 Atheist <10k 12
3 Buddhist <10k 27
4 Catholic <10k 418
5 Don<U+0092>t know/refused <10k 15
6 Evangelical Prot <10k 575
7 Hindu <10k 1
8 Historically Black Prot <10k 228
9 Jehovah's Witness <10k 20
10 Jewish <10k 19
.. ... ... ...
A expressão -religion
representa todas as colunas excepto a coluna religion
.
Para efectuar manipulações mais complexas devemos também usar o pacote dplyr
.
Esta tabela tem variáveis guardadas nas colunas e nas linhas:
library(readr)
library(tidyr)
library(dplyr)
weather <- read_csv("data/weather.csv", col_names = FALSE)
names(weather) <- c("header", paste0("d",1:32))
weather
# A tibble: 1,714 x 33
header d1 d2 d3 d4 d5 d6 d7 d8
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 MX000017004195504TMAX 310 310 310 320 330 320 320 330
2 MX000017004195504TMIN 150 150 160 150 160 160 160 160
3 MX000017004195504PRCP 0 0 0 0 0 0 0 0
4 MX000017004195505TMAX 310 310 310 300 300 300 310 310
5 MX000017004195505TMIN 200 160 160 150 150 150 160 160
6 MX000017004195505PRCP 0 0 0 0 0 0 0 0
7 MX000017004195506TMAX 300 290 280 270 280 260 230 270
8 MX000017004195506TMIN 160 160 150 140 160 150 150 150
9 MX000017004195506PRCP 103 435 422 0 243 17 0 0
10 MX000017004195507TMAX 270 270 260 260 240 240 190 210
# ... with 1,704 more rows, and 24 more variables: d9 <chr>, d10 <chr>,
# d11 <chr>, d12 <chr>, d13 <chr>, d14 <chr>, d15 <chr>, d16 <chr>,
# d17 <chr>, d18 <chr>, d19 <chr>, d20 <chr>, d21 <chr>, d22 <chr>,
# d23 <chr>, d24 <chr>, d25 <chr>, d26 <chr>, d27 <chr>, d28 <chr>,
# d29 <chr>, d30 <chr>, d31 <chr>, d32 <chr>
A 1ª coluna tem informação sobre a data e se os dados referem-se à temperatura mínima ou máxima (entre outros tipos de informação).
Primeiro partimos a 1ª coluna em partes:
weather %>%
separate("header", c("id","year","month","elem"), c(11,15,17), convert=TRUE)
# A tibble: 1,714 x 36
id year month elem d1 d2 d3 d4 d5 d6 d7
* <chr> <int> <int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 MX000017004 1955 4 TMAX 310 310 310 320 330 320 320
2 MX000017004 1955 4 TMIN 150 150 160 150 160 160 160
3 MX000017004 1955 4 PRCP 0 0 0 0 0 0 0
4 MX000017004 1955 5 TMAX 310 310 310 300 300 300 310
5 MX000017004 1955 5 TMIN 200 160 160 150 150 150 160
6 MX000017004 1955 5 PRCP 0 0 0 0 0 0 0
7 MX000017004 1955 6 TMAX 300 290 280 270 280 260 230
8 MX000017004 1955 6 TMIN 160 160 150 140 160 150 150
9 MX000017004 1955 6 PRCP 103 435 422 0 243 17 0
10 MX000017004 1955 7 TMAX 270 270 260 260 240 240 190
# ... with 1,704 more rows, and 25 more variables: d8 <chr>, d9 <chr>,
# d10 <chr>, d11 <chr>, d12 <chr>, d13 <chr>, d14 <chr>, d15 <chr>,
# d16 <chr>, d17 <chr>, d18 <chr>, d19 <chr>, d20 <chr>, d21 <chr>,
# d22 <chr>, d23 <chr>, d24 <chr>, d25 <chr>, d26 <chr>, d27 <chr>,
# d28 <chr>, d29 <chr>, d30 <chr>, d31 <chr>, d32 <chr>
A seguir juntamos as colunas dos vários dias numa única coluna:
weather %>%
separate("header", c("id","year","month","elem"), c(11,15,17), convert=TRUE) %>%
gather(day, value, d1:d31, na.rm=TRUE)
# A tibble: 52,967 x 7
id year month elem d32 day value
* <chr> <int> <int> <chr> <chr> <chr> <chr>
1 MX000017004 1955 4 TMAX <NA> d1 310
2 MX000017004 1955 4 TMIN <NA> d1 150
3 MX000017004 1955 4 PRCP <NA> d1 0
4 MX000017004 1955 5 TMAX I d1 310
5 MX000017004 1955 5 TMIN I d1 200
6 MX000017004 1955 5 PRCP I d1 0
7 MX000017004 1955 6 TMAX <NA> d1 300
8 MX000017004 1955 6 TMIN <NA> d1 160
9 MX000017004 1955 6 PRCP <NA> d1 103
10 MX000017004 1955 7 TMAX I d1 270
# ... with 52,957 more rows
Convertemos as colunas day
e value
em valores numéricos:
weather %>%
separate("header", c("id","year","month","elem"), c(11,15,17), convert=TRUE) %>%
gather(day, value, d1:d31, na.rm=TRUE) %>%
mutate(day = extract_numeric(day)) %>%
mutate(value = extract_numeric(value))
# A tibble: 52,967 x 7
id year month elem d32 day value
<chr> <int> <int> <chr> <chr> <dbl> <dbl>
1 MX000017004 1955 4 TMAX <NA> 1 310
2 MX000017004 1955 4 TMIN <NA> 1 150
3 MX000017004 1955 4 PRCP <NA> 1 0
4 MX000017004 1955 5 TMAX I 1 310
5 MX000017004 1955 5 TMIN I 1 200
6 MX000017004 1955 5 PRCP I 1 0
7 MX000017004 1955 6 TMAX <NA> 1 300
8 MX000017004 1955 6 TMIN <NA> 1 160
9 MX000017004 1955 6 PRCP <NA> 1 103
10 MX000017004 1955 7 TMAX I 1 270
# ... with 52,957 more rows
Arrumamos os valores da coluna elem
– que determina as temperaturas mínima e máxima, entre outros dados – em colunas próprias:
weather %>%
separate("header", c("id","year","month","elem"), c(11,15,17), convert=TRUE) %>%
gather(day, value, d1:d31, na.rm=TRUE) %>%
mutate(day = extract_numeric(day)) %>%
mutate(value = extract_numeric(value)) %>%
spread(elem, value)
# A tibble: 27,978 x 11
id year month d32 day PRCP PRCP-9999 TMAX TMAX-9999
* <chr> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MX000017004 1955 4 <NA> 1 0 NA 310 NA
2 MX000017004 1955 4 <NA> 2 0 NA 310 NA
3 MX000017004 1955 4 <NA> 3 0 NA 310 NA
4 MX000017004 1955 4 <NA> 4 0 NA 320 NA
5 MX000017004 1955 4 <NA> 5 0 NA 330 NA
6 MX000017004 1955 4 <NA> 6 0 NA 320 NA
7 MX000017004 1955 4 <NA> 7 0 NA 320 NA
8 MX000017004 1955 4 <NA> 8 0 NA 330 NA
9 MX000017004 1955 4 <NA> 9 0 NA 330 NA
10 MX000017004 1955 4 <NA> 10 0 NA 330 NA
# ... with 27,968 more rows, and 2 more variables: TMIN <dbl>,
# TMIN-9999 <dbl>
Finalmente, escolhemos as colunas que nos interessam:
weather %>%
separate("header", c("id","year","month","elem"), c(11,15,17), convert=TRUE) %>%
gather(day, value, d1:d31, na.rm=TRUE) %>%
mutate(day = extract_numeric(day)) %>%
mutate(value = extract_numeric(value)) %>%
spread(elem, value) %>%
dplyr::select(id, year, month, day, TMIN, TMAX)
# A tibble: 27,978 x 6
id year month day TMIN TMAX
* <chr> <int> <int> <dbl> <dbl> <dbl>
1 MX000017004 1955 4 1 150 310
2 MX000017004 1955 4 2 150 310
3 MX000017004 1955 4 3 160 310
4 MX000017004 1955 4 4 150 320
5 MX000017004 1955 4 5 160 330
6 MX000017004 1955 4 6 160 320
7 MX000017004 1955 4 7 160 320
8 MX000017004 1955 4 8 160 330
9 MX000017004 1955 4 9 160 330
10 MX000017004 1955 4 10 170 330
# ... with 27,968 more rows
Esta tabela não está arrumada, ela tem dados nos nomes das colunas (o número da semana):
bb <- read_csv("data/billboard.csv")
names(bb) <- c(names(bb[1:7]),paste0("wk",1:76))
head(bb,6)
# A tibble: 6 x 83
year artist.inverted track time
<int> <chr> <chr> <chr>
1 2000 Destiny's Child Independent Women Part I 3:38
2 2000 Santana Maria, Maria 4:18
3 2000 Savage Garden I Knew I Loved You 4:07
4 2000 Madonna Music 3:45
5 2000 Aguilera, Christina Come On Over Baby (All I Want Is You) 3:38
6 2000 Janet Doesn't Really Matter 4:17
# ... with 79 more variables: genre <chr>, date.entered <date>,
# date.peaked <date>, wk1 <int>, wk2 <int>, wk3 <int>, wk4 <int>,
# wk5 <int>, wk6 <int>, wk7 <int>, wk8 <int>, wk9 <int>, wk10 <int>,
# wk11 <int>, wk12 <int>, wk13 <int>, wk14 <int>, wk15 <int>,
# wk16 <int>, wk17 <int>, wk18 <int>, wk19 <int>, wk20 <int>,
# wk21 <int>, wk22 <int>, wk23 <int>, wk24 <int>, wk25 <int>,
# wk26 <int>, wk27 <int>, wk28 <int>, wk29 <int>, wk30 <int>,
# wk31 <int>, wk32 <int>, wk33 <int>, wk34 <int>, wk35 <int>,
# wk36 <int>, wk37 <int>, wk38 <int>, wk39 <int>, wk40 <int>,
# wk41 <int>, wk42 <int>, wk43 <int>, wk44 <int>, wk45 <int>,
# wk46 <int>, wk47 <int>, wk48 <int>, wk49 <int>, wk50 <int>,
# wk51 <int>, wk52 <int>, wk53 <int>, wk54 <int>, wk55 <int>,
# wk56 <int>, wk57 <int>, wk58 <int>, wk59 <int>, wk60 <int>,
# wk61 <int>, wk62 <int>, wk63 <int>, wk64 <int>, wk65 <int>,
# wk66 <chr>, wk67 <chr>, wk68 <chr>, wk69 <chr>, wk70 <chr>,
# wk71 <chr>, wk72 <chr>, wk73 <chr>, wk74 <chr>, wk75 <chr>, wk76 <chr>
Arrumar a tabela para ficar desta forma:
# A tibble: 15 x 6
artist date rank genre time
<chr> <date> <chr> <chr> <chr>
1 2 Pac 2000-02-26 87 Rap 4:22
2 2 Pac 2000-03-04 82 Rap 4:22
3 2 Pac 2000-03-11 72 Rap 4:22
4 2 Pac 2000-03-18 77 Rap 4:22
5 2 Pac 2000-03-25 87 Rap 4:22
6 2 Pac 2000-04-01 94 Rap 4:22
7 2 Pac 2000-04-08 99 Rap 4:22
8 2Ge+her 2000-09-02 91 R&B 3:15
9 2Ge+her 2000-09-09 87 R&B 3:15
10 2Ge+her 2000-09-16 92 R&B 3:15
11 3 Doors Down 2000-04-08 81 Rock 3:53
12 3 Doors Down 2000-04-15 70 Rock 3:53
13 3 Doors Down 2000-04-22 68 Rock 3:53
14 3 Doors Down 2000-04-29 67 Rock 3:53
15 3 Doors Down 2000-05-06 66 Rock 3:53
# ... with 1 more variables: track <chr>
Dica: a coluna date
pode ser calculada por:
mutate(date = as.Date(date.entered) + 7 * (week - 1))