modelr
disponibiliza funções que permitem o uso de modelos dentro dos fluxos de dados.model.fit <- lm(mpg ~ wt, data=mtcars) # fazemos uma regressão linear
ggplot(mtcars, aes(x=wt, y=mpg)) + theme_bw() +
geom_point() +
geom_abline(slope = coef(model.fit)[2], intercept = coef(model.fit)[1], color="red")
O pacote permite juntar a previsão e o residual à data frame original
mtcars %>%
add_predictions(model.fit) %>% # são adicionas novas colunas
add_residuals(model.fit) -> df
ggplot(df, aes(x=wt, y=mpg)) + theme_bw() +
geom_point() +
geom_abline(slope = coef(model.fit)[2], intercept = coef(model.fit)[1], color="red") +
geom_segment(aes(x=wt, y=pred, xend=wt, yend=pred+resid), color="orange")
O data frame gapminder tem um sumário da esperança média de vida e do PIB pe capita, ao longo dos anos, para vários países do mundo:
library(gapminder)
gapminder
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fctr> <fctr> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
7 Afghanistan Asia 1982 39.854 12881816 978.0114
8 Afghanistan Asia 1987 40.822 13867957 852.3959
9 Afghanistan Asia 1992 41.674 16317921 649.3414
10 Afghanistan Asia 1997 41.763 22227415 635.3414
# ... with 1,694 more rows
Podemos fazer as regressões lineares para todos os paises, mas o resultado é obscuro por haver demasiados modelos:
gapminder %>%
ggplot(aes(year, lifeExp, group = country)) + theme_bw() +
geom_line(alpha = 1/3)
Para um país pode visualizar-se bem a regressão linear:
pt <- dplyr::filter(gapminder, country == "Portugal")
pt %>%
ggplot(aes(year, lifeExp)) + theme_bw() +
geom_line() +
ggtitle("Full data = ")
pt_mod <- lm(lifeExp ~ year, data = pt)
pt %>%
add_predictions(pt_mod) %>%
ggplot(aes(year, pred)) + theme_bw() +
geom_line() +
ggtitle("Linear trend + ")
pt %>%
add_residuals(pt_mod) %>%
ggplot(aes(year, resid)) + theme_bw() +
geom_hline(yintercept = 0, colour = "white", size = 3) +
ylim(-20, 20) +
geom_line() +
ggtitle("Remaining pattern")
Mas como fazer para todos os países?
A função nest
cria um data frame com os dados de cada grupo que recebe:
gapminder %>%
group_by(country, continent) %>%
tidyr::nest() -> by_country
by_country
# A tibble: 142 x 3
country continent data
<fctr> <fctr> <list>
1 Afghanistan Asia <tibble [12 x 4]>
2 Albania Europe <tibble [12 x 4]>
3 Algeria Africa <tibble [12 x 4]>
4 Angola Africa <tibble [12 x 4]>
5 Argentina Americas <tibble [12 x 4]>
6 Australia Oceania <tibble [12 x 4]>
7 Austria Europe <tibble [12 x 4]>
8 Bahrain Asia <tibble [12 x 4]>
9 Bangladesh Asia <tibble [12 x 4]>
10 Belgium Europe <tibble [12 x 4]>
# ... with 132 more rows
Vejamos a primeira data frame com os dados do Afeganistão:
by_country[1,]$data[[1]]
# A tibble: 12 x 4
year lifeExp pop gdpPercap
<int> <dbl> <int> <dbl>
1 1952 28.801 8425333 779.4453
2 1957 30.332 9240934 820.8530
3 1962 31.997 10267083 853.1007
4 1967 34.020 11537966 836.1971
5 1972 36.088 13079460 739.9811
6 1977 38.438 14880372 786.1134
7 1982 39.854 12881816 978.0114
8 1987 40.822 13867957 852.3959
9 1992 41.674 16317921 649.3414
10 1997 41.763 22227415 635.3414
11 2002 42.129 25268405 726.7341
12 2007 43.828 31889923 974.5803
Agora temos os dados que queremos modelar com regressão linear, organizados por linhas. Seja a seguinte função que aplica a regressão:
country_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
Como aplicar a todos os países?
by_country %>%
dplyr::mutate(model = purrr::map(data, country_model)) -> by_country
Ou seja, colocámos os modelos na data frame inicial!
by_country
# A tibble: 142 x 4
country continent data model
<fctr> <fctr> <list> <list>
1 Afghanistan Asia <tibble [12 x 4]> <S3: lm>
2 Albania Europe <tibble [12 x 4]> <S3: lm>
3 Algeria Africa <tibble [12 x 4]> <S3: lm>
4 Angola Africa <tibble [12 x 4]> <S3: lm>
5 Argentina Americas <tibble [12 x 4]> <S3: lm>
6 Australia Oceania <tibble [12 x 4]> <S3: lm>
7 Austria Europe <tibble [12 x 4]> <S3: lm>
8 Bahrain Asia <tibble [12 x 4]> <S3: lm>
9 Bangladesh Asia <tibble [12 x 4]> <S3: lm>
10 Belgium Europe <tibble [12 x 4]> <S3: lm>
# ... with 132 more rows
Isto é especialmente útil porque mantemos tudo organizado no mesmo data frame.
Podemos juntar os resíduos de cada modelo na própria data frame:
by_country %>%
dplyr::mutate( resids = purrr::map2(data, model, modelr::add_residuals) ) -> by_country
by_country
# A tibble: 142 x 5
country continent data model resids
<fctr> <fctr> <list> <list> <list>
1 Afghanistan Asia <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
2 Albania Europe <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
3 Algeria Africa <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
4 Angola Africa <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
5 Argentina Americas <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
6 Australia Oceania <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
7 Austria Europe <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
8 Bahrain Asia <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
9 Bangladesh Asia <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
10 Belgium Europe <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
# ... with 132 more rows
Um exemplo de uso deste data frame: mostrar os residuos para cada continente:
by_country %>%
unnest(resids) %>% # transforma-a numa data frame simples
ggplot(aes(year, resid, group = country)) + theme_bw() +
geom_line(alpha = 1 / 3) +
facet_wrap(~continent)
O modelo linear não representa bem a dinâmica de Africa.
Podemos querer analisar a qualidade dos modelos que criámos. O pacote broom
ajuda-nos nessa tarefa, pois dá-nos funcionalidades de ir buscar informação aos modelos do R de forma organizada:
broom::glance(pt_mod) # glance() recolhe diversas medidas de qualidade
r.squared adj.r.squared sigma statistic p.value df logLik
1 0.9690351 0.9659386 1.139705 312.9461 7.097907e-09 2 -17.50257
AIC BIC deviance df.residual
1 41.00514 42.45986 12.98928 10
Então podemos adicionar estas medidas à nossa data frame:
by_country %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance) -> glance
glance
# A tibble: 142 x 16
country continent data model resids
<fctr> <fctr> <list> <list> <list>
1 Afghanistan Asia <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
2 Albania Europe <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
3 Algeria Africa <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
4 Angola Africa <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
5 Argentina Americas <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
6 Australia Oceania <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
7 Austria Europe <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
8 Bahrain Asia <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
9 Bangladesh Asia <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
10 Belgium Europe <tibble [12 x 4]> <S3: lm> <tibble [12 x 5]>
# ... with 132 more rows, and 11 more variables: r.squared <dbl>,
# adj.r.squared <dbl>, sigma <dbl>, statistic <dbl>, p.value <dbl>,
# df <int>, logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
# df.residual <int>
Vejamos por exemplo a medida \( R^2 \):
glance %>%
ggplot(aes(continent, r.squared)) + theme_bw() +
geom_jitter(width = 0.5)
Quais são os países onde a hipótese linear é menos adequada?
bad_fit <- dplyr::filter(glance, r.squared < 0.25)
gapminder %>%
dplyr::semi_join(bad_fit, by = "country") %>%
ggplot(aes(year, lifeExp, colour = country)) + theme_bw() +
geom_line()
Estes são efeitos da tragédia da SIDA e do genocídio no Ruanda.
O pacote também pode organizar os dados com uma linha por cada coeficiente do modelo:
by_country %>%
mutate(tidy = map(model, broom::tidy)) %>%
unnest(tidy)
# A tibble: 284 x 7
country continent term estimate std.error statistic
<fctr> <fctr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Asia (Intercept) -507.5342716 40.484161954 -12.536613
2 Afghanistan Asia year 0.2753287 0.020450934 13.462890
3 Albania Europe (Intercept) -594.0725110 65.655359062 -9.048348
4 Albania Europe year 0.3346832 0.033166387 10.091036
5 Algeria Africa (Intercept) -1067.8590396 43.802200843 -24.379118
6 Algeria Africa year 0.5692797 0.022127070 25.727749
7 Angola Africa (Intercept) -376.5047531 46.583370599 -8.082385
8 Angola Africa year 0.2093399 0.023532003 8.895964
9 Argentina Americas (Intercept) -389.6063445 9.677729641 -40.258031
10 Argentina Americas year 0.2317084 0.004888791 47.395847
# ... with 274 more rows, and 1 more variables: p.value <dbl>
A função augment atribui a cada linha as respectivas estatísticas, como residuais:
by_country %>%
mutate(augm = map(model, broom::augment)) %>%
unnest(augm)
# A tibble: 1,704 x 11
country continent lifeExp year .fitted .se.fit .resid
<fctr> <fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 Afghanistan Asia 28.801 1952 29.90729 0.6639995 -1.10629487
2 Afghanistan Asia 30.332 1957 31.28394 0.5799442 -0.95193823
3 Afghanistan Asia 31.997 1962 32.66058 0.5026799 -0.66358159
4 Afghanistan Asia 34.020 1967 34.03722 0.4358337 -0.01722494
5 Afghanistan Asia 36.088 1972 35.41387 0.3848726 0.67413170
6 Afghanistan Asia 38.438 1977 36.79051 0.3566719 1.64748834
7 Afghanistan Asia 39.854 1982 38.16716 0.3566719 1.68684499
8 Afghanistan Asia 40.822 1987 39.54380 0.3848726 1.27820163
9 Afghanistan Asia 41.674 1992 40.92044 0.4358337 0.75355828
10 Afghanistan Asia 41.763 1997 42.29709 0.5026799 -0.53408508
# ... with 1,694 more rows, and 4 more variables: .hat <dbl>,
# .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>