---
title: "Usage of newsanchor"
author: "Lars Schulze"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Usage of newsanchor}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE, eval = FALSE, fig.align="center")
```

```{r libraries}
library(newsanchor)
```

This is an introductory vignette to the **newsanchor** package. The package connects to https://newsapi.org/. **News API** is a simple HTTP REST API for searching and retrieving (live) articles from all over the web. You can get breaking news headlines, and search for articles from over 30,000 news sources and blogs.
The API is generally free for non-commercial use. Be aware, however, that the free plan comes with some restrictions. For instance, you cannot download more than 1,000 results for a given search. See https://newsapi.org/pricing for commercial options.

The package helps you to answer questions like the following:

- What are the **top stories** running right now in the world?
- What **articles** were published about your favorite politician, movie star, singer, or celebrity **today**?
- What is the breaking news in **business**, **entertainment**, or **sports**?

The package is publicly available on GitHub and hopefully soon on CRAN as well. In this vignette, we demonstrate basic features of the package. In a second vignette, we show how the package can be used to scrape the articles in your results from an example online newspaper (i.e., the New York Times).

### Authentication

You need an API key to access https://newsapi.org. You can apply for a key at https://newsapi.org/register/. The function below appends the key to your R environment file, so the key is loaded every time you start R. The functions `get_headlines`, `get_headlines_all`, `get_everything`, `get_everything_all`, and `get_sources` then retrieve your key automatically by executing `Sys.getenv("NEWS_API_KEY")`. Alternatively, you can provide your `api_key` explicitly with each function call.

```{r set_api_key}
# save the api_key in the .Renviron file
set_api_key(api_key = "YOUR API KEY", path = "~/.Renviron")
```

### Functions: get_headlines() / get_headlines_all()

These functions connect to the API endpoint that returns breaking news headlines for a country and category, or the headlines currently running on one or more sources. This is perfect if you want to download live, up-to-date news headlines and images. You can provide **specific news sources**, **categories**, **countries**, and/or **queries**. For valid search terms, see the respective paragraph below.

```{r headlines: examples}
# get headlines published by the Washington Post
results <- get_headlines(sources = "the-washington-post")
# get headlines published in the category sports
results <- get_headlines(category = "sports")
# get headlines published in Germany
results <- get_headlines(country = "de")
# get headlines published about Trump
results <- get_headlines(query = "Trump")
```

*Limitations*

- Please note that only **one** search term can be used for `category`, `country`, or `query`. Additional elements will be ignored.
- In addition, `sources` must not be combined with `country` and/or `category`.

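
The two limitations above can be checked before a request is sent. The following is a minimal sketch in base R: `check_headline_args()` is a hypothetical helper, not part of **newsanchor**, and it only mirrors the rules stated in this vignette rather than the package's internal validation.

```{r headlines: check arguments}
# Hypothetical helper (not part of newsanchor): mirrors the limitations above.
check_headline_args <- function(sources = NULL, category = NULL,
                                country = NULL, query = NULL) {
  # `sources` must not be combined with `country` and/or `category`
  if (!is.null(sources) && (!is.null(country) || !is.null(category))) {
    stop("`sources` cannot be combined with `country` or `category`.")
  }
  # only one search term is used for `category`, `country`, and `query`
  first_only <- function(x, name) {
    if (length(x) > 1) {
      warning("Only the first element of `", name, "` is used.")
      x <- x[1]
    }
    x
  }
  list(sources  = sources,
       category = first_only(category, "category"),
       country  = first_only(country, "country"),
       query    = first_only(query, "query"))
}

args <- check_headline_args(category = "sports", country = "de")
# results <- get_headlines(category = args$category, country = args$country)
```

Failing early on an invalid combination avoids spending one of your API requests on a call that cannot succeed.
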

*Additional settings*

For `get_headlines()`, https://newsapi.org returns at most 100 results per request. Since some search terms might yield more results, you can use the option **page** to browse through your results (default: `page = 1`). You can change the number of results returned per request using **page_size** (default: `page_size = 100`).

```{r headlines: page, page_size}
results <- get_headlines(category = "sports", page = 2)
results <- get_headlines(category = "sports", page = 3, page_size = 20)
```

*get_headlines_all*

To automatically download all results, use `get_headlines_all()`. This function is built around `get_headlines` and provides the same options (except for `page` and `page_size`). It returns a data frame with all results.

```{r headlines: get_all}
results <- get_headlines_all(category = "sports")
```

*Searchterms*

We provide different dataframes: `terms_category`, `terms_sources`, and `terms_country`. They contain information on valid search terms. Access them if you are not sure whether a particular country or source will work.

```{r headlines: searchterms}
terms_category
terms_country
terms_sources
```

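
Checking a term against these dataframes can be sketched as follows. `is_valid_term()` is a hypothetical helper, not part of **newsanchor**, and `example_terms` is stand-in data with an assumed column name. The helper searches every column, so it makes no assumption about how the packaged dataframes are laid out.

```{r headlines: check searchterms}
# Hypothetical helper (not part of newsanchor): TRUE if `term` occurs in any
# column of a search-term dataframe.
is_valid_term <- function(term, terms) {
  term %in% unlist(lapply(terms, as.character), use.names = FALSE)
}

# stand-in data for illustration only; use `terms_country` in practice
example_terms <- data.frame(country = c("de", "fr", "us"),
                            stringsAsFactors = FALSE)

is_valid_term("de", example_terms)
is_valid_term("xx", example_terms)
# is_valid_term("de", terms_country)
```
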
### Functions: get_everything() / get_everything_all()

This endpoint searches through articles from large and small news sources and blogs. This includes news as well as other kinds of regular articles. The function requires a `query` with each function call. You can fine-tune your search using the following additional options:

- surround entire phrases with quotes ("") for exact matches.
- prepend words/phrases that must appear with the "+" symbol (e.g., +bitcoin).
- prepend words that must not appear with the "-" symbol (e.g., -bitcoin).
- you can also use the AND, OR, NOT keywords (optionally grouped with parentheses, e.g., 'crypto AND (ethereum OR litecoin) NOT bitcoin').

```{r everything: search and advanced search}
# get everything published about Trump
results <- get_everything(query = "Trump")
# get everything matching "Trump says" that must also mention "migrants"
results <- get_everything(query = "Trump says +migrants")
# get everything about "Trump" that mentions "migrants" but not "Mexico"
results <- get_everything(query = "Trump +migrants -Mexico")
```

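
Because an exact-match phrase needs double quotes inside the R string, assembling the query by hand is easy to get wrong. The sketch below shows one way to combine the operators listed above. `build_query()` is a hypothetical helper, not part of **newsanchor**, and it covers only exact phrases, "+" terms, and "-" terms.

```{r everything: build query}
# Hypothetical helper (not part of newsanchor): assembles a query string
# from the operators listed above.
build_query <- function(phrases = character(), must = character(),
                        must_not = character()) {
  parts <- c(
    if (length(phrases) > 0) paste0('"', phrases, '"'),  # exact matches
    if (length(must) > 0) paste0("+", must),             # must appear
    if (length(must_not) > 0) paste0("-", must_not)      # must not appear
  )
  paste(parts, collapse = " ")
}

query <- build_query(phrases = "Trump says", must = "migrants",
                     must_not = "Mexico")
query
# results <- get_everything(query = query)
```

Here `query` holds the string `"Trump says" +migrants -Mexico`, with the double quotes being part of the search term.
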
As stated above, you always have to provide a `query` with the function call. However, you can limit your results to multiple **sources**, a specific **language**, or different **domains**. You can also exclude domains (**exclude_domains**) or mark the start (**from**) or end (**to**) time of your search.

```{r everything: examples}
# get everything published about Trump
results <- get_everything(query = "Trump")
# get everything published about Trump in the Washington Post
results <- get_everything(query = "Trump", sources = "the-washington-post")
# get everything published about Trump in French
results <- get_everything(query = "Trump", language = "fr")
# get everything published about Trump at bbc.com
results <- get_everything(query = "Trump", domains = "bbc.com")
# get everything published about Trump BUT NOT at bbc.com
results <- get_everything(query = "Trump", exclude_domains = "bbc.com")
# get everything published about Trump since 2018-09-08
results <- get_everything(query = "Trump", from = "2018-09-08")
```

*Limitations*

- If you use the option `sources`, you are limited to 20 sources.
- The options `from` and `to` should be in ISO 8601 format.
- The `language` needs to be a two-letter ISO 639-1 code. You can only use one language per search; the default is all languages. See `newsanchor::terms_language` for all possible languages.
- Some limitations apply only to the free plan of https://newsapi.org. For example, you cannot access articles that were published more than a month ago. This is only possible with a paid plan.

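
The date limitations above can be handled with base R. The following is a minimal sketch that builds ISO 8601 date strings for the last seven days, a window chosen here only as an example that stays well within the one-month reach of the free plan.

```{r everything: date range}
# ISO 8601 dates (YYYY-MM-DD) for the last seven days; the free plan only
# reaches back about one month, so the window is kept short.
to   <- Sys.Date()
from <- to - 7

from_iso <- format(from, "%Y-%m-%d")
to_iso   <- format(to, "%Y-%m-%d")

# results <- get_everything(query = "Trump", from = from_iso, to = to_iso)
```

Deriving both dates from `Sys.Date()` keeps the example valid whenever it is run, unlike a hard-coded date that eventually falls outside the free plan's window.
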
*Additional settings*

By default, https://newsapi.org returns at most 100 results per request. Since some search terms might yield many more results, you can use the option **page** to browse through your results (default: `page = 1`). You can change the number of results returned per request using **page_size** (default: `page_size = 100`).

```{r everything: page, page_size}
results <- get_everything(query = "Trump", page = 2)
results <- get_everything(query = "Trump", page = 3, page_size = 20)
```

In addition, you can sort (**sort_by**) the order of the articles. Possible options are `relevancy` (articles more closely related to the query come first), `popularity` (articles from popular sources and publishers come first), and `publishedAt` (newest articles come first). By default, articles are sorted by `publishedAt`.

```{r everything: sort}
# sort results by relevancy
results <- get_everything(query = "Trump", sort_by = "relevancy")
# sort results by popularity
results <- get_everything(query = "Trump", sort_by = "popularity")
# sort results by date
results <- get_everything(query = "Trump", sort_by = "publishedAt")
```

*get_everything_all*

To automatically download all results, use **get_everything_all**. This function is built around `get_everything` and provides the same options (except for *page* and *page_size*). Be aware that the limit of 1,000 articles applies here if you do not have a paid plan: the function returns a dataset with these 1,000 results and stops there.

```{r everything: get_all}
results <- get_everything_all(query = "Trump")
```

*Searchterms*

We provide several dataframes with valid search terms. You can find all dataframes below:

```{r everything: searchterms}
terms_language
terms_sources
terms_country
terms_category
```

### Functions: get_sources

This function returns the subset of news publishers for which top headlines are available. This information is also provided in the dataframe `terms_sources`. However, this function can also return sources for specific categories or languages.

```{r sources}
publisher <- get_sources()
```

### Datasets / Searchterms

We provide several dataframes with possible options to use in your search for breaking headlines or news. These dataframes are provided in the data folder of the package:

```{r searchterms: dataframes}
terms_category
terms_country
terms_language
terms_sources
```

*Description:*

- The dataframe `terms_category` provides the possible categories (e.g., sports) you can get headlines for. This dataframe is relevant in conjunction with `get_headlines`.
- The dataframe `terms_country` provides the 2-letter ISO 3166-1 codes of the countries you can get headlines for. This dataframe is relevant in conjunction with `get_headlines`.
- The dataframe `terms_language` provides the 2-letter ISO 639-1 codes of the languages you can get news for. This dataframe is relevant in conjunction with `get_everything`.
- The dataframe `terms_sources` provides the possible news sources or blogs you can get news from. This dataframe is relevant in conjunction with `get_everything`.

Finally, we provide a dataframe called `sample_response`, which was mainly created for demonstration purposes. The dataset is used in the "Scrape New York Times Online Articles" vignette.