This is an introductory vignette to the
newsanchor package. The package connects to https://newsapi.org/.
News API is a simple HTTP REST API for searching and
retrieving (live) articles from all over the web. You can get breaking
news headlines, and search for articles from over 30,000 news sources
and blogs.
The API can generally be used for free for
non-commercial use cases. However, please be aware, that this results in
some restrictions. For instance, you cannot download more than 1,000
results in a given search. Please see https://newsapi.org/pricing for commercial
options.
The package helps you to answer questions like the following:
The package is publicly available at GitHub and hopefully soon at
CRAN as well. In this vignette we demonstrate basic features of the
package. In a second vignette, we show how the package can be used to
scrape your results from an example online newspaper (i.e., New York
Times).
You need an API key to access https://newsapi.org. You can apply for a key at https://newsapi.org/register/. The function below
appends the key automatically to your R environment file. Hence, every
time you start R, the key is loaded. The functions
get_headlines
, get_headlines_all
,
get_everything
, get_everything_all
, and
get_sources
access your key automatically by executing
Sys.getenv("NEWS_API_KEY")
. Alternatively, you can provide
an explicit definition of your api_key with each function call.
# save the api_key in the .Renviron file
set_api_key(api_key = "YOUR API KEY",
path = "~/.Renviron")
These functions connect to the API’s endpoint that returns breaking news headlines for a country and category, or currently running on a single or multiple sources. This is perfect if you want to download live up-to-date news headlines and images. You can provide specific news sources, categories, countries, and/or queries. For valid searchterms see the respective paragraph below.
# get headlines published by the Washington Post
results <- get_headlines(sources = "the-washington-post")
# get headlines published in the category sports
results <- get_headlines(category = "sports")
# get headlines published in Germany
results <- get_headlines(country = "de")
# get headlines published about Trump
results <- get_headlines(query = "Trump")
Limitations
category
, country
, or query
.
Additional elements will be ignored.sources
must not be combined with
country
and/or category
.
Additional settings
Regarding
get_headlines()
, the default of https://newsapi.org only
allows to get a maximum of 100 results per request. Since some search
terms might yield more results, you can use the option
page to browse through your results (per default: page
= 1). You can change the number of results returned per request using
page_size (per default: page_size = 100).
results <- get_headlines(category = "sports", page = 2)
results <- get_headlines(category = "sports", page = 3, page_size = 20)
get_headlines_all
To automatically download all results,
use get_headlines_all()
. This function is build around
get_headlines
and provides the same options (except for
page
and page_size
). It returns a data frame
with all results.
Searchterms
We provide different dataframes
:terms_category
, terms_sources
, and
terms_country
. They contain information on valid search
terms. Access them if you are not sure whether some country or some
sources will work.
This endpoint searches through articles from large and small news
sources and blogs. This includes news as well as other kinds of normal
articles. This function needs a compulsory query
with each
function call. You can make your search more fine-tuned using the
following additional options:
# get everything published about Trump
results <- get_everything(query = "Trump")
# get everything published with the phrase "Trump says"
results <- get_everything(query = "Trump says +migrants")
# get everything published with the phrase "Trump" and "migrants" but without "Mexico"
results <- get_everything(query = "Trump +migrants -Mexico")
As stated above, you always have to provide a query
with the function call. However, you can limit your results to multiple
sources, different languages, or
different domains. You are also able to
exclude_domains or mark the start
(from) or end (to) time of your
search.
# get everything published about Trump
results <- get_everything(query = "Trump")
# get everything published about Trump in the Washington Post
results <- get_everything(query = "Trump", sources = "the-washington-post")
# get everything published about Trump in french
results <- get_everything(query = "Trump", languages = "fr")
# get everything published about Trump at bbc.com
results <- get_everything(query = "Trump", domains = "bbc.com")
# get everything published about Trump BUT NOT at bbc.com
results <- get_everything(query = "Trump", exclude_domains = "bbc.com")
# get everything published about Trump BUT NOT at bbc.com
results <- get_everything(query = "Trump", from = "2018-09-08")
Limitations
sources
, there is a limitation to
20 sources.from
and to
should be in
ISO 8601 format.language
needs to be a two digits ISO 639-1 code.
You can only use one language for a search, default is all. See
newsanchor::terms_language
for all possible languages.
Additional settings
The default of https://newsapi.org only
allows to get a maximum of 100 results per search. Since some search
terms might yield much more results, you can use the option
page to browse throught your results (per default: page
= 1). You can change the number of results returned per request using
page_size (per default: page_size = 100).
results <- get_everything(query = "Trump", page = 2)
results <- get_everything(query = "Trump", page = 3, page_size = 20)
In addition, you can sort (sort_by) the order of the
articles. Possible options are relevancy
(articles more
closely related to the query come first), popularity
(articles from popular sources and publishers come first) and
publishedAt
(newest articles come first). Per default
articles are sorted by publishedAt
.
# sort results by relevancy
results <- get_everything(query = "Trump", sort_by = "relevancy")
# sort results by popularity
results <- get_everything(query = "Trump", sort_by = "popularity")
# sort results by date
results <- get_everything(query = "Trump", sort_by = "publishedAt")
get_everything_all
To automatically download all
results, use get_everything_all. This function is build
around get_everything
and provides the same options (except
for page and page_size). Be aware that here, the limit
of 1,000 articles kicks in if you do not have a paid plan. The function
will then provide you with a dataset of these 1,000 results and stop
there.
Searchterms
We provide several dataframes with valid
search terms. You can find all dataframes below:
This function returns the subset of news publishers that top
headlines are available from. This information is also provided in a
dataframe terms_sources
. However, this function also allows
to return sources for specific categories or languages.
We provide several dataframes with possible options to use in your search for breaking headlines or news. These dataframes are provided in the data folder of the package:
Description:
terms_category
provides possible
categories (e.g., sports) you want to get headlines for. This dataframe
is relevant in conjunction with get_headlines
.terms_country
provides 2-letter ISO
3166-1 code of the country you want to get headlines for. This dataframe
is relevant in conjunction with get_headlines
.terms_language
provides 2-letter
ISO-639-1 code of the language you want to get news for. This dataframe
is relevant in conjunction with get_everything
.terms_sources
provides provides possible
news sources or blogs you want to get news from. This dataframe is
relevant in conjunction with get_everything
. Finally, we provide a datafrme called sample response
,
which was mainly created for demonstrating purposes. The data set is
used in the “Scrape New York Times Online Articles” vignette.