Introducing fodr: a package for French open data in R
Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.
Source from Wikipedia. Post image from Rude Baguette.
Nowadays, more and more government organisations subscribe to the open data movement, and some in France have done so in the hope that new services or insights will come from the analysis of this data. To name a few:
- the French government,
- the city of Paris,
- ERDF, the French electricity distribution company,
- SNCF, the French railroad company,
- etc.
Recently, due to the ever-increasing number of open data portals, the Open Data Inception portal was created by the OpenDataSoft company. The aim of this portal is to provide a comprehensive list of open data portals around the world.
As it happens, many of the French open data portals are available through the OpenDataSoft Open Data platform and thus share a common API.
Some of the datasets on these portals caught my eye recently, and I downloaded a few of them to tinker with in my spare time. I quickly realised that, even though the API is well designed, the process is a bit cumbersome if you want to use the data in R: you have to go to a specific portal, find the dataset you’re looking for, download it to your disk, and then go back to R and import it.
I thought someone would already have created a package to simplify this, but I was mistaken. Thus, fodr was born.
The package
Getting the package
The package is hosted on GitHub and can be installed via devtools:
devtools::install_github("tutuchan/fodr")
Available portals
The list of currently available portals can be accessed with the list_portals() function:
library(fodr)
library(dplyr)
library(leaflet)
list_portals() %>%
  select(portals, base_urls)
## Source: local data frame [13 x 2]
##
## portals base_urls
## <chr> <chr>
## 1 ratp http://data.ratp.fr
## 2 iledefrance http://data.iledefrance.fr
## 3 infogreffe http://datainfogreffe.fr
## 4 toulouse https://data.toulouse-metropole.fr
## 5 star https://data.explore.star.fr
## 6 issy http://data.issy.com
## 7 stif http://opendata.stif.info
## 8 paris http://opendata.paris.fr
## 9 04 http://tourisme04.opendatasoft.com
## 10 62 http://tourisme62.opendatasoft.com
## 11 92 https://opendata.hauts-de-seine.fr
## 12 enesr http://data.enseignementsup-recherche.gouv.fr
## 13 erdf https://data.erdf.fr
At the time of writing, only 13 portals are available through fodr, for several reasons:
- fodr only handles the OpenDataSoft API and many of the French open data portals use the ArcGIS Open platform,
- some do not have an API, like the data.gouv or Grand Lyon websites,
- some use the navitia.io API, like the SNCF website,
- some simply do not have data
Note from 2016/05/31 19:24: Joel Gombin graciously pointed out to me that the data.gouv DOES have an API (available here).
To obtain this list, I took the Open Data Inception dataset, tried to call an API method on each portal to see whether it returned any results, and then curated the remaining entries. I’m pretty sure there is a better way to do it:
library(jsonlite)
library(curl)
# Ask for a single row: only the total number of hits (nhits) is needed here
number_of_datasets <- fromJSON("http://public.opendatasoft.com/api/records/1.0/search/?dataset=open-data-sources&facet=country&refine.country=France&rows=1")$nhits
# Fetch every French entry of the Open Data Inception dataset in one request
listDatasets <- fromJSON(paste0("http://public.opendatasoft.com/api/records/1.0/search/?dataset=open-data-sources&facet=country&refine.country=France&rows=", number_of_datasets))
# Query a portal's catalog search endpoint; safely() captures errors instead of throwing them
fetch_data <- purrr::safely(function(x) "nhits" %in% names(fromJSON(paste0(x, "/api/datasets/1.0/search/"))))
d <- lapply(seq_len(number_of_datasets), function(i) {
  # Skip entries whose description is exactly "Arcgis"
  if ("Arcgis" %in% listDatasets$records$fields$description[i]) return(NULL)
  x <- listDatasets$records$fields$url[i]
  # Keep the portal if the request did not raise an error
  if (is.null(fetch_data(x)$error)) dplyr::data_frame(name = listDatasets$records$fields$name[i], url = x) else NULL
})
df <- dplyr::bind_rows(d)
Structure
I created two classes using the R6 package: FODRPortal and FODRDataset. I also created two wrappers around these classes to skip the <class>$new(...) syntax, which I’m not a fan of:
# Instantiate a FODRPortal
fodr_portal(portal)
# Instantiate a FODRDataset
fodr_dataset(portal, id)
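The wrapper pattern described above can be sketched as follows. This is only a minimal illustration, assuming the R6 package is installed: DemoPortal and demo_portal are hypothetical names invented for the example, and fodr's real classes carry far more fields and logic. The sketch also defines a print method, which is how an R6 class customises what appears at the console.

```r
library(R6)

# DemoPortal is a hypothetical stand-in, not fodr's real FODRPortal class.
DemoPortal <- R6Class(
  "DemoPortal",
  public = list(
    portal = NULL,
    initialize = function(portal) {
      self$portal <- portal
    },
    # A custom print method controls what is shown at the console.
    print = function(...) {
      cat("DemoPortal object\n")
      cat("Portal:", self$portal, "\n")
      invisible(self)
    }
  )
)

# Plain-function wrapper that hides the DemoPortal$new(...) call.
demo_portal <- function(portal) {
  DemoPortal$new(portal = portal)
}

p <- demo_portal("paris")
p
```

The wrapper adds nothing functionally; it only lets callers write demo_portal("paris") instead of DemoPortal$new("paris").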
Portals
Instantiation
Access to a portal is granted through the fodr_portal function:
portal <- fodr_portal("paris")
portal
## FODRPortal object
## --------------------------------------------------------------------
## Portal: paris
## Number of datasets: 175
## Themes:
## - Administration
## - Citoyens
## - Commerces
## - Culture
## - Déplacements
## - Environnement
## - Finances
## - Services
## - Urbanisme
## --------------------------------------------------------------------
The print method has been overloaded to show useful information about the portal. Here, it shows that this portal has 175 datasets divided among several themes.
Fetching data - the search method
A FODRPortal has only one method: search. It leverages the OpenDataSoft Catalog API to find relevant datasets. By default, calling the search method retrieves all elements that satisfy the query, unlike the OpenDataSoft API, which only retrieves 10.
This function takes several arguments, which are detailed in the OpenDataSoft documentation. I also added another argument, theme, which returns all datasets that fall under a specific theme.
list_datasets <- portal$search(theme = "Culture")
## 24 datasets found ...
Datasets obtained by this method are returned directly as a list and are also stored in the data field of the portal object:
list_datasets[[1]]
## FODRDataset object
## --------------------------------------------------------------------
## Dataset id: postes-publics-des-bibliotheques-de-pret
## Theme: Culture
## Keywords: postes publics, ordinateurs, catalogues en ligne, bibliothèques, prêt
## Publisher: Service Informatique des Bibliothèques
## --------------------------------------------------------------------
## Number of records: 230
## Number of files: 0
## Modified: 2014-07-30
## Facets: etablissement, type_de_poste_public
## Sortables: nombre_d_ordinateurs
## --------------------------------------------------------------------
Datasets
Instantiation
As shown just above, datasets can be retrieved either from a call to portal$search or, if you know the portal and id, by a call to fodr_dataset:
dts <- fodr_dataset(portal = "paris", id = list_datasets[[1]]$id)
dts
## FODRDataset object
## --------------------------------------------------------------------
## Dataset id: postes-publics-des-bibliotheques-de-pret
## Theme: Culture
## Keywords: postes publics, ordinateurs, catalogues en ligne, bibliothèques, prêt
## Publisher: Service Informatique des Bibliothèques
## --------------------------------------------------------------------
## Number of records: 230
## Number of files: 0
## Modified: 2014-07-30
## Facets: etablissement, type_de_poste_public
## Sortables: nombre_d_ordinateurs
## --------------------------------------------------------------------
Here again, the print method has been overloaded to show useful information about the dataset:
- number of records,
- number of files,
- keywords,
- facets: columns you can filter on,
- sortables: columns you can sort on
Fetching data
There are two methods to fetch data from a dataset: get_records and get_attachments.
The get_records method
The get_records method leverages the Records download API and again fetches all available records instead of the first 10.
dfRecords <- dts$get_records()
dfRecords
## Source: local data frame [230 x 4]
##
## type_de_poste_public nombre_d_ordinateurs etablissement
## <chr> <int> <chr>
## 1 Catalogue 3 Aimé Césaire
## 2 Catalogue 2 Amélie
## 3 Poste de passage adulte 2 Amélie
## 4 Poste jeunesse 2 Amélie
## 5 Poste jeunesse 3 Andrée Chedid
## 6 Poste adulte 3 Batignolles
## 7 Poste de passage adulte 3 Buffon
## 8 Catalogue 2 Charlotte Delbo
## 9 Catalogue 8 Robert Sabatier
## 10 Poste adulte 8 Robert Sabatier
## .. ... ... ...
## Variables not shown: geom_x_y <list>.
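For readers curious about what "all available records instead of the first 10" could look like against the raw API, here is a rough sketch that reuses the two-step approach shown earlier for the list of portals: read nhits first, then ask for that many rows. The helpers records_url and fetch_all_records are hypothetical and are not fodr's actual implementation; the sketch also assumes that every portal exposes the same /api/records/1.0/search/ path as public.opendatasoft.com. A portal may cap the number of rows returned by a single request, in which case paging would be needed.

```r
# Build a records-search URL for an OpenDataSoft portal.
# `records_url` is a hypothetical helper, not part of fodr.
records_url <- function(base_url, dataset, rows) {
  paste0(
    base_url, "/api/records/1.0/search/",
    "?dataset=", utils::URLencode(dataset, reserved = TRUE),
    "&rows=", as.integer(rows)
  )
}

# Two-step fetch: read the hit count, then request that many rows.
# Needs network access and the jsonlite package, so it is only defined here
# and never called in this sketch.
fetch_all_records <- function(base_url, dataset) {
  nhits <- jsonlite::fromJSON(records_url(base_url, dataset, rows = 1))$nhits
  jsonlite::fromJSON(records_url(base_url, dataset, rows = nhits))$records
}

# Offline check of the URL that would be requested.
records_url("http://opendata.paris.fr", "postes-publics-des-bibliotheques-de-pret", rows = 10)
```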
refine and exclude
The refine and exclude arguments are used to filter results on facets:
dts$get_records(refine = list(etablissement = "Amélie"))
## Source: local data frame [5 x 4]
##
## type_de_poste_public nombre_d_ordinateurs etablissement
## <chr> <int> <chr>
## 1 Catalogue 2 Amélie
## 2 Poste de passage adulte 2 Amélie
## 3 Poste jeunesse 2 Amélie
## 4 Poste adulte 3 Amélie
## 5 Poste de passage jeunesse 1 Amélie
## Variables not shown: geom_x_y <list>.
dts$get_records(exclude = list(etablissement = "Amélie"))
## Source: local data frame [225 x 4]
##
## type_de_poste_public nombre_d_ordinateurs etablissement
## <chr> <int> <chr>
## 1 Catalogue 3 Aimé Césaire
## 2 Poste jeunesse 3 Andrée Chedid
## 3 Poste adulte 3 Batignolles
## 4 Poste de passage adulte 3 Buffon
## 5 Catalogue 2 Charlotte Delbo
## 6 Catalogue 8 Robert Sabatier
## 7 Poste adulte 8 Robert Sabatier
## 8 Poste jeunesse 2 Diderot
## 9 Poste de passage adulte 1 Crimée
## 10 Catalogue 1 Drouot
## .. ... ... ...
## Variables not shown: geom_x_y <list>.
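To make the semantics concrete, the same kind of filtering can be reproduced locally on a data frame. The sketch below uses a few rows copied by hand from the outputs above and a hypothetical helper, filter_facets, that is not part of fodr: refine keeps rows whose facet matches the given value, exclude drops them. The real filtering is done server-side by the API, so treat this only as a mental model.

```r
# A few rows copied by hand from the outputs above.
sample_records <- data.frame(
  type_de_poste_public = c("Poste adulte", "Poste de passage adulte",
                           "Catalogue", "Poste adulte", "Catalogue"),
  nombre_d_ordinateurs = c(3L, 3L, 8L, 8L, 1L),
  etablissement = c("Batignolles", "Buffon",
                    "Robert Sabatier", "Robert Sabatier", "Drouot"),
  stringsAsFactors = FALSE
)

# `filter_facets` is a hypothetical local helper, not a fodr function.
filter_facets <- function(df, refine = list(), exclude = list()) {
  keep <- rep(TRUE, nrow(df))
  # refine: keep only rows whose facet value matches
  for (facet in names(refine)) {
    keep <- keep & df[[facet]] %in% refine[[facet]]
  }
  # exclude: drop rows whose facet value matches
  for (facet in names(exclude)) {
    keep <- keep & !(df[[facet]] %in% exclude[[facet]])
  }
  df[keep, , drop = FALSE]
}

refined  <- filter_facets(sample_records, refine = list(etablissement = "Robert Sabatier"))
excluded <- filter_facets(sample_records, exclude = list(etablissement = "Robert Sabatier"))
refined
excluded
```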
sort
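The sort argument is used to sort results on sortables. As the two outputs below show, a bare column name sorts in descending order, while prefixing the name with a minus sign sorts in ascending order: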
dts$get_records(sort = "nombre_d_ordinateurs")
## Source: local data frame [230 x 4]
##
## type_de_poste_public nombre_d_ordinateurs etablissement
## <chr> <int> <chr>
## 1 Poste adulte 36 Marguerite Duras
## 2 Poste adulte 14 Marguerite Yourcenar
## 3 Catalogue 13 Marguerite Yourcenar
## 4 Poste adulte 13 André Malraux
## 5 Poste adulte 12 Jacqueline de Romilly
## 6 Poste adulte 12 Vaclav Havel
## 7 Poste adulte 12 Hélène Berr
## 8 Poste adulte 12 Edmond Rostand
## 9 Poste adulte 11 Andrée Chedid
## 10 Poste adulte 11 Buffon
## .. ... ... ...
## Variables not shown: geom_x_y <list>.
dts$get_records(sort = "-nombre_d_ordinateurs")
## Source: local data frame [230 x 4]
##
## type_de_poste_public nombre_d_ordinateurs etablissement
## <chr> <int> <chr>
## 1 Poste de passage adulte 1 Crimée
## 2 Catalogue 1 Drouot
## 3 Catalogue 1 Europe
## 4 Poste de passage jeunesse 1 Hélène Berr
## 5 Poste jeunesse 1 Lancry
## 6 Poste de passage adulte 1 L'Heure Joyeuse
## 7 Catalogue 1 Maurice Genevoix
## 8 Poste de passage adulte 1 Maurice Genevoix
## 9 Poste de passage adulte 1 MMP
## 10 Catalogue 1 Mortier
## .. ... ... ...
## Variables not shown: geom_x_y <list>.
q
The q argument is used to perform full-text search:
dts$get_records(q = "GENEVOIX")
## Source: local data frame [4 x 4]
##
## type_de_poste_public nombre_d_ordinateurs etablissement
## <chr> <int> <chr>
## 1 Poste adulte 1 Maurice Genevoix
## 2 Poste jeunesse 3 Maurice Genevoix
## 3 Catalogue 1 Maurice Genevoix
## 4 Poste de passage adulte 1 Maurice Genevoix
## Variables not shown: geom_x_y <list>.
geofilter.distance and geofilter.polygon
These are used to filter results based on their location:
- geofilter.distance takes a numeric vector of three elements: the latitude and longitude of the center of a circle, followed by the radius of the circle (in meters),
- geofilter.polygon takes a data.frame with two columns named lat and lon, whose rows are the vertices of the polygon that results must fall within
dfFilterDistance <- dts$get_records(geofilter.distance = c(48.8580602, 2.3089956, 1000))
dfFilterDistance
## Source: local data frame [10 x 5]
##
## type_de_poste_public nombre_d_ordinateurs etablissement dist
## <chr> <int> <chr> <chr>
## 1 Catalogue 2 Amélie 0
## 2 Poste de passage adulte 2 Amélie 0
## 3 Poste jeunesse 2 Amélie 0
## 4 Poste adulte 3 Amélie 0
## 5 Poste de passage jeunesse 1 Amélie 0
## 6 Poste de passage adulte 1 Saint Simon 817
## 7 Poste adulte 3 Saint Simon 817
## 8 Poste jeunesse 2 Saint Simon 817
## 9 Catalogue 4 Saint Simon 817
## 10 Poste de passage jeunesse 1 Saint Simon 817
## Variables not shown: geom_x_y <list>.
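As a rough mental model of the distance filter, the sketch below computes great-circle distances locally with the haversine formula and keeps the points that fall inside the circle. It reads the vector the same way as the call above, where 48.8580602 is the latitude and 2.3089956 the longitude. The helper haversine_m and the two test points are made up for illustration; how the API actually computes distances is not documented here and may differ.

```r
# Great-circle distance in metres between points given in decimal degrees.
# `haversine_m` is a hypothetical helper; the API's own formula may differ.
haversine_m <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Same layout as the call above: latitude, longitude, radius in metres.
geofilter <- c(48.8580602, 2.3089956, 1000)

# Made-up points, for illustration only.
points <- data.frame(
  label = c("at the centre", "0.02 degrees north"),
  lat = c(48.8580602, 48.8780602),
  lon = c(2.3089956, 2.3089956),
  stringsAsFactors = FALSE
)

# Distance of each point from the centre, then keep those inside the circle.
points$dist <- haversine_m(geofilter[1], geofilter[2], points$lat, points$lon)
inside <- points[points$dist <= geofilter[3], ]
inside
```

The second point sits roughly 2.2 km north of the centre, so only the first one survives a 1000 m filter.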
geofilter.polygon <- data.frame(lat = c(48.883086, 48.979022, 48.883651),
                                lon = c(2.379072, 2.379930, 2.386968))
dfFilterPolygon <- dts$get_records(geofilter.polygon = geofilter.polygon)
dfFilterPolygon
## Source: local data frame [4 x 4]
##
## type_de_poste_public nombre_d_ordinateurs etablissement
## <chr> <int> <chr>
## 1 Poste de passage adulte 1 Crimée
## 2 Poste jeunesse 4 Crimée
## 3 Catalogue 1 Crimée
## 4 Poste adulte 1 Crimée
## Variables not shown: geom_x_y <list>.
Use in leaflet
If a dataset has geographical information, it will be either in the geom_x_y or in the geom column of the data field. Here is an example of how to use the geom_x_y field in a leaflet map.
positions <- dfRecords$geom_x_y %>%
  purrr::transpose() %>%
  lapply(unlist) %>%
  as.data.frame()
positions$etablissement <- dfRecords$etablissement
positions <- positions %>%
  distinct(x, .keep_all = TRUE) %>%
  rename(lng = x, lat = y)
m <- leaflet(positions, width = "900px") %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(clusterOptions = markerClusterOptions(), popup = ~etablissement)
m
Future work
I plan on improving the package in order to access other types of portals. If you have any suggestions or encounter any problems, let me know through the package’s GitHub page.