Introducing fodr: a package for French open data in R

Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

Source: Wikipedia. Post image from Rude Baguette.

Nowadays, more and more government organisations are joining the open data movement, including several in France, in the hope that new services or insights will come from the analysis of this data.

Recently, due to the ever-increasing number of open data portals, the Open Data Inception portal was created by the OpenDataSoft company. The aim of this portal is to provide a comprehensive list of open data portals around the world.

As it happens, many of the French open data portals are available through the OpenDataSoft Open Data platform and thus share a common API.

Some of the datasets on these portals caught my eye recently, and I downloaded a few to tinker with in my spare time. I quickly realised that, even though the API is well designed, the process is cumbersome if you want to use the data in R: you have to go to a specific portal, find the dataset you are looking for, download it to your disk, and then go back to R to import it.

I thought someone would already have created a package to simplify this, but I was mistaken. Thus, fodr was born.

The package

Getting the package

The package is hosted on GitHub and can be installed via devtools:

devtools::install_github("tutuchan/fodr")

Available portals

The list of currently available portals can be accessed with the list_portals() function:

library(fodr)
library(dplyr)
library(leaflet)
list_portals() %>% 
  select(portals, base_urls)
## Source: local data frame [13 x 2]
## 
##        portals                                     base_urls
##          <chr>                                         <chr>
## 1         ratp                           http://data.ratp.fr
## 2  iledefrance                    http://data.iledefrance.fr
## 3   infogreffe                      http://datainfogreffe.fr
## 4     toulouse            https://data.toulouse-metropole.fr
## 5         star                  https://data.explore.star.fr
## 6         issy                          http://data.issy.com
## 7         stif                     http://opendata.stif.info
## 8        paris                      http://opendata.paris.fr
## 9           04            http://tourisme04.opendatasoft.com
## 10          62            http://tourisme62.opendatasoft.com
## 11          92            https://opendata.hauts-de-seine.fr
## 12       enesr http://data.enseignementsup-recherche.gouv.fr
## 13        erdf                          https://data.erdf.fr

At the time of writing, only 13 portals are available through fodr, for several reasons.

Note from 2016/05/31 19:24: Joel Gombin graciously pointed out to me that the data.gouv DOES have an API (available here).

To obtain this list, I took the Open Data Inception dataset, called a method of the OpenDataSoft API on each listed portal to see whether it returned any results, and then curated the remaining portals. I'm pretty sure there is a better way to do it:

library(jsonlite)
library(curl)

# French entries of the Open Data Inception dataset
base_query <- "http://public.opendatasoft.com/api/records/1.0/search/?dataset=open-data-sources&facet=country&refine.country=France"

# Ask for one row to read the total number of hits, then request them all
number_of_datasets <- fromJSON(paste0(base_query, "&rows=1"))$nhits
listDatasets <- fromJSON(paste0(base_query, "&rows=", number_of_datasets))
fields <- listDatasets$records$fields

# TRUE if a portal answers the catalog API with a hit count; safely() captures errors
has_catalog_api <- purrr::safely(function(url) {
  "nhits" %in% names(fromJSON(paste0(url, "/api/datasets/1.0/search/")))
})

d <- lapply(seq_len(number_of_datasets), function(i) {
  if ("Arcgis" %in% fields$description[i]) return(NULL)
  res <- has_catalog_api(fields$url[i])
  # Keep the portal only when the call succeeded and returned a hit count
  if (is.null(res$error) && isTRUE(res$result)) dplyr::tibble(name = fields$name[i], url = fields$url[i]) else NULL
})

df <- dplyr::bind_rows(d)

Structure

I created two classes using the R6 package: FODRPortal and FODRDataset. I also created two wrappers around these classes to skip the <class>$new(...) syntax that I’m not a fan of:

# Instantiate a FODRPortal
fodr_portal(portal)

# Instantiate a FODRDataset
fodr_dataset(portal, id)
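
The wrapper pattern can be sketched with a toy class. This is only an illustration, not fodr's actual code: DemoPortal, demo_portal and their fields are hypothetical names, and the only things taken for granted are the R6 package's R6Class() generator and the fact that R6 calls a class's own print method when the object is printed.

```r
library(R6)

# Hypothetical stand-in for a portal class; not fodr's implementation
DemoPortal <- R6Class(
  "DemoPortal",
  public = list(
    portal = NULL,
    themes = NULL,
    initialize = function(portal, themes = character(0)) {
      self$portal <- portal
      self$themes <- themes
    },
    # R6 dispatches to this method whenever the object is printed
    print = function(...) {
      cat("DemoPortal object\n")
      cat("Portal:", self$portal, "\n")
      cat("Themes:", paste(self$themes, collapse = ", "), "\n")
      invisible(self)
    }
  )
)

# Plain-function wrapper that hides the DemoPortal$new(...) call
demo_portal <- function(portal, themes = character(0)) {
  DemoPortal$new(portal = portal, themes = themes)
}

p <- demo_portal("paris", themes = c("Culture", "Finances"))
p
```

Callers then work with an ordinary function and never see the generator object, which is the whole point of the wrapper.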

Portals

Instantiation

Access to a portal is granted through the fodr_portal function:

portal <- fodr_portal("paris")
portal
## FODRPortal object
## --------------------------------------------------------------------
## Portal: paris 
## Number of datasets: 175 
## Themes:
##   - Administration
##   - Citoyens
##   - Commerces
##   - Culture
##   - Déplacements
##   - Environnement
##   - Finances
##   - Services
##   - Urbanisme 
## --------------------------------------------------------------------

The print method has been overloaded to show useful information about the portal. Here, it shows that this portal has 175 datasets divided among several themes.

Fetching data - the search method

For a FODRPortal, there is only one method: search. It leverages the OpenDataSoft Catalog API to find relevant datasets. By default, the search method retrieves all datasets that satisfy the query, unlike the OpenDataSoft API itself, which returns only the first 10.

This function takes several arguments, which are detailed in the OpenDataSoft documentation. I also added another argument, theme, which returns all datasets that fall under a specific theme.

list_datasets <- portal$search(theme = "Culture")
## 24 datasets found ...
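
One way to get every match rather than the first 10 is the two-step approach used earlier in this post for the portal list: read the total from nhits, then ask for that many rows. The sketch below only builds the URLs and makes no network call. catalog_search_url is a hypothetical helper, and filtering a theme through refine.theme is an assumption about the catalog API, not something confirmed by fodr's source.

```r
# Hypothetical helper: build a catalog search URL for an OpenDataSoft portal
catalog_search_url <- function(base_url, rows = 10, theme = NULL) {
  params <- c(rows = format(rows, scientific = FALSE))
  if (!is.null(theme)) {
    # Assumption: themes are filtered with the refine.<facet> convention
    params <- c(params, refine.theme = utils::URLencode(theme, reserved = TRUE))
  }
  query <- paste0(names(params), "=", params, collapse = "&")
  paste0(base_url, "/api/datasets/1.0/search/?", query)
}

# Step 1: a one-row request, whose response would carry the total in nhits
count_url <- catalog_search_url("http://opendata.paris.fr", rows = 1, theme = "Culture")
# Step 2: reuse that total as rows; 24 is the count reported above
all_url <- catalog_search_url("http://opendata.paris.fr", rows = 24, theme = "Culture")
all_url
```

Passing either URL to jsonlite::fromJSON() would perform the actual request.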

Datasets obtained by this method are directly returned as a list but also stored in the data field of the portal object:

list_datasets[[1]]
## FODRDataset object
## --------------------------------------------------------------------
## Dataset id: postes-publics-des-bibliotheques-de-pret 
## Theme: Culture 
## Keywords: postes publics, ordinateurs, catalogues en ligne, bibliothèques, prêt 
## Publisher: Service Informatique des Bibliothèques 
## --------------------------------------------------------------------
## Number of records: 230 
## Number of files: 0 
## Modified: 2014-07-30 
## Facets: etablissement, type_de_poste_public 
## Sortables: nombre_d_ordinateurs 
## --------------------------------------------------------------------

Datasets

Instantiation

As shown just before, datasets can be retrieved either from a call to portal$search or, if you know the portal and id, by a call to fodr_dataset:

dts <- fodr_dataset(portal = "paris", id = list_datasets[[1]]$id)
dts
## FODRDataset object
## --------------------------------------------------------------------
## Dataset id: postes-publics-des-bibliotheques-de-pret 
## Theme: Culture 
## Keywords: postes publics, ordinateurs, catalogues en ligne, bibliothèques, prêt 
## Publisher: Service Informatique des Bibliothèques 
## --------------------------------------------------------------------
## Number of records: 230 
## Number of files: 0 
## Modified: 2014-07-30 
## Facets: etablissement, type_de_poste_public 
## Sortables: nombre_d_ordinateurs 
## --------------------------------------------------------------------

Again, the print method has been overloaded to show useful information about the dataset:

  • number of records,
  • number of files,
  • keywords,
  • facets: columns you can filter on,
  • sortables: columns you can sort on

Fetching data

There are two methods to fetch data from a dataset: get_records and get_attachments.

The get_records method

The get_records method leverages the Records download API and again fetches all available records instead of the first 10.

dfRecords <- dts$get_records()
dfRecords
## Source: local data frame [230 x 4]
## 
##       type_de_poste_public nombre_d_ordinateurs   etablissement
##                      <chr>                <int>           <chr>
## 1                Catalogue                    3    Aimé Césaire
## 2                Catalogue                    2          Amélie
## 3  Poste de passage adulte                    2          Amélie
## 4           Poste jeunesse                    2          Amélie
## 5           Poste jeunesse                    3   Andrée Chedid
## 6             Poste adulte                    3     Batignolles
## 7  Poste de passage adulte                    3          Buffon
## 8                Catalogue                    2 Charlotte Delbo
## 9                Catalogue                    8 Robert Sabatier
## 10            Poste adulte                    8 Robert Sabatier
## ..                     ...                  ...             ...
## Variables not shown: geom_x_y <list>.

refine and exclude

The refine and exclude arguments are used to filter results on facets:

dts$get_records(refine = list(etablissement = "Amélie"))
## Source: local data frame [5 x 4]
## 
##        type_de_poste_public nombre_d_ordinateurs etablissement
##                       <chr>                <int>         <chr>
## 1                 Catalogue                    2        Amélie
## 2   Poste de passage adulte                    2        Amélie
## 3            Poste jeunesse                    2        Amélie
## 4              Poste adulte                    3        Amélie
## 5 Poste de passage jeunesse                    1        Amélie
## Variables not shown: geom_x_y <list>.
dts$get_records(exclude = list(etablissement = "Amélie"))
## Source: local data frame [225 x 4]
## 
##       type_de_poste_public nombre_d_ordinateurs   etablissement
##                      <chr>                <int>           <chr>
## 1                Catalogue                    3    Aimé Césaire
## 2           Poste jeunesse                    3   Andrée Chedid
## 3             Poste adulte                    3     Batignolles
## 4  Poste de passage adulte                    3          Buffon
## 5                Catalogue                    2 Charlotte Delbo
## 6                Catalogue                    8 Robert Sabatier
## 7             Poste adulte                    8 Robert Sabatier
## 8           Poste jeunesse                    2         Diderot
## 9  Poste de passage adulte                    1          Crimée
## 10               Catalogue                    1          Drouot
## ..                     ...                  ...             ...
## Variables not shown: geom_x_y <list>.
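
For readers curious about what such filters might look like on the wire, here is a minimal sketch that turns the named lists into refine.<facet> and exclude.<facet> query parameters, the convention used by the OpenDataSoft records API. facet_params is a hypothetical helper, and whether fodr serialises its arguments exactly this way is an assumption.

```r
# Hypothetical helper: turn named lists of facet filters into query parameters
facet_params <- function(refine = list(), exclude = list()) {
  encode <- function(prefix, filters) {
    if (length(filters) == 0) return(character(0))
    # Percent-encode each value so spaces and accents survive in a URL
    values <- vapply(
      filters,
      function(v) utils::URLencode(as.character(v), reserved = TRUE),
      character(1)
    )
    paste0(prefix, ".", names(filters), "=", values)
  }
  paste(c(encode("refine", refine), encode("exclude", exclude)), collapse = "&")
}

facet_params(
  refine = list(type_de_poste_public = "Catalogue"),
  exclude = list(etablissement = "Buffon")
)
```

Both facets used here (etablissement and type_de_poste_public) are the ones listed under Facets in the dataset summary.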

sort

The sort argument is used to sort results on sortables:

dts$get_records(sort = "nombre_d_ordinateurs")
## Source: local data frame [230 x 4]
## 
##    type_de_poste_public nombre_d_ordinateurs         etablissement
##                   <chr>                <int>                 <chr>
## 1          Poste adulte                   36      Marguerite Duras
## 2          Poste adulte                   14  Marguerite Yourcenar
## 3             Catalogue                   13  Marguerite Yourcenar
## 4          Poste adulte                   13         André Malraux
## 5          Poste adulte                   12 Jacqueline de Romilly
## 6          Poste adulte                   12          Vaclav Havel
## 7          Poste adulte                   12           Hélène Berr
## 8          Poste adulte                   12        Edmond Rostand
## 9          Poste adulte                   11         Andrée Chedid
## 10         Poste adulte                   11                Buffon
## ..                  ...                  ...                   ...
## Variables not shown: geom_x_y <list>.
dts$get_records(sort = "-nombre_d_ordinateurs")
## Source: local data frame [230 x 4]
## 
##         type_de_poste_public nombre_d_ordinateurs    etablissement
##                        <chr>                <int>            <chr>
## 1    Poste de passage adulte                    1           Crimée
## 2                  Catalogue                    1           Drouot
## 3                  Catalogue                    1           Europe
## 4  Poste de passage jeunesse                    1      Hélène Berr
## 5             Poste jeunesse                    1           Lancry
## 6    Poste de passage adulte                    1  L'Heure Joyeuse
## 7                  Catalogue                    1 Maurice Genevoix
## 8    Poste de passage adulte                    1 Maurice Genevoix
## 9    Poste de passage adulte                    1              MMP
## 10                 Catalogue                    1          Mortier
## ..                       ...                  ...              ...
## Variables not shown: geom_x_y <list>.
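
Note the direction in the two outputs above: a bare field name puts the largest values first, while a leading minus sign puts the smallest first. The sketch below reproduces that convention on a small local data frame, using three rows taken from the outputs above. ods_sort is a hypothetical helper written for this illustration and is not part of fodr.

```r
# Hypothetical helper mimicking the sort convention on a local data frame
ods_sort <- function(df, sort) {
  ascending <- startsWith(sort, "-")   # "-field" means smallest first
  column <- sub("^-", "", sort)        # strip the optional leading minus
  df[order(df[[column]], decreasing = !ascending), , drop = FALSE]
}

postes <- data.frame(
  etablissement = c("Drouot", "Marguerite Duras", "Buffon"),
  nombre_d_ordinateurs = c(1L, 36L, 11L),
  stringsAsFactors = FALSE
)

largest_first <- ods_sort(postes, "nombre_d_ordinateurs")
smallest_first <- ods_sort(postes, "-nombre_d_ordinateurs")
largest_first$etablissement
smallest_first$etablissement
```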

q

The q argument is used to perform full-text search:

dts$get_records(q = "GENEVOIX")
## Source: local data frame [4 x 4]
## 
##      type_de_poste_public nombre_d_ordinateurs    etablissement
##                     <chr>                <int>            <chr>
## 1            Poste adulte                    1 Maurice Genevoix
## 2          Poste jeunesse                    3 Maurice Genevoix
## 3               Catalogue                    1 Maurice Genevoix
## 4 Poste de passage adulte                    1 Maurice Genevoix
## Variables not shown: geom_x_y <list>.

geofilter.distance and geofilter.polygon

These are used to filter results based on their location:

  • geofilter.distance takes a numeric vector of three elements: the latitude and longitude of the center of a circle, followed by the radius of the circle (in meters)
  • geofilter.polygon takes a data.frame with two columns named lat and lon, whose rows are the vertices of the polygon that results must fall within

dfFilterDistance <- dts$get_records(geofilter.distance = c(48.8580602, 2.3089956, 1000))
dfFilterDistance
## Source: local data frame [10 x 5]
## 
##         type_de_poste_public nombre_d_ordinateurs etablissement  dist
##                        <chr>                <int>         <chr> <chr>
## 1                  Catalogue                    2        Amélie     0
## 2    Poste de passage adulte                    2        Amélie     0
## 3             Poste jeunesse                    2        Amélie     0
## 4               Poste adulte                    3        Amélie     0
## 5  Poste de passage jeunesse                    1        Amélie     0
## 6    Poste de passage adulte                    1   Saint Simon   817
## 7               Poste adulte                    3   Saint Simon   817
## 8             Poste jeunesse                    2   Saint Simon   817
## 9                  Catalogue                    4   Saint Simon   817
## 10 Poste de passage jeunesse                    1   Saint Simon   817
## Variables not shown: geom_x_y <list>.
geofilter.polygon = data.frame(lat = c(48.883086, 48.979022, 48.883651) , 
                               lon = c(2.379072, 2.379930, 2.386968))
dfFilterPolygon <- dts$get_records(geofilter.polygon = geofilter.polygon)
dfFilterPolygon
## Source: local data frame [4 x 4]
## 
##      type_de_poste_public nombre_d_ordinateurs etablissement
##                     <chr>                <int>         <chr>
## 1 Poste de passage adulte                    1        Crimée
## 2          Poste jeunesse                    4        Crimée
## 3               Catalogue                    1        Crimée
## 4            Poste adulte                    1        Crimée
## Variables not shown: geom_x_y <list>.
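
As a rough sketch of how these two arguments could be turned into query-string values, the helpers below produce the "lat,lon,radius" and "(lat,lon),(lat,lon),..." shapes that, as far as I recall, the OpenDataSoft records API expects. distance_param, polygon_param and num are hypothetical names, and it is an assumption that fodr formats its arguments this way internally.

```r
# Format one number without scientific notation (100000 rather than 1e+05)
num <- function(x) format(x, scientific = FALSE, digits = 15, trim = TRUE)

# Hypothetical helper: c(lat, lon, radius_in_meters) -> "lat,lon,radius"
distance_param <- function(x) {
  stopifnot(is.numeric(x), length(x) == 3)
  paste(vapply(x, num, character(1)), collapse = ",")
}

# Hypothetical helper: data.frame(lat, lon) -> "(lat,lon),(lat,lon),..."
polygon_param <- function(df) {
  stopifnot(all(c("lat", "lon") %in% names(df)), nrow(df) >= 3)
  points <- paste0(
    "(", vapply(df$lat, num, character(1)),
    ",", vapply(df$lon, num, character(1)), ")"
  )
  paste(points, collapse = ",")
}

distance_value <- distance_param(c(48.8580602, 2.3089956, 1000))
polygon_value <- polygon_param(data.frame(
  lat = c(48.883086, 48.979022, 48.883651),
  lon = c(2.379072, 2.379930, 2.386968)
))
distance_value
polygon_value
```

Formatting each number separately avoids the scientific notation R would otherwise use for large radii, which a web API is unlikely to parse.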

Use in leaflet

If a dataset has geographical information, it will be either in the geom_x_y or in the geom column of the data field. Here is an example of how to use the geom_x_y field in a leaflet map.

positions <- dfRecords$geom_x_y %>% 
  purrr::transpose() %>% 
  lapply(unlist) %>% 
  as.data.frame
positions$etablissement = dfRecords$etablissement

positions <- positions %>% 
  distinct(x, .keep_all = TRUE) %>% 
  rename(lng = x, lat = y)
m <- leaflet(positions, width = "900px") %>% 
  addProviderTiles("CartoDB.Positron") %>% 
  addMarkers(clusterOptions = markerClusterOptions(), popup = ~etablissement)
m

Future work

I plan to improve the package so that it can access other types of portals. If you have any suggestions or run into any problems, let me know through the package's GitHub page.
