Introducing fodr: a package for French open data in R
Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.
Nowadays, more and more government organisations subscribe to the open data movement, and several in France have done so, in the hope that new services or insights will come from the analysis of this data. To name a few:
Recently, due to the ever-increasing number of open data portals, the Open Data Inception portal was created by the OpenDataSoft company. The aim of this portal is to provide a comprehensive list of open data portals around the world.
As it happens, many of the French open data portals are available through the OpenDataSoft Open Data platform and thus share a common API.
Some of the datasets on these portals caught my eye recently, and I downloaded a few to tinker with in my spare time. I quickly realised that, even though the API is well designed, the process is cumbersome if you want to use the data in R: you have to go to a specific portal, find the dataset you're looking for, download it to disk, and then go back to R and import it.
I thought someone would have created a package to simplify this by now, but I was mistaken. Thus, fodr was born.
The package
Getting the package
The package is hosted on GitHub and can be installed via devtools:
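A minimal sketch of the installation step. The repository path `tutuchan/fodr` is an assumption on my part and is not stated in the text above, so check it against the package's GitHub page before running it.

```r
# install.packages("devtools")  # only needed if devtools is not installed yet

# Assumed repository path; adjust it if the package lives elsewhere
devtools::install_github("tutuchan/fodr")

library(fodr)
```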
Available portals
The list of currently available portals can be accessed with the list_portals() function:
At the time of writing, only 13 portals are available through fodr, for several reasons:
fodr only handles the OpenDataSoft API and many of the French open data portals use the ArcGIS Open platform,
To obtain this list, I looked at the Open Data Inception dataset, tried to access a method on the API to see if it returned any results, and curated the remaining datasets. I'm pretty sure there is a better way to do it:
Structure
I created two classes using the R6 package: FODRPortal and FODRDataset. I also created two wrappers around these classes to skip the <class>$new(...) syntax that I’m not a fan of:
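To illustrate the wrapper pattern described above, here is a minimal sketch built on a made-up class. `DemoPortal` and `demo_portal` are hypothetical names used only for this example; this is not fodr's actual code.

```r
library(R6)

# Hypothetical class, only here to illustrate the pattern
DemoPortal <- R6Class(
  "DemoPortal",
  public = list(
    name = NULL,
    initialize = function(name) {
      self$name <- name
    },
    # R6 uses a public print() method, if present, when the object is printed
    print = function(...) {
      cat("<DemoPortal> ", self$name, "\n", sep = "")
      invisible(self)
    }
  )
)

# Plain function wrapper, so callers never have to write DemoPortal$new(...)
demo_portal <- function(name) {
  DemoPortal$new(name)
}

portal <- demo_portal("paris")
print(portal)
```

The wrapper adds nothing beyond forwarding its arguments to `$new()`; its only purpose is a more familiar function-call syntax.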
Portals
Instantiation
Access to a portal is granted through the fodr_portal function:
The print method has been overloaded to show useful information about the portal. Here, it shows that this portal has 175 datasets divided among several themes.
Fetching data - the search method
For a FODRPortal, there is only one method: search. It leverages the OpenDataSoft Catalog API to find relevant datasets. By default, calling the search method retrieves all elements that satisfy the query, unlike the OpenDataSoft API, which only returns the first 10.
This function takes several arguments, which are detailed in the OpenDataSoft documentation. I also added another argument, theme, that returns all datasets falling under a specific theme.
Datasets obtained by this method are directly returned as a list but also stored in the data field of the portal object:
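A small sketch of how this could look in practice. It assumes `portal` is the object returned by `fodr_portal()`; the helper name `search_by_theme`, the portal name and the theme value in the usage comments are illustrative, not taken from a real session.

```r
# Sketch only: `portal` is assumed to be an object returned by fodr_portal()
search_by_theme <- function(portal, theme) {
  # The datasets matching the theme come back directly as a list
  found <- portal$search(theme = theme)
  # After the call, the same datasets are expected in the portal's data field
  list(returned = found, stored = portal$data)
}

# Usage, assuming fodr is installed and the portal is reachable
# (portal name and theme are illustrative values):
# portal <- fodr::fodr_portal("paris")
# res <- search_by_theme(portal, "Environnement")
# length(res$returned)
```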
Datasets
Instantiation
As shown just before, datasets can be retrieved either from a call to portal$search or, if you know the portal and id, by a call to fodr_dataset:
Here again, the print method has been overloaded to show useful information about the dataset:
number of records,
number of files,
keywords,
facets: columns you can filter on,
sortables: columns you can sort on
Fetching data
There are two methods to fetch data from a dataset: get_records and get_attachments.
The get_records method
The get_records method leverages the Records download API and again fetches all available records instead of the first 10.
refine and exclude
The refine and exclude arguments are used to filter results on facets:
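As a rough sketch, filtering on facets might look like the helper below. It assumes `dataset` is the object returned by `fodr_dataset()` and that filters are passed as named lists (facet name, then value), which is my assumption about the argument shape. The helper name `filter_records` and the facet names in the usage comments are hypothetical.

```r
# Sketch only: `dataset` is assumed to be an object returned by fodr_dataset()
filter_records <- function(dataset, keep = NULL, drop = NULL) {
  # Only forward the filters that were actually supplied
  args <- Filter(Negate(is.null), list(refine = keep, exclude = drop))
  do.call(dataset$get_records, args)
}

# Usage with hypothetical facets; use the facets shown when printing the dataset:
# tilia <- filter_records(
#   trees,
#   keep = list(genre = "Tilia"),
#   drop = list(arrondissement = "PARIS 12E ARRDT")
# )
```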
sort
The sort argument is used to sort results on sortables:
q
The q argument is used to perform full-text search:
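The sort and q arguments can be sketched the same way. This assumes `dataset` comes from `fodr_dataset()` and that both arguments accept a single character string; the helper name `search_sorted` and the values in the usage comments are hypothetical.

```r
# Sketch only: `dataset` is assumed to be an object returned by fodr_dataset()
search_sorted <- function(dataset, query, sort_field) {
  # q runs a full-text search; sort orders the results on a sortable column
  dataset$get_records(q = query, sort = sort_field)
}

# Usage with hypothetical values; pick sort_field among the dataset's sortables:
# res <- search_sorted(trees, query = "platane", sort_field = "hauteurenm")
```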
geofilter.distance and geofilter.polygon
These are used to filter results based on their location:
geofilter.distance takes a numeric vector of three elements: longitude and latitude of the center of a circle, and the radius of the circle (in meters)
geofilter.polygon takes a data.frame with two columns named lat and lon, whose rows bound the area in which results must lie
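The two argument shapes described above can be built in base R as follows. The coordinates are illustrative points around central Paris, the helper names `nearby` and `inside` are hypothetical, and the element order of the circle vector simply follows the description above, so it is worth double-checking against the OpenDataSoft documentation.

```r
# Circle: three numbers, in the order given in the description above
# (longitude, latitude, radius in meters). The coordinate order is an
# assumption to verify against the OpenDataSoft documentation.
center_lon <- 2.3522
center_lat <- 48.8566
circle <- c(center_lon, center_lat, 500)

# Polygon: one row per vertex, with columns named lat and lon
polygon <- data.frame(
  lat = c(48.85, 48.85, 48.87, 48.87),
  lon = c(2.33, 2.36, 2.36, 2.33)
)

# Sketch only: `dataset` is assumed to be an object returned by fodr_dataset()
nearby <- function(dataset, circle) {
  dataset$get_records(geofilter.distance = circle)
}
inside <- function(dataset, polygon) {
  dataset$get_records(geofilter.polygon = polygon)
}
```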
Use in leaflet
If a dataset has geographical information, it will be either in the geom_x_y or in the geom column of the data field. Here is an example of how to use the geom_x_y field in a leaflet map.
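A minimal leaflet sketch along those lines. The `records` data frame below is a made-up stand-in, not real fodr output, and it assumes the geom_x_y information has already been unpacked into numeric `lat` and `lng` columns; the actual shape of that field may differ.

```r
library(leaflet)

# Stand-in for dataset records: made-up points, NOT real fodr output.
# Assumes geom_x_y has been unpacked into numeric lat and lng columns.
records <- data.frame(
  name = c("A", "B", "C"),
  lat = c(48.8566, 48.8606, 48.8530),
  lng = c(2.3522, 2.3376, 2.3499)
)

# One circle marker per record, drawn on top of the default map tiles
map <- leaflet(records) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat, label = ~name, radius = 5)

map  # displays the widget in an interactive session
```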
Future work
I plan on improving the package in order to access other types of portals. Let me know if you have any suggestions or if you encounter any problems through the package's GitHub page.