This plugin uses the ESRI® ArcGIS Online® API and allows DSS users to:
- Geocode postal addresses (obtain geographic coordinates) based on the complete address line or on individual address components
- Enrich data with a large set of data collections from 120 countries. Enrichment can be based on XY coordinates or on named areas (such as postcodes)
This plugin is a perfect companion for users who want to enrich their dataset for analysis or for feature engineering.
This plugin requires an ArcGIS Online account. Users can buy credits directly from ArcGIS Online.
Plugin Information
Version | 0.1.7 |
---|---|
Author | Dataiku (Nicolas Gakrelidz) |
Released | 2016-05-04 |
Last updated | 2016-10-04 |
License | Apache Software License |
Source code | GitHub |
Reporting issues | GitHub |
How To Use
First of all, open an ArcGIS Online account at https://www.arcgis.com/home/signin.html
Depending on the use case:
- Geocode your postal addresses by adding a new esri-geo-enrichment geocoding recipe to your project
- Enrich your dataset containing XY coordinates or statistical named areas:
  - In both cases, create a recipe named “get content catalog for countries” and set the country or list of countries matching the input data. If you want to get the entire set of data collections, add the dataset “enrichment API coverage” (no API call required)
  - For enrichment of XY coordinates:
    - set the columns corresponding to the input dataset content
    - choose the data collections used to enrich the data
    - check the advanced configuration to save the related geometry and to set the batch size, i.e. the number of XY coordinates pushed to the API per call
  - For enrichment based on named areas:
    - set the columns corresponding to the input dataset content
    - choose the data collections used to enrich the data
  - In both cases, checking “Add derivative variables” adds all the percentages, averages, etc. of the requested data collections. This option may add a large number of columns to the output dataset.
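The batch size setting mentioned above controls how many XY coordinates are sent in each API call. The following is a minimal sketch of that kind of batching, not the plugin's actual code; the `batches` helper and the sample points are hypothetical and only illustrate the idea.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[float, float]  # (x, y), i.e. (longitude, latitude)


def batches(points: Iterable[Point], batch_size: int) -> Iterator[List[Point]]:
    """Yield consecutive lists of at most batch_size points."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(points)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Hypothetical sample coordinates, used only to show the grouping.
points = [(2.35, 48.85), (-0.12, 51.50), (13.40, 52.52), (-3.70, 40.41), (12.49, 41.90)]

for number, chunk in enumerate(batches(points, batch_size=2), start=1):
    # In a real run, each chunk would correspond to one call to the API.
    print(number, len(chunk))
```

A larger batch size means fewer calls, but a single failed call then affects more rows, so it is worth checking the batch log after a run.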
Additional information
This plugin calls the ArcGIS Online API, so you need an ArcGIS Online account. You may want to check the cost of each API call, which differs depending on the feature used (geocoding, geo enrichment, getting the data collections). Note that this plugin is developed for data storage use cases.
The API only supports numerical identifiers (object IDs).
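If your input rows are keyed by text values, one way to satisfy the numerical identifier constraint is to assign an integer ID to each distinct key and keep a lookup to map results back. The sketch below assumes a hypothetical `store_codes` column and a hypothetical `assign_object_ids` helper; it does not describe how the plugin handles IDs internally.

```python
from typing import Dict, Hashable, List, Tuple


def assign_object_ids(keys: List[Hashable]) -> Tuple[List[int], Dict[int, Hashable]]:
    """Map each distinct key to an integer ID; return per-row IDs and a reverse lookup."""
    id_by_key: Dict[Hashable, int] = {}
    row_ids: List[int] = []
    for key in keys:
        if key not in id_by_key:
            id_by_key[key] = len(id_by_key) + 1  # IDs start at 1
        row_ids.append(id_by_key[key])
    key_by_id = {object_id: key for key, object_id in id_by_key.items()}
    return row_ids, key_by_id


# Hypothetical text keys from an input dataset.
store_codes = ["PAR-01", "LON-07", "PAR-01", "BER-03"]
row_ids, key_by_id = assign_object_ids(store_codes)

# row_ids can be used as the identifier column; key_by_id restores the original keys.
print(row_ids)
print(key_by_id)
```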
Country names should be given in ISO format (they can be obtained from the geocoding recipe or from the dataset named “Show enrichment API coverage”). The country is required for enrichment. For geocoding, it is recommended because it improves the precision of the returned results.
Practical recommendations
- Dataiku DSS doesn’t automatically back up your data. As the data acquired by this plugin has a cost, we recommend that you regularly back up the data collected by the plugin. An option is available in the enrichment recipes to export the collected data into the tmp folder of your DSS data dir.
- You may want to remove duplicated data in the input dataset before running the enrichment (geocoding or geo enrichment) to avoid N calls to the API for the same data. After the enrichment on unique input data, you may join your original data with the output dataset.
- Missing values in the input dataset are not submitted to the API.
- When performing enrichment for several countries, please note that the data collections differ (in name and content) from country to country. Thus, a cross-country enrichment may generate a huge number of columns. You may choose either to “generate the output as key, value”, which can then be processed with a preparation script, or to create one enrichment recipe per country.
- For an enrichment at a specific statistical named level (e.g. postcode), you may try different settings for the data collection level name to match before enriching a large dataset. For instance, if you want to enrich data containing UK postcodes, first create a “get content catalog for countries” recipe and look at the output dataset to find the required layer_id. At that point, it’s not easy to choose between GB.PostcodeSectors, GB.PostcodeDistricts or GB.PostcodeAreas; the right choice may depend on your input data. We therefore recommend that you first create a small sample from the input data to check which layer matches. NB: the input postcode must be written in the right format for each layer. For example, for the layer_id GB.PostcodeSectors, the postcode DL12 8UN should be formatted as DL12 8. The DSS visual Prepare recipe can help with this.
- The dataset “Show enrichment API coverage” is based on the country list available on the ESRI® API website as of 2016-02-17. If new countries become supported by the API, the plugin will be updated.
- For both geocoding and enrichment, the plugin provides a log of each batch pushed to the API for each run, so you can see which data was processed successfully and which ended in error. You may want to use the log dataset in “Append” mode (in the Inputs/Outputs tab of the recipe settings).
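The recommendations above on postcode formatting, missing values and duplicates can be combined into one preparation step before the enrichment recipe. The following is a minimal sketch in plain Python, assuming hypothetical column names (`customer`, `postcode`, `sector`) and a placeholder in place of the real enrichment output; only the DL12 8UN to DL12 8 sector format comes from the text above.

```python
from typing import Dict, List, Optional


def to_postcode_sector(postcode: Optional[str]) -> Optional[str]:
    """Turn a full UK postcode such as 'DL12 8UN' into its sector, 'DL12 8'."""
    if postcode is None:
        return None
    compact = "".join(postcode.split()).upper()
    # A full UK postcode has 5 to 7 characters once spaces are removed.
    if not 5 <= len(compact) <= 7:
        return None
    # The inward code is the last three characters: one digit, then two letters.
    outward, inward = compact[:-3], compact[-3:]
    if not inward[0].isdigit():
        return None
    return f"{outward} {inward[0]}"


# Hypothetical input rows.
rows: List[dict] = [
    {"customer": "a", "postcode": "DL12 8UN"},
    {"customer": "b", "postcode": "dl128ab"},
    {"customer": "c", "postcode": None},
    {"customer": "d", "postcode": "M1 1AE"},
]

# 1. Format the key expected by the sector layer.
for row in rows:
    row["sector"] = to_postcode_sector(row["postcode"])

# 2. Drop missing values and duplicates so each sector is requested only once.
unique_sectors = sorted({row["sector"] for row in rows if row["sector"] is not None})

# Placeholder standing in for the enrichment output: one record per unique sector.
enriched: Dict[str, dict] = {sector: {"enriched": True} for sector in unique_sectors}

# 3. Join the enrichment result back onto the original rows.
joined = [{**row, **enriched.get(row["sector"], {})} for row in rows]
print(unique_sectors)
```

Here four input rows lead to only two distinct keys being submitted. In DSS, the same steps could be done with a Prepare recipe, a Distinct recipe and a Join recipe instead of code.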