Create IPUMS NHGIS Data Extracts

Below we provide examples in R, Python and curl showing how to work with the IPUMS API to create and manage NHGIS data extracts.

The Python and R examples are currently implemented using HTTP and JSON libraries to make and parse RESTful HTTP calls. As support for using the IPUMS API to interact with the NHGIS data collection is added to the SDKs, this documentation will be updated to utilize those libraries instead.

Get your API key from https://account.ipums.org/api_keys. In the snippets below, replace MY_KEY (all caps) with your key.

Load Libraries and Set Key

For R, you may have to install the httr and jsonlite libraries if they are not already installed.

import requests
import json
from pprint import pprint  # used below to print API responses

my_key = "MY_KEY"  # replace with your API key

library(httr)
library(jsonlite)
my_key <- "MY_KEY" # replace with your API key

export MY_KEY=MY_KEY # set the MY_KEY environment variable using bash shell

Submit a Data Extract Request

To submit a data extract request you need to pass a valid JSON-formatted extract request in the body of your POST. The names to use for values in the data extract request can be discovered via our metadata API endpoints.

Data Extract Request Fields

datasets: An object where each key is the name of the requested dataset and each value is another object describing your selections for that dataset.
- data_tables: (Required) A list of selected data table names.
- geog_levels: (Required) A list of selected geographic level names.
- years: A list of selected years. To select all years use ["*"]. Only required when the dataset has multiple years.
- breakdown_values: A list of selected breakdown values. Defaults to first breakdown value. If more than one is selected, then specify breakdown_and_data_type_layout at the root of the request body.
time_series_tables: An object where each key is the name of the requested time series table and each value is another object describing your selections for that time series table.
- geog_levels: (Required) A list of selected geographic level names.
shapefiles: A list of selected shapefiles.
description: A short description of your extract.
data_format: The requested format of your data. Valid choices are: csv_no_header, csv_header, and fixed_width. csv_header adds a second, more descriptive header row. Contrary to the name, csv_no_header still provides a minimal header in the first row. Required when any datasets or time_series_tables are selected.
breakdown_and_data_type_layout: The layout of your dataset data when multiple data types or breakdown combos are present. Valid choices are: separate_files (split up each data type or breakdown combo into its own file) and single_file (keep all datatypes and breakdown combos in one file). Required when a dataset has multiple breakdowns or data types.
time_series_table_layout: The layout of your time series table data. Valid choices are: time_by_column_layout, time_by_row_layout, and time_by_file_layout. Required when any time series tables are selected. See the NHGIS documentation for more information.
geographic_extents: A list of geographic_instances to use as extents for all datasets on this request. To select all extents, use ["*"]. Only applies to geographic levels where has_geog_extent_selection is true. Required when a geographic level on a dataset is specified where has_geog_extent_selection is true.
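
The field rules above can be checked before a request is sent. The following is a minimal sketch, not part of the IPUMS API or SDKs: the helper name build_extract_request is hypothetical, and it checks only the "Required" rules that can be verified without metadata (it does not know whether a dataset has multiple years or whether has_geog_extent_selection is true).

```python
import json


def build_extract_request(datasets=None, time_series_tables=None, shapefiles=None,
                          description=None, data_format=None,
                          breakdown_and_data_type_layout=None,
                          time_series_table_layout=None, geographic_extents=None):
    """Assemble a request body (hypothetical helper) and check a subset of the field rules."""
    datasets = datasets or {}
    time_series_tables = time_series_tables or {}

    for name, selection in datasets.items():
        # data_tables and geog_levels are required for every dataset
        for field in ("data_tables", "geog_levels"):
            if not selection.get(field):
                raise ValueError(f"dataset {name!r} is missing {field!r}")
        # more than one breakdown value requires a layout at the root of the body
        if len(selection.get("breakdown_values", [])) > 1 and not breakdown_and_data_type_layout:
            raise ValueError("breakdown_and_data_type_layout is required")

    for name, selection in time_series_tables.items():
        if not selection.get("geog_levels"):
            raise ValueError(f"time series table {name!r} is missing 'geog_levels'")

    if (datasets or time_series_tables) and not data_format:
        raise ValueError("data_format is required")
    if time_series_tables and not time_series_table_layout:
        raise ValueError("time_series_table_layout is required")

    body = {
        "datasets": datasets,
        "time_series_tables": time_series_tables,
        "shapefiles": shapefiles,
        "description": description,
        "data_format": data_format,
        "breakdown_and_data_type_layout": breakdown_and_data_type_layout,
        "time_series_table_layout": time_series_table_layout,
        "geographic_extents": geographic_extents,
    }
    # Leave out anything that was not selected
    return {key: value for key, value in body.items() if value}


body = build_extract_request(
    datasets={"2000_SF1b": {"data_tables": ["NP001A"], "geog_levels": ["blck_grp"]}},
    data_format="csv_no_header",
    description="sample",
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the json argument of requests.post in place of a hand-written JSON string. The server remains the authority on whether a request is valid.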

my_headers = {"Authorization": my_key}
url = "https://api.ipums.org/extracts/?collection=nhgis&version=v1"
er = """

{
  "datasets": {
    "1988_1997_CBPa": {
      "years": ["1988", "1989", "1990", "1991", "1992", "1993", "1994"],
      "breakdown_values": ["bs30.si0762", "bs30.si2026"],
      "data_tables": [
        "NT001"
      ],
      "geog_levels": [
        "county"
      ]
    },
    "2000_SF1b": {
      "data_tables": [
        "NP001A"
      ],
      "geog_levels": [
        "blck_grp"
      ]
    }
  },
  "time_series_tables": {
    "A00": {
      "geog_levels": [
        "state"
      ]
    }
  },
  "shapefiles": [
    "us_state_1790_tl2000"
  ],
  "time_series_table_layout": "time_by_file_layout",
  "geographic_extents": ["010"],
  "data_format": "csv_no_header",
  "description": "sample6",
  "breakdown_and_data_type_layout": "single_file"
}

"""
result = requests.post(url, headers=my_headers, json=json.loads(er))
my_extract_number = result.json()["number"]
print(my_extract_number)

# Results
9

url <- "https://api.ipums.org/extracts/?collection=nhgis&version=v1"
mybody <- '

{
  "datasets": {
    "1988_1997_CBPa": {
      "years": ["1988", "1989", "1990", "1991", "1992", "1993", "1994"],
      "breakdown_values": ["bs30.si0762", "bs30.si2026"],
      "data_tables": [
        "NT001"
      ],
      "geog_levels": [
        "county"
      ]
    },
    "2000_SF1b": {
      "data_tables": [
        "NP001A"
      ],
      "geog_levels": [
        "blck_grp"
      ]
    }
  },
  "time_series_tables": {
    "A00": {
      "geog_levels": [
        "state"
      ]
    }
  },
  "shapefiles": [
    "us_state_1790_tl2000"
  ],
  "time_series_table_layout": "time_by_file_layout",
  "geographic_extents": ["010"],
  "data_format": "csv_no_header",
  "description": "sample6",
  "breakdown_and_data_type_layout": "single_file"
}

'
mybody_json <- fromJSON(mybody, simplifyVector = FALSE)
result <- POST(url, add_headers(Authorization = my_key), body = mybody_json, encode = "json", verbose())
res_df <- content(result, "parsed", simplifyDataFrame = TRUE)
my_number <- res_df$number
my_number
      
# Results

[1] 9

curl -X POST \
  "https://api.ipums.org/extracts/?collection=nhgis&version=v1" \
  -H "Content-Type: application/json" \
  -H "Authorization: $MY_KEY" \
  -d '
{
  "datasets": {
    "1988_1997_CBPa": {
      "years": ["1988", "1989", "1990", "1991", "1992", "1993", "1994"],
      "breakdown_values": ["bs30.si0762", "bs30.si2026"],
      "data_tables": [
        "NT001"
      ],
      "geog_levels": [
        "county"
      ]
    },
    "2000_SF1b": {
      "data_tables": [
        "NP001A"
      ],
      "geog_levels": [
        "blck_grp"
      ]
    }
  },
  "time_series_tables": {
    "A00": {
      "geog_levels": [
        "state"
      ]
    }
  },
  "shapefiles": [
    "us_state_1790_tl2000"
  ],
  "time_series_table_layout": "time_by_file_layout",
  "geographic_extents": ["010"],
  "data_format": "csv_no_header",
  "description": "sample6",
  "breakdown_and_data_type_layout": "single_file"
}
'

Get a Request’s Status

After submitting your extract request, you can use the extract number to retrieve the request’s status. Here we’re retrieving status for extract number 743.

r = requests.get(
    "https://api.ipums.org/extracts/743?collection=nhgis&version=v1",
    headers=my_headers
)

pprint(r.json())

{'data_format': 'csv_header',
 'datasets': {'1790_cPop': {'data_tables': ['NT2'], 'geog_levels': ['state']}},
 'description': 'testing123',
 'download_links': {'codebook_preview': 'https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv_PREVIEW.zip',
                    'table_data': 'https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv.zip'},
 'number': 743,
 'status': 'completed',
 'time_series_table_layout': 'time_by_row_layout',
 'time_series_tables': {'B79': {'geog_levels': ['state']}}}

data_extract_status_res <- GET("https://api.ipums.org/extracts/743?collection=nhgis&version=v1", add_headers(Authorization = my_key))
des_df <- content(data_extract_status_res, "parsed", simplifyDataFrame = TRUE)
des_df

$data_format
[1] "csv_header"

$description
[1] "testing123"

$time_series_table_layout
[1] "time_by_row_layout"

$datasets
$datasets$`1790_cPop`
$datasets$`1790_cPop`$data_tables
$datasets$`1790_cPop`$data_tables[[1]]
[1] "NT2"

$datasets$`1790_cPop`$geog_levels
$datasets$`1790_cPop`$geog_levels[[1]]
[1] "state"

$time_series_tables
$time_series_tables$B79
$time_series_tables$B79$geog_levels
$time_series_tables$B79$geog_levels[[1]]
[1] "state"

$number
[1] 743

$status
[1] "completed"

$download_links
$download_links$codebook_preview
[1] "https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv_PREVIEW.zip"

$download_links$table_data
[1] "https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv.zip"

curl -X GET "https://api.ipums.org/extracts/743?collection=nhgis&version=v1" -H "Content-Type: application/json" -H "Authorization: $MY_KEY"

# response:

{
    "data_format": "csv_header",
    "description": "testing123",
    "time_series_table_layout": "time_by_row_layout",
    "datasets": {
        "1790_cPop": {
            "data_tables": [
                "NT2"
            ],
            "geog_levels": [
                "state"
            ]
        }
    },
    "time_series_tables": {
        "B79": {
            "geog_levels": [
                "state"
            ]
        }
    },
    "number": 743,
    "status": "completed",
    "download_links": {
        "codebook_preview": "https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv_PREVIEW.zip",
        "table_data": "https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv.zip"
    }
}

You will get a status such as queued, started, produced, canceled, failed, or completed.
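
Because an extract moves through these statuses over time, a script usually has to poll. The sketch below is one way to do that, under stated assumptions: the function name wait_for_extract, the polling interval, and the choice to treat completed, failed, and canceled as final statuses are all assumptions, not documented API behavior. The status lookup is passed in as a function so the loop itself makes no network calls.

```python
import time

# Assumption: these statuses will not change again once reached
FINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_extract(fetch_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call fetch_status() until it returns a final status, pausing between calls."""
    for attempt in range(max_polls):
        status = fetch_status()
        if status in FINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"extract not finished after {max_polls} status checks")


# Example wiring with the status request shown above (not executed here):
# fetch = lambda: requests.get(
#     "https://api.ipums.org/extracts/743?collection=nhgis&version=v1",
#     headers=my_headers,
# ).json()["status"]
# final_status = wait_for_extract(fetch)
```

Check the returned status before downloading, since the loop also stops on failed or canceled.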

Retrieving Your Extract

To retrieve a completed extract (using extract number 743 as the example again):

Using the request status query above, wait until the status is completed.
Extract the download URL from the response, which is in the download_links attribute:

r = requests.get(
    "https://api.ipums.org/extracts/743?collection=nhgis&version=v1",
    headers=my_headers
)
extract = r.json()
my_extract_links = extract["download_links"]

data_extract_status_res <- GET("https://api.ipums.org/extracts/743?collection=nhgis&version=v1", add_headers(Authorization = my_key))
des_df <- content(data_extract_status_res, "parsed", simplifyDataFrame = TRUE)
des_df$download_links

curl -X GET \
  "https://api.ipums.org/extracts/743?collection=nhgis&version=v1" \
  -H "Content-Type: application/json" \
  -H "Authorization: $MY_KEY"

The download_links portion of the response will look like:

"download_links": {
        "codebook_preview": "https://api.ipums.org/downloads/nhgis/api/v1/extracts/9123456/nhgis0033_csv_PREVIEW.zip",
        "table_data": "https://api.ipums.org/downloads/nhgis/api/v1/extracts/9123456/nhgis0033_csv.zip",
        "gis_data": "https://api.ipums.org/downloads/nhgis/api/v1/extracts/9123456/nhgis0033_shape.zip"
    },

Next, retrieve the file(s) from the URL. You will need to pass the Authorization header with your API key to the server in order to download the data.

# Get the file from the URL and write it to a local file
r = requests.get(my_extract_links["table_data"], allow_redirects=True, headers=my_headers)
with open("nhgis0743_csv.zip", "wb") as f:
    f.write(r.content)

# Retrieve the file from the URL and read it into R using the ipumsr 
# library (https://cran.r-project.org/web/packages/ipumsr/index.html).

# Import the ipumsr library
library(ipumsr)

# Download table data and read into a data frame
# Destination file
zip_file <- "NHGIS_tables.zip"
# Download extract to destination file
download.file(des_df$download_links$table_data, zip_file, headers=c(Authorization=my_key))
# List extract files in ZIP archive
unzip(zip_file, list=TRUE)
# Read 2000 block-group CSV file into a data frame
bg2000_table <- read_nhgis(zip_file, data_layer = contains("2000_blck_grp.csv"))
head(bg2000_table)

curl -H "Authorization: $MY_KEY" "https://api.ipums.org/downloads/nhgis/api/v1/extracts/9123456/nhgis0033_csv.zip" > mydata.zip
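
An extract can have several download links (codebook preview, table data, GIS data). The sketch below downloads every link in download_links and names each local file after the last segment of its URL. It is an illustration only: the helper names local_name and download_links are hypothetical, and it uses Python's standard urllib instead of requests so that the example has no third-party dependencies.

```python
import os
import posixpath
import shutil
import urllib.request
from urllib.parse import urlparse


def local_name(url):
    """Return the file name at the end of a download URL."""
    return posixpath.basename(urlparse(url).path)


def download_links(links, api_key, dest_dir=".", opener=urllib.request.urlopen):
    """Download every URL in links, sending the API key in the Authorization header."""
    saved = {}
    for kind, url in links.items():
        request = urllib.request.Request(url, headers={"Authorization": api_key})
        path = os.path.join(dest_dir, local_name(url))
        # Copy the response to disk in chunks instead of holding it all in memory
        with opener(request) as response, open(path, "wb") as out:
            shutil.copyfileobj(response, out)
        saved[kind] = path
    return saved


# Example (not executed here): saved = download_links(my_extract_links, my_key)
```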

Now you are ready for further processing and analysis as you desire.

Get a Listing of Recent Extract Requests

You may also find it useful to get a historical listing of your extract requests. If you omit the extract number from your API call, the API returns your 10 most recent extract requests by default. To change how many are returned, add a limit=## query parameter to get the ## most recent extracts instead.

r = requests.get(
    "https://api.ipums.org/extracts?collection=nhgis&version=v1",
    headers=my_headers
)

pprint(r.json()[0:5])
    [{'data_format': 'csv_header',
      'datasets': {'1790_cPop': {'data_tables': ['NT2'], 'geog_levels': ['state']}},
      'description': 'testing123',
      'download_links': {},
      'number': 61,
      'status': 'started'},
     {'data_format': 'csv_header',
      'datasets': {'2006_2010_ACS5a': {'data_tables': ['B01001', 'B15002'],
                                       'geog_levels': ['state']}},
      'description': 'test',
      'download_links': {},
      'number': 60,
      'status': 'completed'},
     {'data_format': 'csv_header',
      'datasets': {'2009_2013_ACS5a': {'data_tables': ['B25003'],
                                       'geog_levels': ['puma']}},
      'description': 'Revision of 56: PUMA in 2013 5-year file',
      'download_links': {},
      'number': 59,
      'status': 'completed'},
     {'data_format': 'csv_header',
      'datasets': {'2017_ACS1': {'data_tables': ['B01001'],
                                 'geog_levels': ['nation']}},
      'description': '',
      'download_links': {},
      'number': 58,
      'status': 'completed'},
     {'data_format': 'csv_header',
      'datasets': {'2009_2013_ACS5a': {'data_tables': ['B25003'],
                                       'geog_levels': ['puma']}},
      'description': 'PUMA in 2013 5-year file',
      'download_links': {},
      'number': 56,
      'status': 'completed'}]

# Look up the status of the extract submitted earlier
my_extract_status = None
for extract in r.json():
    if extract["number"] == my_extract_number:
        my_extract_status = extract["status"]
        break
print(my_extract_status)

data_extract_status_res <- GET("https://api.ipums.org/extracts?collection=nhgis&version=v1", add_headers(Authorization = my_key))
de10_df <- content(data_extract_status_res, "parsed", simplifyDataFrame = TRUE)
de10_df[,c("number","status","description")]

curl -X GET \
  "https://api.ipums.org/extracts?collection=nhgis&version=v1" \
  -H "Content-Type: application/json" \
  -H "Authorization: $MY_KEY"
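
The limit parameter described at the start of this section is appended to the same listing URL. The following sketch builds that URL; the helper name recent_extracts_url is hypothetical, and the value 25 is only an example.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.ipums.org/extracts"


def recent_extracts_url(limit=None, collection="nhgis", version="v1"):
    """Build the listing URL, adding limit only when one is given."""
    params = {"collection": collection, "version": version}
    if limit is not None:
        limit = int(limit)
        if limit < 1:
            raise ValueError("limit must be a positive integer")
        params["limit"] = limit
    return f"{BASE_URL}?{urlencode(params)}"


print(recent_extracts_url(25))
```

The returned string can be passed to requests.get together with the Authorization header, as in the listing example above.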