15 Common Use Cases
This chapter includes two common use cases:
The first case study substitutes the default household population data (the estimation dataset) with a locally specific US Census Public Use Microdata Sample (PUMS), a valuable way to make your VE model reflect local conditions, and then rebuilds the packages that rely on the PUMS data for some of their estimation work.
The second case study shows how to change the data used to build internal VisionEval modules, in this case to adjust future fleet composition information.
Both use cases identify how rebuilding the package data differs depending on which VisionEval install process was used.
15.1 Case Study 1: Using local PUMS data
15.1.1 What are PUMS?
To summarize, the US Census Bureau provides anonymized data in two general forms:
Aggregated census tables - These tables provide the total or estimated counts by topic (e.g., total number of persons by age group). The smallest geographic unit is the census block, but not all data are available at that level.
Disaggregated PUMS - A sample of individual record-level data for each person or household counted (e.g., a person's age, gender, employment, and the household they belong to). The smallest geographic unit is the Public Use Microdata Area (PUMA), an aggregated area that must include at least 100,000 persons to protect confidentiality.
Most people are at least somewhat familiar with the US Census and the information it collects. The primary function of the US Census is to collect a count of people living in the United States for federal allocation of political representatives and taxes. However, the US Census has since expanded to include a variety of other useful statistical information regarding demographics and employment. Census data are spatially organized into a hierarchy of sub-divided spatial areas, the smallest of which is the Census Block. Blocks aggregate into Block Groups, Tracts, Counties, and States. See the example figure below:
source: US Census
The primary census program is the Decennial Census, which is a comprehensive count collected every 10 years. However, because populations can change significantly within a decade, the American Community Survey (ACS) was created to obtain data at more frequent intervals. Rather than a full census, the ACS collects ongoing samples on a monthly basis. These data are then used to publish statistically adjusted 1-year, 3-year, and 5-year estimates. The 1-year estimates use the most recent data but are the least reliable because the sample is smaller, whereas the 5-year estimates pool data from the previous 5 years. Although not exactly equivalent, the 1- and 5-year estimates are often considered analogous to a 1% and a 5% sample of the population.
The summary tables provide total counts of persons, but they are aggregated: they show the number of persons in each topic, not in each combination of topics. For example, we may know the count of people by age group, gender, occupation, and household size, but we do not know the count for a particular combination of those variables, or to which household each person belongs. For this reason, the US Census Bureau also releases what it calls a Public Use Microdata Sample (PUMS) using sample data from the ACS.
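The difference between aggregated tables and microdata can be illustrated with a small R sketch. The records below are invented for illustration and do not use real PUMS field names; the point is only that record-level data supports cross-tabulations and household groupings that a single-topic summary table cannot.

```r
# Toy person records (made-up values, hypothetical column names)
persons <- data.frame(
  household = c(1, 1, 2, 3, 3, 3),
  age_group = c("20-29", "20-29", "65+", "30-54", "30-54", "0-14"),
  employed  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# What an aggregated table provides: one topic at a time
by_age <- table(persons$age_group)

# What microdata additionally allows: any combination of topics ...
by_age_employed <- table(persons$age_group, persons$employed)

# ... and grouping persons into the households they belong to
hh_size <- table(persons$household)

print(by_age)
print(by_age_employed)
print(hh_size)
```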
The generalized approach to updating data within a VE package is set out below.
15.1.2 Instructions
15.1.2.1 Step 1) Gather PUMS and replace data:
In this example we will replace the default PUMS data in the VESimHouseholds package with your project-specific local PUMS data. Navigate to the src directory, whose location depends on how you obtained VisionEval. The source code for this package should be located in the VESimHouseholds directory (e.g., C:/Users/<user name>/Documents/VisionEval/src/VESimHouseholds).
Packages require their data to be in a certain format. In this case the VESimHouseholds package requires two input data files: pums_households.csv and pums_persons.csv.
15.1.2.1.1 A) Download PUMS data
US Census data are available from the Census’ website (https://www.census.gov/), which provides an interface to search, browse, and download Census data in a variety of formats, the most typical being Comma Separated Value (CSV) files. PUMS data can be filtered using the Census data browser, or the entire PUMS tables for States can be downloaded from the legacy FTP website: https://www2.census.gov/programs-surveys/acs/data/pums/
The files are named according to file type (e.g., csv_), record type ("h" for household or "p" for persons), and then the State abbreviation. For example, "csv_haz.zip" contains the household PUMS data for Arizona. Additional documentation can be found here: https://www.census.gov/programs-surveys/acs/microdata/access.html
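The naming convention can be sketched as a small R helper. The function name `pums_file_url` is hypothetical, the year 2019 is only an example value, and the `<year>/5-Year/` directory layout is an assumption taken from the helper scripts later in this chapter; check the FTP listing for the year you need.

```r
# Hypothetical helper: build the PUMS zip file name and its download URL.
# record_type: "h" (households) or "p" (persons); state_abb: e.g. "AZ".
pums_file_url <- function(record_type, state_abb, year, period = "5-Year") {
  stopifnot(record_type %in% c("h", "p"))
  base_url <- "https://www2.census.gov/programs-surveys/acs/data/pums"
  file_name <- paste0("csv_", record_type, tolower(state_abb), ".zip")
  list(
    file = file_name,
    url  = paste(base_url, year, period, file_name, sep = "/")
  )
}

# Household file for Arizona (the year is an assumed example)
az <- pums_file_url("h", "AZ", 2019)
print(az$file)
print(az$url)
```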
15.1.2.1.2 B) Process PUMS data.
VE was originally coded using an older PUMS file with slightly different field names, so the fields in newer PUMS files must be renamed. A name mapping key is in the table below:
Table name | VESimHouseholds field | New PUMS field | Description |
---|---|---|---|
pums_households.csv | SERIALNO | SERIALNO | Housing/Group Quarters Unit Serial Number |
pums_households.csv | PUMA5 | PUMA | 5% Public Use Microdata Area code |
pums_households.csv | HWEIGHT | WGTP | Housing unit weight |
pums_households.csv | UNITTYPE | TYPEHUGQ | Type of housing unit |
pums_households.csv | PERSONS | NP | Number of persons living in housing unit |
pums_households.csv | BLDGSZ | BLD | Size of building |
pums_households.csv | HINC | HINCP | Household total income in 1999 US dollars |
pums_persons.csv | AGE | AGEP | Age |
pums_persons.csv | WRKLYR | WKL | Worked in year |
pums_persons.csv | MILITARY | MIL | In military |
pums_persons.csv | INCTOT | PINCP | Person's total income |
Depending on the file, other pre-processing may be required, such as removing NAs or converting categories. Examples include converting missing (NA) values in HINC to 0, shifting the UNITTYPE scale from {1,2,3} to {0,1,2}, or aggregating the 4-level WKL categories into the 3 levels of WRKLYR. If these conversions are not made, issues may arise in the package building step.
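A minimal base R sketch of the household renaming and two of the conversions mentioned above follows. The records are invented for illustration, and the exact conversions your file needs depend on the PUMS year, so treat this as a starting point. The WKL to WRKLYR aggregation is not shown because the category mapping is not specified here.

```r
# Toy household records using the newer PUMS field names (made-up values)
hh <- data.frame(
  SERIALNO = c("0001", "0002", "0003"),
  PUMA     = c("00100", "00100", "00200"),
  WGTP     = c(20, 35, 12),
  TYPEHUGQ = c(1, 2, 3),
  NP       = c(2, 1, 4),
  BLD      = c("02", "05", "02"),
  HINCP    = c(52000, NA, 18000),
  stringsAsFactors = FALSE
)

# Mapping from the table: names are VESimHouseholds fields, values are PUMS fields
hh_map <- c(SERIALNO = "SERIALNO", PUMA5 = "PUMA", HWEIGHT = "WGTP",
            UNITTYPE = "TYPEHUGQ", PERSONS = "NP", BLDGSZ = "BLD",
            HINC = "HINCP")

# Keep only the mapped columns, in order, and rename them
hh <- hh[, hh_map]
names(hh) <- names(hh_map)

# Convert missing household income to 0
hh$HINC[is.na(hh$HINC)] <- 0

# Shift the unit type scale from {1,2,3} to {0,1,2}
hh$UNITTYPE <- hh$UNITTYPE - 1

print(hh)
```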
15.1.2.2 Step 2) Package building
The objective of rebuilding a package is to install it from its source code into the VisionEval environment so that the new data are picked up. This guide uses the RStudio interface and describes the procedure for rebuilding a single package.
15.1.2.2.1 A) Initialize the VisionEval Environment
To start the VisionEval environment, navigate to your VisionEval runtime directory (e.g., C:/Users/<user name>/Documents/VisionEval) and double-click VisionEval.Rproj. The RStudio layout should look similar to the figure below (there may be minor differences):
There are two options for the next step: (B1) using RStudio Build Tools, or (B2) using the R native install command. Instructions for both methods are included in steps B1 and B2 below.
15.1.2.2.2 B1) Using RStudio Build Tools
15.1.2.2.2.3 3) Install from package source
Click the “Build” drop-down from the main banner menu again. This time there will be new options; select “Install Package”.
15.1.2.2.2.4 4) Build again
After one successful build, you must run the build again to ensure that (1) the new source data files have been generated and (2) the new data files have been loaded into the VisionEval package.
At this point the new data should be imported and usable through the VESimHouseholds package. The last step is to test whether the updated data are available within the VESimHouseholds package by inspecting the data using the command VESimHouseholds::Hh_df in the RStudio console.
15.1.2.2.3 B2) Using R native install command
The R command install.packages is used to install R packages. Running the command
install.packages("C:/Users/<user name>/Documents/VisionEval/src/modules/VESimHouseholds", repos=NULL, type="source")
within the VisionEval environment will rebuild and install the VESimHouseholds package into VisionEval.
15.1.2.2.4 C) Update Dependent Packages
The final step of incorporating local PUMS data is to update the packages that have built-in estimation processes and use the PUMS data to estimate models. The PredictHousing module from the VELandUse package uses PUMS data to estimate its housing choice model. It is therefore important to rebuild the VELandUse package after rebuilding the VESimHouseholds package, once the updated PUMS data are available. Follow steps B1) or B2) to rebuild the VELandUse package.
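The rebuild order can be captured in a short R sketch. The function `rebuild_in_order` and its `dry_run` argument are hypothetical, and the `src/modules` and `ve-lib` paths are placeholders to replace with the locations in your own installation. With `dry_run = TRUE` nothing is installed; the function only reports what it would do.

```r
# Hypothetical helper: rebuild packages from source in dependency order.
# VESimHouseholds must come first so VELandUse sees the updated PUMS data.
rebuild_in_order <- function(src_dir, lib_dir,
                             packages = c("VESimHouseholds", "VELandUse"),
                             dry_run = TRUE) {
  pkg_paths <- file.path(src_dir, packages)
  for (pkg_path in pkg_paths) {
    if (dry_run) {
      message("Would install: ", pkg_path, " into ", lib_dir)
    } else {
      install.packages(pkg_path, lib = lib_dir, repos = NULL, type = "source")
    }
  }
  invisible(pkg_paths)
}

# Placeholder paths; set dry_run = FALSE to actually install
paths <- rebuild_in_order("src/modules", "ve-lib")
```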
Done!
15.2 Case Study 2: VEPowertrainsAndFuels
There may be scenarios where we want to study a future fleet mix (e.g., penetration of electric vehicles) that differs from the default fleet mix that comes with the VEPowertrainsAndFuels package. This was the motivation behind this case study. The default fleet mix can be updated by simply replacing the hh_powertrain_prop.csv input file, similar to Case Study 1, with a version customized for the intended study. The package must be rebuilt for the new input file to take effect in a VisionEval model run. The steps to rebuilding are similar to Case Study 1 and are outlined here.
The input data for the VEPowertrainsAndFuels package is in the VEPowertrainsAndFuels\inst\extdata\ directory. Each of the input files can be updated to reflect changes in the fleet makeup as well as the fuel types that the vehicles use. The hh_powertrain_prop.csv file contains the proportions of household vehicle powertrain types by vehicle type and vehicle vintage year. This case study presents the steps to update this input file. A more detailed description of the structure and content of the file can be found in the hh_powertrain_prop.txt file in the same directory. The figure below shows where the input file is located within the source code of the VEPowertrainsAndFuels package.
15.2.1 Instructions
This case study explores the basic level of analysis needed to update the data to ensure integrity and consistency between other data components within the package. Any spreadsheet application can be used to alter the default data values and perform analysis.
This section walks users through a brief analysis that defines a modifying function and demonstrates the effects of the modifications.
15.2.1.1 Step 1) Data
VEPowertrainsAndFuels\inst\extdata\hh_powertrain_prop.csv contains the default powertrain proportions included in the package, and resembles the table below (the table is compressed to selected years for clarity). The file's purpose is to provide the proportions of sales by vehicle powertrain, vehicle type (auto and light truck), and vehicle vintage year.
ModelYear | AutoPropIcev | AutoPropHev | AutoPropPhev | AutoPropBev | LtTrkPropIcev | LtTrkPropHev | LtTrkPropPhev | LtTrkPropBev |
---|---|---|---|---|---|---|---|---|
1975 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2010 | 0.8786 | 0.1213 | 0 | 0.0001 | 0.9820 | 0.0180 | 0 | 0 |
2020 | 0.8212 | 0.0788 | 0.0202 | 0.0798 | 0.9524 | 0.0143 | 0.0067 | 0.0266 |
2030 | 0.6676 | 0.0908 | 0.0358 | 0.2058 | 0.9093 | 0.0179 | 0.0106 | 0.0622 |
2040 | 0.5701 | 0.0922 | 0.0403 | 0.2974 | 0.8996 | 0.0191 | 0.0114 | 0.0698 |
2050 | 0.5198 | 0.0895 | 0.0407 | 0.3500 | 0.8916 | 0.0193 | 0.0119 | 0.0772 |
The table contains two sets of powertrain proportions: the four columns after ModelYear are for automobiles (i.e., AutoProp) and the right-most four are for light trucks (i.e., LtTrkProp). Each set sums to 1 (for a row sum of 2).
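This sum-to-one property is worth checking whenever the file is edited. The sketch below, in base R, re-enters a few rows from the table above and checks each vehicle type separately. The tolerance of 0.001 is an assumption chosen to allow for the four-decimal rounding of the displayed values (the 2040 light truck row sums to 0.9999).

```r
# Selected rows copied from the table above
powertrain <- data.frame(
  ModelYear     = c(2010, 2020, 2030, 2040, 2050),
  AutoPropIcev  = c(0.8786, 0.8212, 0.6676, 0.5701, 0.5198),
  AutoPropHev   = c(0.1213, 0.0788, 0.0908, 0.0922, 0.0895),
  AutoPropPhev  = c(0,      0.0202, 0.0358, 0.0403, 0.0407),
  AutoPropBev   = c(0.0001, 0.0798, 0.2058, 0.2974, 0.3500),
  LtTrkPropIcev = c(0.9820, 0.9524, 0.9093, 0.8996, 0.8916),
  LtTrkPropHev  = c(0.0180, 0.0143, 0.0179, 0.0191, 0.0193),
  LtTrkPropPhev = c(0,      0.0067, 0.0106, 0.0114, 0.0119),
  LtTrkPropBev  = c(0,      0.0266, 0.0622, 0.0698, 0.0772)
)

# Sum the columns belonging to each vehicle type
auto_sum  <- rowSums(powertrain[, grepl("^AutoProp", names(powertrain))])
lttrk_sum <- rowSums(powertrain[, grepl("^LtTrkProp", names(powertrain))])

# Flag any model year whose proportions do not sum to 1 (within rounding)
tol <- 1e-3
bad_years <- powertrain$ModelYear[abs(auto_sum - 1) > tol | abs(lttrk_sum - 1) > tol]
print(bad_years)
```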
15.2.1.2 Step 2) Analysis
Here we will conduct a brief exploratory analysis to show visually what the data look like and how they will be modified. Using a standard spreadsheet application we can format and visualize the data as shown in the figure below.
We can see that battery electric vehicles (BEV), specifically automobiles, are projected to make up a growing share of vehicles bought in future years, reaching about 35% of automobile sales by 2050 in the default data. This causes the share of internal combustion engines to decline proportionally.
Let us assume that the state government is deciding whether to aggressively promote BEV cars starting in 2025. The policies cause the share of alternative powertrains (BEV, HEV, and PHEV) to increase more over time. To model this increase, we will use an arbitrary function which adds to the current value of $x$ (i.e., the proportion) at a quadratic rate.
$$
f(x) = x + (x^2) (1 - x)
$$
We use this function to adjust each of the alternative powertrains in the spreadsheet. To ensure that the proportions sum up to 1 for autos and light trucks, respectively, we then calculate the remaining proportion of ICE powertrains by subtracting the total proportion of alternative powertrains. The following figure shows the effect of increasing the share of alternative powertrain at a quadratic rate compared to default data.
We then update the existing hh_powertrain_prop.csv file for the year 2025 and above with the newly calculated values.
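The same adjustment can be done in R instead of a spreadsheet. The sketch below applies the function f(x) to the alternative powertrain columns for model years from 2025 onward and recomputes the ICE share as the remainder. The helper names `boost` and `adjust_powertrains` are hypothetical, and only three automobile rows from the table are used for illustration. The non-negativity check is an added safeguard, since boosting three shares independently could in principle push their total above 1.

```r
# The modifying function from the text: f(x) = x + x^2 (1 - x)
boost <- function(x) x + x^2 * (1 - x)

# Hypothetical helper: adjust one vehicle type ("AutoProp" or "LtTrkProp")
adjust_powertrains <- function(df, prefix, start_year = 2025) {
  alt_cols <- paste0(prefix, c("Hev", "Phev", "Bev"))
  ice_col  <- paste0(prefix, "Icev")
  rows <- df$ModelYear >= start_year
  # Increase each alternative powertrain share
  for (col in alt_cols) {
    df[rows, col] <- boost(df[rows, col])
  }
  # ICE takes whatever share remains so each row still sums to 1
  df[rows, ice_col] <- 1 - rowSums(df[rows, alt_cols, drop = FALSE])
  stopifnot(all(df[[ice_col]] >= 0))
  df
}

# Three automobile rows copied from the default table
autos <- data.frame(
  ModelYear    = c(2020, 2030, 2050),
  AutoPropIcev = c(0.8212, 0.6676, 0.5198),
  AutoPropHev  = c(0.0788, 0.0908, 0.0895),
  AutoPropPhev = c(0.0202, 0.0358, 0.0407),
  AutoPropBev  = c(0.0798, 0.2058, 0.3500)
)

adjusted <- adjust_powertrains(autos, "AutoProp")
print(adjusted)
```

Rows before 2025 are left untouched, and the result can be written back with write.csv once both vehicle types have been adjusted.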
15.2.1.3 Step 3) Build Package
Once the data file has been updated you will need to re-build and re-install the VEPowertrainsAndFuels package for VisionEval to use this new fleet mix data.
We can follow the instructions listed in Step 2) of Case Study 1 to rebuild the package.
Once the package re-build is complete, your new powertrain data will be ready to use in a VisionEval model run.
15.3 Miscellaneous Information
This section contains miscellaneous information that may be useful for more advanced users.
VisionEval Package Structure
Build from command line
PUMS data processing helper scripts
Modifying package code
15.3.1 VisionEval Package Structure
The source code of VisionEval packages will generally have a structure similar to the following:
src/VEGenericPackage
├───data
│ ├─ GenericPackageSpecifications.rda
│ ├─ GenericPackage_df.rda
│ └─ GenericPackage_ls.rda
├───R
│ ├─ CreateEstimationDatasets.R
│ └─ GenericModel.R
└───inst
  └─ extdata
    ├─ input_data1.csv
    └─ input_data2.txt
inst\extdata is where “external” input data sources and reference files are placed.
The R directory contains the R scripts used in the package. These must be independent, non-sequential scripts that do not depend on results from other scripts.
data contains the resulting data that VisionEval generates and uses.
man and inst\module_docs contain the documentation generated during the build process.
15.3.2 Build from command line
While the GUI method is intuitive, it can be convenient to simply execute a build command from a generic R session rather than navigating menu trees in the GUI.
The fundamental command to build an R package can be run from the R console using system("R CMD INSTALL package_path -l lib_path"). The GUI method essentially constructs this command and executes it.
package_path is the path to the source code of the package you are building, e.g., "C:\Users\<user name>\Documents\VisionEval\src\modules\VESimHouseholds". If your working directory is already the package directory, you can use "." to denote the local directory.
lib_path is the library of the runtime environment, in this case the VisionEval environment, e.g., "C:\Users\<user name>\Documents\VisionEval\ve-lib".
Here’s an example of a command that is used to rebuild VESimHouseholds package from its source code into VisionEval.
system('R CMD INSTALL "C:/Users/<user name>/Documents/VisionEval/src/modules/VESimHouseholds" -l "C:/Users/<user name>/Documents/VisionEval/ve-lib"')
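Because the paths contain spaces and quoting is easy to get wrong, it can help to assemble the command with shQuote rather than typing the quotes by hand. This is a sketch only: `build_install_cmd` is a hypothetical helper, and the paths are the same placeholders used above. The system call is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: assemble the R CMD INSTALL command with safe quoting.
# shQuote() picks the quoting style appropriate for the current platform.
build_install_cmd <- function(package_path, lib_path) {
  paste("R CMD INSTALL", shQuote(package_path), "-l", shQuote(lib_path))
}

cmd <- build_install_cmd(
  "C:/Users/<user name>/Documents/VisionEval/src/modules/VESimHouseholds",
  "C:/Users/<user name>/Documents/VisionEval/ve-lib"
)
print(cmd)

# Uncomment to run the build:
# system(cmd)
```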
15.3.3 PUMS data processing helper scripts
Processing PUMS data can be challenging for two reasons.
PUMS data evolves, with some field names and levels changing.
The 2000 PUMS are stored in a compressible serial text file structure, rather than a common delimited file (e.g., CSV), making importing tedious.
Below are some helper scripts for future users to build upon:
NOTE: These may not work with all PUMS file years, operating systems, or R versions. A best effort was made to identify weak points (e.g., unzipping), but the scripts cannot be guaranteed to work. They are meant as a starting point, not production-level code.
15.3.4 PUMS File import and header processing
# IMPORTS
library(data.table)
library(tools)
# Function to process PUMS as it is read in
process_acs_pums <- function(PumsFile, type, GetPumas='ALL') {
# ACS PUMS to legacy Census PUMS fields
# Make any modifications here as necessary
meta = list(
'h' = list(
SERIALNO = list(acsname = 'SERIALNO', class ='character'),
PUMA5 = list(acsname='PUMA', class='character'),
HWEIGHT = list(acsname='WGTP', class='numeric'),
UNITTYPE = list(acsname='TYPE', class='numeric'),
PERSONS = list(acsname='NP', class='numeric'),
BLDGSZ = list(acsname='BLD', class='character'),
HINC = list(acsname='HINCP', class='numeric')
),
'p' = list(
SERIALNO = list(acsname = 'SERIALNO', class ='character'),
AGE = list(acsname='AGEP', class='numeric'),
WRKLYR = list(acsname='WKL', class='character'),
MILITARY = list(acsname='MIL', class='numeric'),
INCTOT = list(acsname='PINCP', class='numeric')
)
)
colNames <- lapply(meta, function(x) sapply(x, function(y) y[['acsname']]))
colclass <- lapply(meta, function(x) sapply(unname(x), function(y) {
setNames(y[['class']], y[['acsname']])
}))
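# Build a shell command that streams the archive contents to fread
# (assumes unzip or gunzip is available on the system path)
if(Sys.info()['sysname'] == 'Windows') {
cmd <- paste0("unzip -p ", shQuote(PumsFile))
}
if(Sys.info()['sysname'] == 'Linux') {
cmd <- paste0("gunzip -cq ", shQuote(PumsFile))
}
# Checks if it is a zip file or a bytefile
if(grepl('.zip', PumsFile, fixed = TRUE)) {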
df <- fread(cmd = cmd,
select = names(colclass[[type]]),
colClasses = colclass[[type]])
} else {
df <- fread(PumsFile,
select = names(colclass[[type]]),
colClasses = colclass[[type]])
}
# Rename ACS PUMS fields to match legacy Census PUMS fields
setnames(df, colNames[[type]], names(colNames[[type]]))
return(df)
}
process_2000_pums <- function(PumsFile, GetPumas='ALL') {
#Read in file and split out household and person tables
Pums_ <- readLines(PumsFile)
RecordType_ <-
as.vector(sapply(Pums_, function(x) {
substr(x, 1, 1)
}))
H_ <- Pums_[RecordType_ == "H"]
P_ <- Pums_[RecordType_ == "P"]
rm(Pums_, RecordType_, PumsFile)
#Define a function to extract specified PUMS data and put in data frame
extractFromPums <-
function(Pums_, Fields_ls) {
lapply(Fields_ls, function(x) {
x$typeFun(unlist(lapply(Pums_, function(y) {
substr(y, x$Start, x$Stop)
})))
})
}
#Identify the housing data to extract
HFields_ls <-
list(
SERIALNO = list(Start = 2, Stop = 8, typeFun = as.character),
PUMA5 = list(Start = 19, Stop = 23, typeFun = as.character),
HWEIGHT = list(Start = 102, Stop = 105, typeFun = as.numeric),
UNITTYPE = list(Start = 108, Stop = 108, typeFun = as.numeric),
PERSONS = list(Start = 106, Stop = 107, typeFun = as.numeric),
BLDGSZ = list(Start = 115, Stop = 116, typeFun = as.character),
HINC = list(Start = 251, Stop = 258, typeFun = as.numeric)
)
#Extract the housing data and clean up
H_df <- data.frame(extractFromPums(H_, HFields_ls), stringsAsFactors = FALSE)
#Extract records for desired PUMAs
if (GetPumas[1] != "ALL") {
H_df <- H_df[H_df$PUMA5 %in% GetPumas,]
}
#Identify the person data to extract
PFields_ls <-
list(
SERIALNO = list(Start = 2, Stop = 8, typeFun = as.character),
AGE = list(Start = 25, Stop = 26, typeFun = as.numeric),
WRKLYR = list(Start = 236, Stop = 236, typeFun = as.character),
MILITARY = list(Start = 138, Stop = 138, typeFun = as.numeric),
INCTOT = list(Start = 297, Stop = 303, typeFun = as.numeric)
)
#Extract the person data and clean up
P_df <- data.frame(extractFromPums(P_, PFields_ls), stringsAsFactors = FALSE)
#If not getting data for entire state, limit person records to be consistent
if (GetPumas[1] != "ALL") {
P_df <- P_df[P_df$SERIALNO %in% unique(H_df$SERIALNO),]
}
return( list('p' = P_df, 'h' = H_df) )
}
15.3.5 PUMS data web-scraping
This has been automated one step further by scraping the data and running the above functions on the files as they are read in.
# Downloads and processes legacy 2000 PUMS data
getDecPUMS <- function(STATE, output_dir = NA){
#VARS
state_codes <- fread('state.txt')
state_codes <- setNames(state_codes$STATE, state_codes$STUSAB)
base_url = 'https://www2.census.gov/census_2000/datasets/PUMS/FivePercent'
# Convert a full state name (e.g., "Arizona") to its two-letter abbreviation
if(!is.numeric(STATE) && nchar(STATE) > 2) {
STATE <- state.abb[match(toTitleCase(STATE),state.name)]
}
STATE_NAME <- state.name[match(toupper(STATE),state.abb)]
if(!is.numeric(STATE)) STATE_NUM <- state_codes[toupper(STATE)]
# Download the PUMS data to tempfile and load directly to data table
url <- file.path(base_url,
STATE_NAME,
paste0('REVISEDPUMS5_', sprintf("%02d", STATE_NUM), '.TXT'))
temp <- tempfile()
download.file(url, temp)
# Read .txt to data frames
PUMS <- process_2000_pums(temp)
# SAVE OUTPUT
if(!is.na(output_dir)) {
if(!dir.exists(output_dir)) dir.create(output_dir)
fwrite(PUMS[['p']], file.path(output_dir, 'pums_persons.csv'))
fwrite(PUMS[['h']], file.path(output_dir, 'pums_households.csv'))
} else {
return(PUMS)
}
}
# Downloads and processes post-2000 PUMS
getACSPUMS <- function(STATE, YEAR='2000', GetPumas='ALL', output_dir = NA, save_zip = TRUE){
#VARS
try({
state_codes <- fread('state.txt')
state_codes <- setNames(state_codes$STATE, state_codes$STUSAB)
})
base_url = 'https://www2.census.gov/programs-surveys/acs/data/pums'
# Convert a full state name (e.g., "Arizona") to its lowercase abbreviation
if(!is.numeric(STATE) && nchar(STATE) > 2) {
STATE <- tolower(state.abb[match(toTitleCase(STATE),state.name)])
}
# Create the output directory first so the zip file can be saved into it
if(!is.na(output_dir) && !dir.exists(output_dir)) dir.create(output_dir)
# Download the PUMS data to tempfile and load directly to data table
PUMS <- lapply(c('p', 'h'), function(f) {
url <- file.path(base_url, YEAR, '5-Year',
paste0('csv_', f, tolower(STATE), '.zip'))
if(!save_zip || is.na(output_dir)){
# Keep the .zip extension so process_acs_pums() detects the archive
temp <- tempfile(fileext = '.zip')
} else {
temp <- file.path(output_dir, basename(url))
}
download.file(url, temp)
df <- process_acs_pums(temp, type=f, GetPumas)
return(df)
})
names(PUMS) <- c('p', 'h')
# SAVE OUTPUT
if(!is.na(output_dir)) {
if(!dir.exists(output_dir)) dir.create(output_dir)
fwrite(PUMS[['p']], file.path(output_dir, 'pums_persons.csv'))
fwrite(PUMS[['h']], file.path(output_dir, 'pums_households.csv'))
} else {
return(PUMS)
}
}