Using Available, Accessible (and Free) Federal Data in Your Research
David M. Mannino, M.D.
Air Pollution and Respiratory Health Branch
Centers for Disease Control and Prevention
Keys to Using Federal Data
- Defining the Research Question
- Background
- Hypothesis
- Defining the study population
- Defining the predictor and outcome variables
- Defining the covariates/confounders
- Random samples vs. complex samples
- Weights
- Response rate issues
- Limitations of the databases
- Relative standard errors
- Cell sizes
- Identifying what types of data could be used to answer your research question:
- Cross sectional data
- Surveillance data
- Longitudinal data (usually historical cohorts)
- The main CDC web site
- Main data sets stored here (http://www.cdc.gov)
- CDC's Wonder System
- Single point of access to numeric public health data (http://www.cdc.gov/nchs/datawh/cdcwond/cdcwond.htm)
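Two of the keys listed above, weights and relative standard errors, come down to simple arithmetic once the survey supplies a weight per record and a standard error per estimate. The following is a minimal Python sketch under stated assumptions: the records are made up, the function names are illustrative, and the 30% cutoff is an assumed placeholder that should be checked against the documentation of the specific survey. In a complex sample the standard error itself has to come from design-aware software such as SUDAAN; here it is simply passed in.

```python
def weighted_proportion(flags, weights):
    """Share of the weighted population whose flag is True."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w for f, w in zip(flags, weights) if f) / total


def relative_standard_error(estimate, standard_error):
    """Relative standard error as a percentage: 100 * SE / estimate."""
    if estimate == 0:
        raise ValueError("RSE is undefined for an estimate of zero")
    return 100.0 * standard_error / abs(estimate)


# Hypothetical records: (has_condition, sample_weight). Not real survey data.
records = [(True, 1200.0), (False, 800.0), (False, 1500.0), (True, 500.0)]
flags = [r[0] for r in records]
weights = [r[1] for r in records]

unweighted = sum(flags) / len(flags)            # ignores the sample design
weighted = weighted_proportion(flags, weights)  # uses the sample weights

RSE_LIMIT = 30.0  # assumed cutoff; check the survey's own guidance
rse = relative_standard_error(0.05, 0.02)  # hypothetical estimate and SE
unreliable = rse > RSE_LIMIT

print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
print(f"RSE={rse:.1f}% unreliable={unreliable}")
```

The gap between the unweighted and weighted proportions in the toy data is the reason weights matter: records that stand for more people count for more.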
Data Examples on CDC Wonder
- Behavioral Risk Factor Surveillance System (BRFSS)
http://www.cdc.gov/brfss/index.htm
- National Occupational Exposure Survey (NOES)
http://www.cdc.gov/noes/default.html
- Cancer: Surveillance, Epidemiology and End Results
http://SEER.Cancer.Gov
- High Speed Internet Access Helpful
- Many datasets can be ordered on CDs
- Analytic Software
- SAS, SUDAAN are the standards
- SPSS and STATA can also be used
Some Examples
Pulmonary Function and Lung Cancer
- Research Question: Are people with low lung function more likely to develop lung cancer, even after adjusting for smoking duration and intensity?
- Database needed: Longitudinal cohort
NHANES I
- Nationally representative survey conducted by the National Center for Health Statistics
- Consisted of extensive questionnaire, physical examination and laboratory testing
- 14,407 adult participants (25-74 years old)
- 6,913 adults participated in cardiorespiratory survey (also nationally representative), which included pulmonary function testing
NHANES I - Spirometry
- Spirometry done using Ohio Medical Instruments 800 spirometer
- 5,542 of the subjects who participated in the cardiorespiratory exam completed spirometry
- The 1,371 subjects who did not complete spirometry were more likely to be > 60 years old (38% vs. 23%, p < 0.01) and of nonwhite race (22% vs. 12%, p < 0.01) than subjects in the final cohort
- 140 subjects with cancer at baseline also excluded, resulting in 5,402 subjects in final cohort
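The cohort counts above can be checked with a few subtractions. This short Python sketch uses only the numbers given on the slides; the variable names are illustrative.

```python
cardiorespiratory_exam = 6913  # adults in the cardiorespiratory survey
no_spirometry = 1371           # did not complete spirometry
cancer_at_baseline = 140       # excluded for cancer at baseline

completed_spirometry = cardiorespiratory_exam - no_spirometry
final_cohort = completed_spirometry - cancer_at_baseline
retained_share = final_cohort / cardiorespiratory_exam

print(completed_spirometry)       # 5542
print(final_cohort)               # 5402
print(f"{retained_share:.1%}")    # 78.1%
```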
Covariates Used in Analysis
- Age
- Race
- Sex
- Education level (≤ 12 or ≥ 13 years)
- Smoking status (Current, Former, Never)
- Pack years of smoking (< 30, 30-59, ≥ 60)
- Years since quitting smoking (0, < 10, ≥ 10)
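The smoking covariates are categorical versions of continuous quantities. Pack-years are packs smoked per day multiplied by years smoked. The Python sketch below shows one way such categories could be coded. It assumes the top categories mean "60 or more" pack-years and "10 or more" years since quitting, and that non-integer values fall into the band below the next cutoff; the function names are illustrative and are not taken from the original analysis.

```python
def pack_years(packs_per_day, years_smoked):
    """Cumulative exposure: packs per day times years smoked."""
    return packs_per_day * years_smoked


def pack_year_category(value):
    """Assumed bands: under 30, 30 up to (not including) 60, 60 or more."""
    if value < 30:
        return "< 30"
    if value < 60:
        return "30-59"
    return ">= 60"


def years_since_quitting_category(years):
    """Assumed bands: 0, under 10, 10 or more."""
    if years < 0:
        raise ValueError("years since quitting cannot be negative")
    if years == 0:
        return "0"
    if years < 10:
        return "< 10"
    return ">= 10"


# Hypothetical participant: 1.5 packs a day for 20 years, quit 4 years ago.
exposure = pack_years(1.5, 20)
print(exposure, pack_year_category(exposure))
print(years_since_quitting_category(4))
```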
Unadjusted Kaplan-Meier Curve for Incident Lung Cancer
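A Kaplan-Meier curve is a product-limit estimate: at each time where an event occurs, the running event-free probability is multiplied by (1 - events / number at risk). The Python sketch below illustrates the calculation on made-up follow-up times, not on NHANES I data; the function name and the toy values are assumptions for illustration. For a comparison across lung function groups, the same calculation would be run once per group.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with an event.

    times  -- follow-up time for each subject
    events -- True if the event was observed, False if censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after t.
        at_risk -= n_leaving
    return curve


# Hypothetical follow-up in years; True marks an incident case.
toy_times = [2, 3, 3, 5, 8, 8, 9, 12]
toy_events = [True, False, True, True, False, True, False, False]

curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Subjects censored at the same time as an event are counted as at risk for that event, which is the usual convention.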

Cirrhosis and Sepsis
- Research Question: Are patients with cirrhosis more likely to develop sepsis?
- Database needed: Large cohort
- Modified Research Question: Are patients hospitalized with cirrhosis more likely than all hospitalized patients to have sepsis and complications?
- Database needed: Hospitalization sample
National Hospital Discharge Survey
- Contains data from about 300,000 hospital discharges annually
- About 1.5 million records over 1995-1999
- Weighted to reflect the 175 million discharges in the US over 1995-1999
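The relationship between sampled records and the weighted total can be made concrete with the approximate figures above. This Python sketch only does the arithmetic; actual weights vary from record to record, so the average is an illustration and not a value to apply to any single discharge.

```python
records_per_year = 300_000         # approximate, from the slide
years = 5                          # 1995 through 1999
weighted_discharges = 175_000_000  # approximate US total, from the slide

sample_records = records_per_year * years
average_weight = weighted_discharges / sample_records

print(sample_records)             # 1500000
print(round(average_weight, 1))   # 116.7
```

On average, then, each sampled record stands in for a little over a hundred discharges nationally.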
Cirrhosis and Sepsis: Analytic Plan
- Search for the following diagnoses in all coded fields
- Sepsis (ICD-9 038.0-038.9)
- Cirrhosis (ICD-9 571)
- Respiratory Failure (ICD-9 518.82, 518.5)
- Compare proportions of death, sepsis, or respiratory failure in cirrhosis-related hospitalizations with those in all hospitalizations
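The plan above amounts to scanning every diagnosis field on each record and then comparing weighted proportions. The Python sketch below shows the idea on made-up records. Several things are assumptions for illustration: diagnoses are stored as strings with a decimal point, sepsis codes carry a leading zero (038.x), the record layout with `weight` and `diagnoses` keys is hypothetical, and the helper names are not from any survey software. Death would come from a discharge-status field and is left out here.

```python
SEPSIS_PREFIXES = ("038.",)               # 038.0-038.9; leading zero assumed
CIRRHOSIS_PREFIXES = ("571",)             # 571 and its subcodes
RESP_FAILURE_CODES = {"518.82", "518.5"}  # codes as listed on the slide


def has_sepsis(diagnoses):
    return any(code.startswith(SEPSIS_PREFIXES) for code in diagnoses)


def has_cirrhosis(diagnoses):
    return any(code.startswith(CIRRHOSIS_PREFIXES) for code in diagnoses)


def has_resp_failure(diagnoses):
    return any(code in RESP_FAILURE_CODES for code in diagnoses)


def weighted_share(records, outcome):
    """Weighted proportion of records whose diagnoses satisfy `outcome`."""
    total = sum(r["weight"] for r in records)
    hits = sum(r["weight"] for r in records if outcome(r["diagnoses"]))
    return hits / total


# Hypothetical discharge records; weights and codes are invented.
records = [
    {"weight": 100.0, "diagnoses": ["571.2", "038.9"]},
    {"weight": 120.0, "diagnoses": ["571.5"]},
    {"weight": 110.0, "diagnoses": ["486", "518.82"]},
    {"weight": 90.0, "diagnoses": ["410.9"]},
    {"weight": 80.0, "diagnoses": ["038.0", "518.5"]},
]

cirrhosis_records = [r for r in records if has_cirrhosis(r["diagnoses"])]
sepsis_in_cirrhosis = weighted_share(cirrhosis_records, has_sepsis)
sepsis_in_all = weighted_share(records, has_sepsis)

print(f"sepsis, cirrhosis hospitalizations: {sepsis_in_cirrhosis:.3f}")
print(f"sepsis, all hospitalizations:       {sepsis_in_all:.3f}")
```

Checking every field, not only the first-listed diagnosis, matters because sepsis and cirrhosis are often recorded as secondary diagnoses.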
Cirrhosis and Sepsis: Results
- 1.7 million cirrhosis-related hospitalizations over 1995-1999
- 128,000 deaths (7.5%)
- 78,000 with sepsis (4.6%)
- 81,000 with respiratory failure (4.7%)
- Age-adjusted risk ratios
- Death 2.7 (95% C.I. 2.3, 3.1)
- Sepsis 2.6 (95% C.I. 1.9, 3.3)
- Respiratory Failure 1.4 (95% C.I. 1.1, 1.8)
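A risk ratio is the proportion with the outcome in one group divided by the proportion in the comparison group. The Python sketch below computes a crude (not age-adjusted) risk ratio with a log-based 95% confidence interval from made-up counts. It is not the method behind the figures above: it does no age adjustment and treats the data as a simple random sample, whereas a complex survey needs design-based variance from software such as SUDAAN. The function name and counts are illustrative assumptions.

```python
import math


def risk_ratio_ci(a, n1, b, n0, z=1.96):
    """Crude risk ratio with a log-based confidence interval.

    a  -- outcomes in the exposed group, of n1 subjects
    b  -- outcomes in the comparison group, of n0 subjects
    """
    if min(a, b) <= 0 or a > n1 or b > n0:
        raise ValueError("counts must be positive and no larger than group size")
    rr = (a / n1) / (b / n0)
    # Standard error of log(RR) under simple random sampling.
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n0)
    low = math.exp(math.log(rr) - z * se_log)
    high = math.exp(math.log(rr) + z * se_log)
    return rr, low, high


# Hypothetical counts: 30 of 200 exposed vs. 20 of 400 unexposed.
rr, low, high = risk_ratio_ci(30, 200, 20, 400)
print(f"RR={rr:.2f} (95% C.I. {low:.2f}, {high:.2f})")
```

An interval that excludes 1.0, as all three intervals on the slide do, indicates that the difference between groups is unlikely to be due to sampling variation alone.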



