
Using Available, Accessible (and Free) Federal Data in Your Research


David M Mannino, M.D.

Air Pollution and Respiratory Health Branch
Centers for Disease Control and Prevention

Keys to Using Federal Data

Having a plan

  • Defining the Research Question
  • Background
  • Hypothesis
  • Defining the study population
  • Defining the predictor and outcome variables
  • Defining the covariates/confounders

Understanding the Data sets

  • Random samples vs. complex samples
    - Weights
  • Response rate issues
  • Limitations of the databases
    - Relative standard errors
    - Cell sizes
  • Identifying what types of data could be used to answer your research question:
    - Cross sectional data
    - Surveillance data
    - Longitudinal data
      • Usually historical cohorts
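
To make the points about weights and relative standard errors concrete, here is a minimal sketch using made-up records, not data from any federal survey. The function names, the toy values, the standard error of 0.05, and the 30% reliability cutoff are all illustrative assumptions; check the documentation of the data set you are using for its own reliability rules, and use design-based software for real standard errors.

```python
def weighted_proportion(flags, weights):
    """Share of the weighted population whose flag is true."""
    flagged = sum(w for f, w in zip(flags, weights) if f)
    return flagged / sum(weights)


def relative_standard_error(estimate, standard_error):
    """Relative standard error, expressed as a percentage of the estimate."""
    return 100.0 * standard_error / estimate


# Hypothetical toy records: 1 = has the condition; weight = people represented.
has_condition = [1, 0, 0, 1, 0]
sample_weight = [1000, 4000, 2500, 500, 2000]

# Treating a complex sample as a simple random sample ignores the weights.
unweighted = sum(has_condition) / len(has_condition)
weighted = weighted_proportion(has_condition, sample_weight)

# Hypothetical standard error; a real one must account for the survey design.
rse = relative_standard_error(weighted, 0.05)
unreliable = rse > 30  # assumed cutoff, not taken from the slides

print(f"unweighted={unweighted:.2f} weighted={weighted:.2f} RSE={rse:.1f}%")
```

In this toy example the two people with the condition carry small weights, so the weighted estimate is much lower than the unweighted one.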

Finding the Data

Data Examples on CDC Wonder

Picking the Right Tools

  • High-speed Internet access is helpful
    - Many datasets can be ordered on CDs
  • Analytic Software
    - SAS and SUDAAN are the standards
    - SPSS and STATA can also be used

Some Examples

Pulmonary Function and Lung Cancer

  • Research Question — Are people with low lung function more likely to develop lung cancer, even after adjusting for smoking duration and intensity?
  • Database needed — Longitudinal cohort

NHANES I

  • Nationally representative survey conducted by the National Center for Health Statistics
  • Consisted of extensive questionnaire, physical examination and laboratory testing
  • 14,407 adult participants (25-74 years old)
    - 6,913 adults participated in cardiorespiratory survey (also nationally representative), which included pulmonary function testing

NHANES I - Spirometry

  • Spirometry done using Ohio Medical Instruments 800 spirometer
  • 5,542 of the subjects who participated in the cardiorespiratory exam completed spirometry
    - The 1,371 subjects who did not complete spirometry were more likely to be > 60 years old (38% vs. 23%, p < 0.01) and of nonwhite race (22% vs. 12%, p < 0.01) than subjects in the final cohort
    - 140 subjects with cancer at baseline were also excluded, resulting in 5,402 subjects in the final cohort
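
The exclusions above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch; the counts come from the slides and the variable names are illustrative.

```python
# Counts taken from the slides.
cardiorespiratory_survey = 6913   # adults in the cardiorespiratory survey
no_spirometry = 1371              # did not complete spirometry
cancer_at_baseline = 140          # excluded for cancer at baseline

completed_spirometry = cardiorespiratory_survey - no_spirometry
final_cohort = completed_spirometry - cancer_at_baseline

print(completed_spirometry, final_cohort)
```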

Covariates Used in Analysis

  • Age
  • Race
  • Sex
  • Education level (≤ 12 or ≥ 13 years)
  • Smoking status (Current, Former, Never)
  • Pack years of smoking (< 30, 30-59, ≥ 60)
  • Years since quitting smoking (0, < 10, ≥ 10)
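
A sketch of how continuous values might be grouped into covariate categories like these. The cut points assume the categories are at most 12 vs. at least 13 years of education; under 30, 30 to 59, and 60 or more pack years; and 0, under 10, and 10 or more years since quitting. The function names and labels are hypothetical.

```python
def education_category(years_of_school):
    """Two-level education covariate (assumed cut point at 12 years)."""
    return "<=12" if years_of_school <= 12 else ">=13"


def pack_year_category(pack_years):
    """Three-level smoking intensity covariate (assumed cut points 30 and 60)."""
    if pack_years < 30:
        return "<30"
    if pack_years < 60:
        return "30-59"
    return ">=60"


def years_since_quit_category(years):
    """Three-level covariate; 0 is kept as its own category."""
    if years == 0:
        return "0"
    if years < 10:
        return "<10"
    return ">=10"


print(education_category(12), pack_year_category(45), years_since_quit_category(3))
```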

Unadjusted Kaplan-Meier Curve for Incident Lung Cancer
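
A minimal sketch of the product-limit (Kaplan-Meier) estimate that underlies a curve of this kind, using made-up follow-up times rather than NHANES I data. The function name and the toy values are illustrative assumptions; it ignores survey weights, and a real analysis would normally use an established statistical package.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the outcome occurred at that time, 0 if censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    exit_counts = Counter(times)  # events and censorings leaving the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve


# Hypothetical follow-up (years) and outcome indicators for seven subjects.
toy_times = [2, 3, 3, 5, 8, 8, 10]
toy_events = [1, 1, 0, 1, 0, 1, 0]

curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```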


Cirrhosis and Sepsis

  • Research Question — Are patients with cirrhosis more likely to develop sepsis?
    - Database needed — large cohort
  • Modified Research Question — Are patients hospitalized with cirrhosis more likely than all hospitalized patients to have sepsis and complications?
    - Database needed — Hospitalization sample

National Hospital Discharge Survey

  • Contains data from about 300,000 hospital discharges annually
  • About 1.5 million records over 1995-1999
    - Weighted to reflect the 175 million discharges in the US over 1995-1999
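
A quick arithmetic sketch of what the weighting implies, using only the approximate figures on this slide. The variable names are illustrative, and the average is just a rough guide: individual record weights in the actual file vary.

```python
annual_records = 300_000            # approximate sampled discharges per year
years = 5                           # 1995 through 1999
national_discharges = 175_000_000   # approximate weighted total, 1995-1999

sample_records = annual_records * years
average_weight = national_discharges / sample_records

print(f"{sample_records:,} records, each representing about "
      f"{average_weight:.0f} discharges on average")
```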

Cirrhosis and Sepsis — Analytic Plan

  • Search for the following diagnoses in all coded fields
    - Sepsis (ICD-9 038.0-038.9)
    - Cirrhosis (ICD-9 571)
    - Respiratory Failure (ICD-9 518.82, 518.5)
  • Compare proportions of death, sepsis or respiratory failure in cirrhosis-related hospitalizations with all hospitalizations
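
The plan above can be sketched roughly as follows. Everything here is a hypothetical illustration: the record layout, the field names (`weight`, `died`, `dx`), and the toy records are invented, and the codes are assumed to be stored as strings with decimal points, with sepsis under ICD-9-CM category 038. The real file layout should be taken from the survey documentation.

```python
RESP_FAILURE_CODES = {"518.82", "518.5"}


def flag_diagnoses(dx_codes):
    """Search every coded diagnosis field for the conditions of interest."""
    return {
        "sepsis": any(c.startswith("038") for c in dx_codes),
        "cirrhosis": any(c.startswith("571") for c in dx_codes),
        "resp_failure": any(c in RESP_FAILURE_CODES for c in dx_codes),
    }


def weighted_share(records, outcome):
    """Weighted proportion of discharges for which outcome(record) is true."""
    total = sum(r["weight"] for r in records)
    hits = sum(r["weight"] for r in records if outcome(r))
    return hits / total


# Hypothetical discharge records (not survey data).
records = [
    {"weight": 100, "died": False, "dx": ["571.2", "038.9"]},
    {"weight": 150, "died": True, "dx": ["571.5", "518.82"]},
    {"weight": 120, "died": False, "dx": ["410.9"]},
    {"weight": 130, "died": False, "dx": ["486", "038.0"]},
    {"weight": 100, "died": False, "dx": ["571.2"]},
]

cirrhosis_records = [r for r in records if flag_diagnoses(r["dx"])["cirrhosis"]]


def has_sepsis(record):
    return flag_diagnoses(record["dx"])["sepsis"]


sepsis_in_cirrhosis = weighted_share(cirrhosis_records, has_sepsis)
sepsis_in_all = weighted_share(records, has_sepsis)
death_in_cirrhosis = weighted_share(cirrhosis_records, lambda r: r["died"])

print(f"sepsis: {sepsis_in_cirrhosis:.3f} (cirrhosis) vs {sepsis_in_all:.3f} (all)")
```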

Cirrhosis and Sepsis — Results

  • 1.7 million cirrhosis-related hospitalizations over 1995-1999
    - 128,000 deaths (7.5%)
    - 78,000 with sepsis (4.6%)
    - 81,000 with respiratory failure (4.7%)
  • Age-adjusted risk ratios
    - Death 2.7 (95% C.I. 2.3, 3.1)
    - Sepsis 2.6 (95% C.I. 1.9, 3.3)
    - Respiratory Failure 1.4 (95% C.I. 1.1, 1.8)
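
For readers unfamiliar with risk ratios, here is a minimal sketch of a crude (unadjusted, unweighted) risk ratio with a log-scale 95% confidence interval. It is not the age-adjusted, survey-weighted method behind the figures above, and the counts in the example are hypothetical, chosen only to show the calculation.

```python
import math


def crude_risk_ratio(exposed_events, exposed_total,
                     comparison_events, comparison_total, z=1.96):
    """Risk ratio and a log-scale confidence interval from simple counts."""
    risk_exposed = exposed_events / exposed_total
    risk_comparison = comparison_events / comparison_total
    rr = risk_exposed / risk_comparison
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(1 / exposed_events - 1 / exposed_total
                          + 1 / comparison_events - 1 / comparison_total)
    lower = rr * math.exp(-z * se_log_rr)
    upper = rr * math.exp(z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 30 of 400 exposed vs. 60 of 2,000 in the comparison group.
rr, lower, upper = crude_risk_ratio(30, 400, 60, 2000)
print(f"RR={rr:.2f} (95% C.I. {lower:.2f}, {upper:.2f})")
```

An interval that excludes 1.0, as in this toy example, indicates a difference unlikely to be due to chance alone under the sketch's simple assumptions.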