# imports
import pandas as pd

# establish informative names for files via a dictionary
c2_dict = {
    "L06NB_C.xpt": "folate_b12",
    "L06MH_C.xpt": "hcy_mma",
    "L40_C.xpt": "alt_ast_ggt_glucose",
    "BMX_C.xpt": "bmi",
    "DEMO_C.xpt": "demo",
    "DIQ_C.xpt": "diabetes",
    "L11_C.xpt": "crp",
    "L06TFR_C.xpt": "iron"
}

# loop for c2 conversion and export
for raw_file in c2_dict:
    df = pd.read_sas("../data/raw_data/" + raw_file, format="xport")
    df.to_csv("../data/clean_data/c2_" + c2_dict[raw_file] + ".csv", index=False)

2.1: Data Preprocessing
2.1.0: Database Preprocessing Overview
Preprocessing for Part 2 was conducted in a staged pipeline built primarily in PostgreSQL, prior to downstream modeling in Python and visualization in R.
The NHANES data were obtained as multiple files in SAS transport (.XPT) format, which required an additional conversion step to CSV before import. After conversion, all data preprocessing, cleaning, organization, and extraction were performed in PostgreSQL using a multi-schema approach for reproducibility and transparency.
The overall Part 2 data pipeline follows:
NHANES (.XPT) → Python conversion → CSV → PostgreSQL (raw → staging → final) → Python (ML) → R (visualization)
It is followed by discussion-based integration of both project components.
2.1.1: Data Source & Acquisition
The data used for this portion of the project come from the 2001–2002 and 2003–2004 cohorts of the National Health and Nutrition Examination Survey (NHANES), conducted by the CDC.
Only two cohorts were selected due to data limitations: together, they represent the only continuous, publicly available NHANES window containing a complete, harmonized one-carbon biomarker panel (homocysteine, serum/RBC folate, B12, MMA, ferritin) alongside the liver, metabolic, and inflammatory markers necessary for the proposed integrative analysis. Note that although betaine was identified in section 1.3.2.2 as a biomarker with mechanistically relevant links to both C677T and hepatic function, it could not be included in this portion of the analysis due to missingness within the NHANES cohorts.
The following files were identified as necessary for usable dataset construction. Each file contains at least one biological or survey component of interest for the analysis, and the files required integration with one another to form the complete dataset.
Files needed per cohort:
2001–2002:
DEMO_B.XPT (demographics, sex, age, weight)
BMX_B.XPT (BMI)
DIQ_B.XPT (diabetes)
L06_2_B.XPT (serum and RBC folate, B12, homocysteine, MMA, ferritin)
L40_B.XPT (ALT, AST, GGT, glucose)
L11_B.XPT (CRP)
2003–2004:
DEMO_C.XPT (demographics, sex, age, weight)
BMX_C.XPT (BMI)
DIQ_C.XPT (diabetes)
L06NB_C.XPT (serum and RBC folate, B12)
L06MH_C.XPT (homocysteine, MMA)
L40_C.XPT (ALT, AST, GGT, glucose)
L11_C.XPT (CRP)
L06TFR_C.XPT (ferritin)
Files for fasting-dependent variables were not included, as they weren’t necessary for this analysis.
2.1.2: File Conversion & Ingestion Pipeline
NHANES data files are stored online in SAS transport (.XPT) format. pgAdmin cannot read .XPT files, so the files were downloaded locally and converted to CSV prior to database development and schema population.
A Python helper script (located at 'scripts/part2/python/xpt_to_csv.py' in the project's remote repository) was used to convert all files and write the outputs as renamed CSVs in a clean_data directory.
An example of the conversion code (the cohort 2 loop) is included above.
2.1.3: Database Architecture
This project’s database infrastructure used a three-schema approach:
- raw schema: for importing & converting data
- staging schema: for preprocessing & joining
- final schema: for final table creation & export
This structure allows for clear distinction between unprocessed data, intermediate adjustments, and analysis-ready outputs.
CREATE SCHEMA IF NOT EXISTS raw;
CREATE SCHEMA IF NOT EXISTS staging;
CREATE SCHEMA IF NOT EXISTS final;

2.1.4: Raw Table Creation & Loading
The full preprocessing pipeline for data structuring, found in the project repository, includes the files:
create_raw_tables.sql
raw_load_copy_c1.sql
raw_load_copy_c2.sql
These scripts create an empty table for each imported NHANES CSV file. All variables were initially read in as raw text in order to avoid import and/or missingness errors. After the tables were created, each CSV file was loaded into the raw schema's new, empty tables using COPY commands.
-- c2_crp table set up
DROP TABLE IF EXISTS raw.c2_crp;
CREATE TABLE raw.c2_crp (
SEQN TEXT,
LBXCRP TEXT,
LBXBAP TEXT,
LBXPT21 TEXT
);
-- c2 crp table population
COPY raw.c2_crp
FROM '/Users/coletterouiller/capstone/data/clean_data/c2_crp.csv'
WITH (FORMAT CSV, HEADER true);

2.1.5: Cleaning & Staging
Cleaning and preprocessing were performed in the staging schema using the files:
staging_c1.sql
staging_c2.sql
Issue 1: Cohort Size Collapse
The initial organization plan was to start with the complete overlap of biomarkers and then join additional necessary or related variables through LEFT JOINs in order to retain maximum observation counts. However, initial efforts to define cohorts using complete biomarker overlap resulted in significant sample-size reduction: the full biomarker panels collapsed to n = 551 and roughly 9,000 for cohorts 1 and 2, respectively. To avoid this dramatic collapse, the cohort construction approach was revised to anchor the dataset on the liver/metabolic file. A primary driver of this decision was the project's emphasis on population-level patterns and overall trends in liver disease, and the hepatic file (c1_alt_ast_ggt_glucose) contained roughly 15,000 records. Additionally, a small sub-analysis of the available complete metabolic panel (n = 551) was retained in order to account for the chosen markers without defining the entire cohort at such a restrictive scale.
-- part 2: FLD, so anchor on liver biomarkers
CREATE TABLE c1_clean AS
SELECT
CAST(CAST(c1_alt_ast_ggt_glucose.SEQN AS numeric) AS integer) AS seqn,
-- [full column list continues — see staging-casting chunk below for additional examples]
FROM raw.c1_alt_ast_ggt_glucose
-- now that vars are retrieved, left join on SEQN
LEFT JOIN raw.c1_demo
USING (SEQN)
LEFT JOIN raw.c1_bmi
USING (SEQN)
LEFT JOIN raw.c1_diabetes
USING (SEQN)
LEFT JOIN raw.c1_crp
USING (SEQN)
LEFT JOIN raw.c1_folate_b12_hcy_mma_iron
USING (SEQN);

Issue 2: Type Casting
During the .XPT-to-CSV conversion, pandas automatically converted integer variables to floats regardless of downstream formatting specifications. A nested recasting approach was used to correct this: all affected variables were first cast from text to numeric, and then, where appropriate for the variable's data type, recast from numeric to integer. This ensured correct data-type assignment for downstream joins and analysis.
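The float coercion comes from how pandas handles missing values: the default int64 dtype has no missing-value representation, so any integer column containing a missing entry is upcast to float64. A minimal demonstration with illustrative data:

```python
import pandas as pd

# integer-coded NHANES-style column with one missing response:
# int64 cannot hold NaN, so pandas silently upcasts the whole column
codes = pd.Series([1, 2, None])
print(codes.dtype)    # float64
print(codes.tolist())  # [1.0, 2.0, nan]
```

This is why coded variables arrive in the CSVs as 1.0, 2.0, and so on, and why the staging layer recasts them back to integers.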
-- set path
SET search_path TO staging;
-- create table
CREATE TABLE c1_clean AS
SELECT
CAST(CAST(c1_alt_ast_ggt_glucose.SEQN AS numeric) AS integer) AS seqn,
-- demographics vars to include
CAST(CAST(NULLIF(c1_demo.RIDAGEYR, '') AS numeric) AS integer) AS age_year,
CAST(CAST(NULLIF(c1_demo.RIAGENDR, '') AS numeric) AS integer) AS sex,
CAST(CAST(NULLIF(c1_demo.RIDRETH1, '') AS numeric) AS integer) AS race_ethnicity,
CAST(NULLIF(c1_demo.INDFMPIR, '') AS numeric) AS income_ratio,
-- demo weights to include
CAST(NULLIF(c1_demo.WTMEC2YR, '') AS numeric) AS exam_weight,
CAST(CAST(NULLIF(c1_demo.SDMVPSU, '') AS numeric) AS integer) AS survey_psu,
CAST(CAST(NULLIF(c1_demo.SDMVSTRA, '') AS numeric) AS integer) AS survey_strata
-- [column list and FROM/JOIN clauses continue as in the anchoring query above]

Issue 3: Duplication
During standard data checks and manual inspection, duplicate SEQN values were observed. This initially appeared to indicate a join duplication error of some kind and required follow-up investigation to ensure data integrity. Debugging revealed that multiple raw tables contained duplicate rows due to import issues.
--looking for duplication after manually catching a potential duplication event at raw layer
SELECT
CAST(CAST(seqn AS numeric) AS integer) AS seqn_clean,
COUNT(*) AS n_rows
FROM raw.c2_diabetes
GROUP BY seqn_clean
HAVING COUNT(*) > 1
ORDER BY n_rows DESC, seqn_clean
LIMIT 20;
--looking for duplication after manually catching a potential duplication event at staging layer
SELECT seqn, COUNT(*) AS n_rows
FROM staging.c1_clean
GROUP BY seqn
HAVING COUNT(*) > 1
ORDER BY n_rows DESC, seqn
LIMIT 20;

This was resolved by identifying the affected raw tables, correcting the duplication at the raw layer, and then rebuilding and repopulating the affected tables.
Final verification before cohort merging confirmed an absence of duplicates in raw and staging tables, as well as a single row per individual in final outputs.
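The same duplicate check can also be reproduced in Python against the exported CSVs before loading. A minimal sketch with illustrative rows (in practice the frame would be read from one of the project's clean_data CSVs):

```python
import pandas as pd

# illustrative stand-in for a raw-layer CSV; seqn 2 is duplicated
df = pd.DataFrame({"seqn": [1, 2, 2, 3], "diabetes": [2, 1, 1, 2]})

# keep=False flags every copy of a duplicated seqn, not just the repeats
dupes = df[df.duplicated("seqn", keep=False)]
print(len(dupes))  # 2
```

This mirrors the GROUP BY ... HAVING COUNT(*) > 1 pattern used in SQL and can serve as a pre-load sanity check.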
2.1.6: Cohort Merging
Cohorts were merged using UNION ALL rather than UNION: since each participant appears in only one cohort, the rows could simply be stacked without an unnecessary deduplication pass. These operations were performed across all relevant component tables, with the final cohort structure prioritizing maximal participant inclusion, preservation of missingness, and flexibility for downstream filtering in Python and R.
Missing biomarker values were retained and addressed during analysis rather than being removed during preprocessing, given the uneven biomarker coverage established across cohorts.
-- use UNION ALL, not UNION: stacking rows, not de-duplicating
SELECT seqn, 'c1_2001_2002' AS cohort, age_year, sex, race_ethnicity,
ast, alt, ggt, glucose, homocysteine, rbc_folate, serum_folate,
vit_b12, mm_acid, ferritin, crp, bmi, diabetes
FROM staging.c1_clean
UNION ALL
SELECT seqn, 'c2_2003_2004' AS cohort, age_year, sex, race_ethnicity,
ast, alt, ggt, glucose, homocysteine, rbc_folate, serum_folate,
vit_b12, mm_acid, ferritin, crp, bmi, diabetes
FROM staging.c2_clean;

2.1.7: Final Analysis-Ready Export
The final schema was used to construct the analysis-ready table for the ML modeling portion that follows.
Key transformations include:
- collapsing all joined tables into a single dataset
- deriving new variables:
  - diabetes_binary (0 = no, 1 = yes)
  - a cohort indicator
  - ast_alt_ratio, set to NULL if either value is missing or ALT is zero
-- derived variables created at final schema stage
CASE
WHEN diabetes = 1 THEN 1
WHEN diabetes = 2 THEN 0
ELSE NULL
END AS diabetes_binary,
'c1_2001_2002' AS cohort,
CASE
WHEN ast IS NOT NULL AND alt IS NOT NULL
THEN ROUND(ast / NULLIF(alt, 0), 4)
ELSE NULL
END AS ast_alt_ratio

Final QC checks confirmed that there was only one row per participant and that coverage counts were consistent and plausible, with expected variation in biomarker availability across cohorts.
--QC check - c1
SELECT COUNT(*) AS total_rows,
COUNT(DISTINCT seqn) AS distinct_seqn
FROM staging.c1_clean;
--QC check - c2
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT seqn) AS distinct_seqn
FROM staging.c2_clean;
--biomarker coverage check -c1
SELECT
COUNT(*) AS total_rows,
COUNT(ast) AS n_ast,
COUNT(alt) AS n_alt,
COUNT(ggt) AS n_ggt,
COUNT(glucose) AS n_glucose,
COUNT(rbc_folate) AS n_rbc_folate,
COUNT(vit_b12) AS n_vit_b12,
COUNT(serum_folate) AS n_serum_folate,
COUNT(homocysteine) AS n_homocysteine,
COUNT(mm_acid) AS n_mma,
COUNT(ferritin) AS n_ferritin,
COUNT(crp) AS n_crp,
COUNT(bmi) AS n_bmi,
COUNT(diabetes) AS n_diabetes
FROM staging.c1_clean;
--biomarker coverage check - c2
SELECT
COUNT(*) AS total_rows,
COUNT(ast) AS n_ast,
COUNT(alt) AS n_alt,
COUNT(ggt) AS n_ggt,
COUNT(glucose) AS n_glucose,
COUNT(rbc_folate) AS n_rbc_folate,
COUNT(vit_b12) AS n_vit_b12,
COUNT(serum_folate) AS n_serum_folate,
COUNT(homocysteine) AS n_homocysteine,
COUNT(mm_acid) AS n_mma,
COUNT(ferritin) AS n_ferritin,
COUNT(crp) AS n_crp,
COUNT(bmi) AS n_bmi,
COUNT(diabetes) AS n_diabetes
FROM staging.c2_clean;

It is worth noting that cohort 2's ferritin showed lower coverage than the other biomarkers, which may be attributable to ferritin being stored in a separate source file (L06TFR_C.XPT), unlike in cohort 1. Differences in biomarker availability across cohorts were retained and accounted for in downstream analysis.
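Once the final table is exported, the same coverage check can be reproduced in Python using per-column non-null counts. A sketch with illustrative data (the real frame would be read from the project's exported final CSV, whose path and column set come from the pipeline above):

```python
import pandas as pd

# illustrative stand-in for the exported analysis-ready table
df = pd.DataFrame({
    "seqn": [1, 2, 3, 4],
    "ferritin": [30.0, None, 55.0, None],
    "crp": [0.2, 0.4, None, 0.1],
})

# non-null count per column, analogous to SQL's COUNT(col)
coverage = df.count()
print(coverage["ferritin"], coverage["crp"])  # 2 3
```

Comparing these counts against the SQL coverage checks provides a quick end-to-end verification that the export preserved missingness as intended.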