# imports
import pandas as pd

# establish informative names for files via a dictionary
c2_dict = {
    "L06NB_C.xpt": "folate_b12",
    "L06MH_C.xpt": "hcy_mma",
    "L40_C.xpt": "alt_ast_ggt_glucose",
    "BMX_C.xpt": "bmi",
    "DEMO_C.xpt": "demo",
    "DIQ_C.xpt": "diabetes",
    "L11_C.xpt": "crp",
    "L06TFR_C.xpt": "iron"
}

# loop for c2 conversion and export
for raw_file in c2_dict:
    df = pd.read_sas("../data/raw_data/" + raw_file, format="xport")
    df.to_csv("../data/clean_data/c2_" + c2_dict[raw_file] + ".csv", index=False)

2.1: Data Preprocessing
2.1.0: Database Preprocessing Overview
Preprocessing for Part 2 was conducted in a staged pipeline built primarily in PostgreSQL, prior to downstream modeling in Python and visualization in R.
The NHANES data were obtained as multiple files in SAS transport (.XPT) format, which required an additional conversion step to CSV before import. After conversion, all data preprocessing, cleaning, organization, and extraction were performed in PostgreSQL using a multi-schema approach for reproducibility and transparency.
The overall Part 2 data pipeline follows:
NHANES (.XPT) → Python conversion → CSV → PostgreSQL (raw → staging → final) → Python (ML) → R (visualization)
It is followed by discussion-based integration of both project components.
2.1.1: Data Source & Acquisition
The data used for this portion of the project come from the 2001–2002 and 2003–2004 cohorts of the National Health and Nutrition Examination Survey (NHANES), conducted by the CDC.
Only two cohorts were selected due to data limitations: together, they represent the only continuous, publicly available NHANES window containing a complete, harmonized one-carbon biomarker panel (homocysteine, serum/RBC folate, B12, MMA, ferritin) alongside the liver, metabolic, and inflammatory markers necessary for the proposed integrative analysis. Note that although betaine was identified in section 1.3.2.2 as a biomarker with mechanistically relevant links to both C677T and hepatic function, it could not be included in this portion of the analysis due to missingness within the NHANES cohorts.
The following files were identified as necessary for usable dataset construction. Each file contains at least one biological or survey component of interest for the analysis, and the files required integration with one another to form the complete dataset.
Files needed per cohort:
2001–2002:
DEMO_B.XPT (demographics, sex, age, weight)
BMX_B.XPT (BMI)
DIQ_B.XPT (diabetes)
L06_2_B.XPT (serum and RBC folate, B12, homocysteine, MMA, ferritin)
L40_B.XPT (ALT, AST, GGT, glucose)
L11_B.XPT (CRP)
2003–2004:
DEMO_C.XPT (demographics, sex, age, weight)
BMX_C.XPT (BMI)
DIQ_C.XPT (diabetes)
L06NB_C.XPT (serum and RBC folate, B12)
L06MH_C.XPT (homocysteine, MMA)
L40_C.XPT (ALT, AST, GGT, glucose)
L11_C.XPT (CRP)
L06TFR_C.XPT (ferritin)
Files for fasting-dependent variables were not included, as they weren’t necessary for this analysis.
2.1.2: File Conversion & Ingestion Pipeline
NHANES data files are stored online in SAS transport (.XPT) format. pgAdmin cannot read .XPT files, so the files were downloaded locally and converted to CSV prior to database development and schema population.
A Python helper script (located at 'scripts/part2/python/xpt_to_csv.py' in the project's remote repository) was used to convert all files and write the outputs as renamed CSVs in a clean_data directory.
An example of the conversion code (the cohort 2 loop) is included above.
2.1.3: Database Architecture
This project’s database infrastructure used a three-schema approach:
- raw schema: for importing & converting data
- staging schema: for preprocessing & joining
- final schema: for final table creation & export
This structure allows for clear distinction between unprocessed data, intermediate adjustments, and analysis-ready outputs.
CREATE SCHEMA IF NOT EXISTS raw;
CREATE SCHEMA IF NOT EXISTS staging;
CREATE SCHEMA IF NOT EXISTS final;

2.1.4: Raw Table Creation & Loading
The full preprocessing pipeline for data structuring, found in the project repository, includes the files:
create_raw_tables.sql
raw_load_copy_c1.sql
raw_load_copy_c2.sql
These scripts create an empty table for each imported NHANES CSV file. All variables were initially read in as raw text in order to avoid import and/or missingness errors. After the tables were created, each CSV file was loaded into the raw schema's new, empty tables using COPY commands.
-- c2_crp table set up
DROP TABLE IF EXISTS raw.c2_crp;
CREATE TABLE raw.c2_crp (
SEQN TEXT,
LBXCRP TEXT,
LBXBAP TEXT,
LBXPT21 TEXT
);
-- c2 crp table population
COPY raw.c2_crp
FROM '/Users/coletterouiller/capstone/data/clean_data/c2_crp.csv'
WITH (FORMAT CSV, HEADER true);

2.1.5: Cleaning & Staging
Cleaning and preprocessing were performed in the staging schema using the files:
staging_c1.sql
staging_c2.sql
Issue 1: Cohort Size Collapse
The initial organization plan was to start with the complete overlap of biomarkers and then join additional necessary or related variables through LEFT JOINs in order to retain maximum observation counts. However, initial efforts to define cohorts using complete biomarker overlap resulted in significant sample-size reduction: the full biomarker panels collapsed to n = 551 and roughly 9,000 for cohorts 1 and 2, respectively. To avoid this dramatic collapse, the cohort construction approach was revised to anchor the dataset on the liver/metabolic file. A primary driver of this decision was the project's emphasis on population-level patterns and overall trends in liver disease, and the hepatic file (c1_alt_ast_ggt_glucose) contained roughly 15,000 records. Additionally, a small sub-analysis of the available complete metabolic panel (n = 551) was retained in order to account for the chosen markers without defining the entire cohort at such a restrictive scale.
-- part 2: FLD, so anchor on liver biomarkers
CREATE TABLE c1_clean AS
SELECT
CAST(CAST(c1_alt_ast_ggt_glucose.SEQN AS numeric) AS integer) AS seqn,
-- [full column list continues — see staging-casting chunk below for additional examples]
FROM raw.c1_alt_ast_ggt_glucose
-- now that vars are retrieved, left join on SEQN
LEFT JOIN raw.c1_demo
USING (SEQN)
LEFT JOIN raw.c1_bmi
USING (SEQN)
LEFT JOIN raw.c1_diabetes
USING (SEQN)
LEFT JOIN raw.c1_crp
USING (SEQN)
LEFT JOIN raw.c1_folate_b12_hcy_mma_iron
USING (SEQN);

Issue 2: Type Casting
During the .XPT-to-CSV conversion, pandas automatically converted integer variables to floats regardless of downstream formatting specifications. A nested recasting approach was used to correct this: all affected variables were first cast from text to numeric, and then, where appropriate for the variable's data type, recast from numeric to integer. This ensured correct data-type assignment for downstream joins and analysis.
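The float coercion comes from how pandas handles missing values: the default int64 dtype has no missing-value representation, so any integer column containing a missing entry is upcast to float64. A minimal demonstration with illustrative data:

```python
import pandas as pd

# integer-coded NHANES-style column with one missing response:
# int64 cannot hold NaN, so pandas silently upcasts the whole column
codes = pd.Series([1, 2, None])
print(codes.dtype)    # float64
print(codes.tolist())  # [1.0, 2.0, nan]
```

This is why coded variables arrive in the CSVs as 1.0, 2.0, and so on, and why the staging layer recasts them back to integers.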
-- set path
SET search_path TO staging;
-- create table
CREATE TABLE c1_clean AS
SELECT
CAST(CAST(c1_alt_ast_ggt_glucose.SEQN AS numeric) AS integer) AS seqn,
-- demographics vars to include
CAST(CAST(NULLIF(c1_demo.RIDAGEYR, '') AS numeric) AS integer) AS age_year,
CAST(CAST(NULLIF(c1_demo.RIAGENDR, '') AS numeric) AS integer) AS sex,
CAST(CAST(NULLIF(c1_demo.RIDRETH1, '') AS numeric) AS integer) AS race_ethnicity,
CAST(NULLIF(c1_demo.INDFMPIR, '') AS numeric) AS income_ratio,
-- demo weights to include
CAST(NULLIF(c1_demo.WTMEC2YR, '') AS numeric) AS exam_weight,
CAST(CAST(NULLIF(c1_demo.SDMVPSU, '') AS numeric) AS integer) AS survey_psu,
CAST(CAST(NULLIF(c1_demo.SDMVSTRA, '') AS numeric) AS integer) AS survey_strata
-- [column list and FROM/JOIN clauses continue as in the anchoring query above]

Issue 3: Duplication
During standard data checks and manual inspection, duplicate SEQN values were observed. This initially appeared to indicate a join duplication error of some kind and required follow-up investigation to ensure data integrity. Debugging revealed that multiple raw tables contained duplicate rows due to import issues.
--looking for duplication after manually catching a potential duplication event at raw layer
SELECT
CAST(CAST(seqn AS numeric) AS integer) AS seqn_clean,
COUNT(*) AS n_rows
FROM raw.c2_diabetes
GROUP BY seqn_clean
HAVING COUNT(*) > 1
ORDER BY n_rows DESC, seqn_clean
LIMIT 20;
--looking for duplication after manually catching a potential duplication event at staging layer
SELECT seqn, COUNT(*) AS n_rows
FROM staging.c1_clean
GROUP BY seqn
HAVING COUNT(*) > 1
ORDER BY n_rows DESC, seqn
LIMIT 20;

This was resolved by identifying the affected raw tables, correcting the duplication at the raw layer, and then rebuilding and repopulating the affected tables.
Final verification before cohort merging confirmed an absence of duplicates in raw and staging tables, as well as a single row per individual in final outputs.
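The same duplicate check can also be reproduced in Python against the exported CSVs before loading. A minimal sketch with illustrative rows (in practice the frame would be read from one of the project's clean_data CSVs):

```python
import pandas as pd

# illustrative stand-in for a raw-layer CSV; seqn 2 is duplicated
df = pd.DataFrame({"seqn": [1, 2, 2, 3], "diabetes": [2, 1, 1, 2]})

# keep=False flags every copy of a duplicated seqn, not just the repeats
dupes = df[df.duplicated("seqn", keep=False)]
print(len(dupes))  # 2
```

This mirrors the GROUP BY ... HAVING COUNT(*) > 1 pattern used in SQL and can serve as a pre-load sanity check.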
2.1.6: Cohort Merging
Cohorts were merged using UNION ALL rather than UNION: since each participant appears in only one cohort, the rows could simply be stacked without an unnecessary deduplication pass. These operations were performed across all relevant component tables, with the final cohort structure prioritizing maximal participant inclusion, preservation of missingness, and flexibility for downstream filtering in Python and R.
Missing biomarker values were retained and addressed during analysis rather than being removed during preprocessing, given the uneven biomarker coverage established across cohorts.
-- use UNION ALL, not UNION: stacking rows, not de-duplicating
SELECT seqn, 'c1_2001_2002' AS cohort, age_year, sex, race_ethnicity,
ast, alt, ggt, glucose, homocysteine, rbc_folate, serum_folate,
vit_b12, mm_acid, ferritin, crp, bmi, diabetes
FROM staging.c1_clean
UNION ALL
SELECT seqn, 'c2_2003_2004' AS cohort, age_year, sex, race_ethnicity,
ast, alt, ggt, glucose, homocysteine, rbc_folate, serum_folate,
vit_b12, mm_acid, ferritin, crp, bmi, diabetes
FROM staging.c2_clean;

2.1.7: Final Analysis-Ready Export
The final schema was used to construct the analysis-ready table for the ML modeling portion that follows.
Key transformations include:
- collapsing all joined tables into a single dataset
- deriving new variables:
  - diabetes_binary (0 = no, 1 = yes)
  - a cohort indicator
  - ast_alt_ratio, set to NULL if either value is missing or ALT is zero
-- derived variables created at final schema stage
CASE
WHEN diabetes = 1 THEN 1
WHEN diabetes = 2 THEN 0
ELSE NULL
END AS diabetes_binary,
'c1_2001_2002' AS cohort,
CASE
WHEN ast IS NOT NULL AND alt IS NOT NULL
THEN ROUND(ast / NULLIF(alt, 0), 4)
ELSE NULL
END AS ast_alt_ratio

Final QC checks confirmed that there was only one row per participant and that coverage counts were consistent and plausible, with expected variation in biomarker availability across cohorts.
--QC check - c1
SELECT COUNT(*) AS total_rows,
COUNT(DISTINCT seqn) AS distinct_seqn
FROM staging.c1_clean;
--QC check - c2
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT seqn) AS distinct_seqn
FROM staging.c2_clean;
--biomarker coverage check -c1
SELECT
COUNT(*) AS total_rows,
COUNT(ast) AS n_ast,
COUNT(alt) AS n_alt,
COUNT(ggt) AS n_ggt,
COUNT(glucose) AS n_glucose,
COUNT(rbc_folate) AS n_rbc_folate,
COUNT(vit_b12) AS n_vit_b12,
COUNT(serum_folate) AS n_serum_folate,
COUNT(homocysteine) AS n_homocysteine,
COUNT(mm_acid) AS n_mma,
COUNT(ferritin) AS n_ferritin,
COUNT(crp) AS n_crp,
COUNT(bmi) AS n_bmi,
COUNT(diabetes) AS n_diabetes
FROM staging.c1_clean;
--biomarker coverage check - c2
SELECT
COUNT(*) AS total_rows,
COUNT(ast) AS n_ast,
COUNT(alt) AS n_alt,
COUNT(ggt) AS n_ggt,
COUNT(glucose) AS n_glucose,
COUNT(rbc_folate) AS n_rbc_folate,
COUNT(vit_b12) AS n_vit_b12,
COUNT(serum_folate) AS n_serum_folate,
COUNT(homocysteine) AS n_homocysteine,
COUNT(mm_acid) AS n_mma,
COUNT(ferritin) AS n_ferritin,
COUNT(crp) AS n_crp,
COUNT(bmi) AS n_bmi,
COUNT(diabetes) AS n_diabetes
FROM staging.c2_clean;

It is worth noting that cohort 2's ferritin showed lower coverage than the other biomarkers, which may be attributable to ferritin being stored in a separate source file (L06TFR_C.XPT), unlike in cohort 1. Differences in biomarker availability across cohorts were retained and accounted for in downstream analysis.
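Once the final table is exported, the same coverage check can be reproduced in Python using per-column non-null counts. A sketch with illustrative data (the real frame would be read from the project's exported final CSV, whose path and column set come from the pipeline above):

```python
import pandas as pd

# illustrative stand-in for the exported analysis-ready table
df = pd.DataFrame({
    "seqn": [1, 2, 3, 4],
    "ferritin": [30.0, None, 55.0, None],
    "crp": [0.2, 0.4, None, 0.1],
})

# non-null count per column, analogous to SQL's COUNT(col)
coverage = df.count()
print(coverage["ferritin"], coverage["crp"])  # 2 3
```

Comparing these counts against the SQL coverage checks provides a quick end-to-end verification that the export preserved missingness as intended.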