Datasets

This page provides an overview of all datasets used in the Global Oculomics Initiative. Each entry includes a brief description, the dataset's role within the project, IRB considerations, and a single progress bar tracking its development. Milestones are updated continuously as the project evolves.

The summary table gives an at-a-glance view of every dataset; detailed information and progress tracking for each individual dataset follow below.


Summary Table

| Dataset | Category | Role in Pipeline | IRB | Size | Status |
|---|---|---|---|---|---|
| Foundational Model Pretraining Dataset | Pretraining (Open Source) | Large-scale pretraining for backbone encoder | Not required | ~4,000,000 | DONE |
| UIC Retrospective Clinical Dataset | UIC Clinical | Fine-tuning multiclass disease classifier; multimodal modeling | STUDY2024-1118 | ~100,000 | Final stages of IS |
| Auxiliary Oculoplastics Dataset | UIC Clinical | Auxiliary fine-tuning; fallback set | STUDY2024-1118 | ~1,000 | DONE |
| Historic UIC CFC Dataset | UIC Clinical (CFC) | Pretraining | STUDY2024-1118 | ~17,000 images | DONE |
| UIC CFC Dataset | UIC Clinical | Fine-tuning + validation for craniofacial tasks | STUDY2024-1118 | ~1,500 | DONE |
| GVF Dataset | UIC Clinical | Functional regression/classification task | STUDY2024-1118 | ~1,000 | DONE |
| Glorbit Global Datasets | Prospective Global | Prospective validation; epidemiology; robustness | Site-specific IRBs + DUA with UIC | Target ~1,500 across sites | Active Collection |

Individual Dataset Progress

This section contains progress tracking for each dataset being worked on. Expand the box to see the progress and next steps.

Foundational Model Pretraining Dataset

Curated from large-scale open-source face datasets (e.g., CelebA, LAION-Face, VGGFace2, FFHQ). Images are restricted to frontal views and cropped to maximize periorbital visibility. Periorbital distances are extracted using the segmentation and measurement algorithms from Georgie's PhD.

Category: Pretraining (Open Source)
IRB: Not required
Role: Primary pretraining dataset for backbone encoder
Size: ~2,000,000 expected
Personnel: Michelle, Georgie

  • Curate all candidate open-source face datasets
  • Transfer batch 1 datasets to workstation
  • Prep required datasets (UMD, Tufts, FIML)
  • Do summary stats on large dataset
  • Remove duplicate images and recount
  • Obtain subset 0
  • Build rotation toolkit
  • Create subset 1 (rotated) for all datasets
  • Develop inclusion criteria for S1
  • Sample images from imperfect datasets and pass to MF
  • Obtain % noise from all imperfect datasets on sample
  • Compute stats
  • Decide which datasets need further CNN cleaning
  • Repeat above for Batch 2 (VGGFace2 and FFHQ)
  • Regrade Neg images to create true negatives and true positives
  • Create training data for these datasets
  • Train/validate cleaners
  • Estimate error bounds for CNN deployment and establish thresholds
  • Repeat sampling and subsampling steps for Batch 2
  • Deploy on datasets to create Subset 2
  • Crop final images using log files to create Subset 3 for each dataset
  • Create Subset 3 and do summary statistics
  • Predict periorbital distances on Subset 3 to make Subset 4
  • Create Subset 5 (OS and OD)
  • Train some type of masked autoencoder (?) on Subset 5 to show that we can learn good representations (meet with Sathya)
  • Ensure it is possible and easy to go from initial downloaded dataset to final product with software tools
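
The sampling steps above (grade a sample from each imperfect dataset, obtain the % noise, then decide which datasets need further CNN cleaning) could be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the function names, the 5% noise tolerance, and the example counts are all hypothetical, and the Wilson score interval is simply one assumed way to put error bounds on a noise rate estimated from a sample.

```python
import math
from typing import Tuple


def wilson_interval(noisy: int, sampled: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the fraction of noisy images in a graded sample.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    if not 0 <= noisy <= sampled:
        raise ValueError("noisy must be between 0 and sampled")
    p = noisy / sampled
    denom = 1 + z * z / sampled
    centre = (p + z * z / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled * sampled)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def needs_cleaning(noisy: int, sampled: int, max_noise: float = 0.05) -> bool:
    """Flag a dataset when the upper bound on its noise rate exceeds the tolerance.

    max_noise is a hypothetical threshold chosen only for this example.
    """
    _, upper = wilson_interval(noisy, sampled)
    return upper > max_noise


if __name__ == "__main__":
    # Hypothetical grading results: dataset name -> (noisy images, images sampled)
    graded = {"dataset_a": (3, 500), "dataset_b": (60, 500)}
    for name, (noisy, sampled) in graded.items():
        low, high = wilson_interval(noisy, sampled)
        flag = "needs CNN cleaning" if needs_cleaning(noisy, sampled) else "ok as-is"
        print(f"{name}: noise {noisy / sampled:.1%} (95% CI {low:.1%} to {high:.1%}) -> {flag}")
```

Deciding on the upper bound rather than the point estimate is a conservative choice: a small sample with few noisy images can still be flagged if the interval is wide.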


Historic CFC Dataset

Thirty years of imaging data from the UIC craniofacial center (~17,000 images). Many patients have multiple images from different visits. Disease labels and operations are unknown. Both full-face and cropped eye images are available and preprocessed.

  • Acquire from CFC
  • Move in batches using TB storage
  • Clean locally


UIC Retrospective Clinical Dataset — UIC Clinical

Retrospective cohort of oculoplastics and ophthalmology patients across a decade of visits, containing eye images at rest, periorbital measurements, ICD codes, procedural history, demographics, and (when available) functional testing such as GVF.

Category: UIC Clinical
IRB: STUDY2024-1118 (Amendment pending)
Role: Primary fine-tuning dataset for multiclass disease classification
Size: ~100,000
Personnel: Georgie, CCTS, IS, Dr. Setabutr, Dr. Hribar

  • Establish plan with CCTS and IS for EPIC extraction and CCTS duet
  • Draft IRB amendment and Data Request Agreement (DRA)
  • Obtain stakeholder sign-off from clinical and data teams
  • Submit IRB amendment and DRA for approval
  • Initiate CCTS data pull request
  • Receive CCTS data
  • Submit EPIC data extraction request
  • Receive finalized datasets from CCTS and EPIC
  • Assemble finalized multimodal clinical dataset for training


Auxiliary Oculoplastics Dataset — UIC Clinical

Five years of Dr. Tran’s curated oculoplastic cases. Serves as a backup fine-tuning dataset in case of delays or filtering issues in the primary retrospective dataset.

Category: UIC Clinical
IRB: STUDY2024-1118
Role: Auxiliary validation + fine-tuning dataset
Size: ~1,000
Personnel: Sophia, Dr. Tran, Georgie

  • Coordinate disease categories + scope with Dr. Tran and Sophia
  • Sophia generates curated dataset from EPIC
  • Finalize dataset transfer to Georgie


UIC CFC Dataset — UIC Clinical

Existing dataset of craniofacial syndromes from prior work. Includes multiple craniofacial subtypes and paired periorbital distances.

Category: UIC Clinical
IRB: STUDY2024-1118
Role: Fine-tuning + validation for craniofacial tasks
Size: ~1,500
Personnel: Dr. Purnell, Georgie

  • Obtain access to dataset + metadata from Dr. Purnell
  • Curate and structure dataset


GVF Dataset — UIC Clinical

Images paired with Goldmann Visual Field measurements, used for functional regression and classification tasks.

Category: UIC Clinical
IRB: STUDY2024-1118
Role: Functional prediction dataset
Size: ~1,000
Personnel: Georgie, Sasha, Dr. Setabutr, Dr. Tran

  • Establish list of ptosis patients with GVF values
  • Define extracted fields (Georgie + Sasha)
  • Medical students pull and organize dataset
  • Transfer finalized dataset to Georgie
  • Baseline GVF classification + regression
  • Draft manuscript
  • Submit manuscript following completion of Sasha’s thesis


Glorbit Global Datasets — Prospective Global

Prospective, globally diverse datasets collected through the Glorbit imaging platform at sites including Ecuador and future expansions in Peru, Panama, and Africa. Used for validation, epidemiology, usability assessment, and global robustness testing.

Category: Prospective Global
IRB: Site-specific; Ecuador approved; others pending
Role: Prospective validation + global robustness
Size: Target ~1,500 across 3–5 sites
Personnel: Georgie, Sasha, Dr. Setabutr, Dr. Tran, Dr. Larrick

Progress for all international sites is tracked on the Global Deployments page.