Datasets
This page provides an overview of all datasets used in the Global Oculomics Initiative. Each dataset entry includes a brief description, its role within the project, IRB considerations, and a single progress bar tracking its development. Milestones are updated continuously as the project evolves.
The summary table gives an at-a-glance overview; detailed information and progress tracking for each dataset follow below.
Summary Table
| Dataset | Category | Role in Pipeline | IRB | Size | Status |
|---|---|---|---|---|---|
| Foundational Model Pretraining Dataset | Pretraining (Open Source) | Large-scale pretraining for backbone encoder | Not required | ~4,000,000 | DONE |
| UIC Retrospective Clinical Dataset | UIC Clinical | Fine-tuning multiclass disease classifier; multimodal modeling | STUDY2024-1118 | ~100,000 | Final stages of IS |
| Auxiliary Oculoplastics Dataset | UIC Clinical | Auxiliary fine-tuning; fallback set | STUDY2024-1118 | ~1,000 | DONE |
| Historic UIC CFC Dataset | UIC Clinical (CFC) | Pretraining | STUDY2024-1118 | ~17,000 images | DONE |
| UIC CFC Dataset | UIC Clinical | Fine-tuning + validation for craniofacial tasks | STUDY2024-1118 | ~1,500 | DONE |
| GVF Dataset | UIC Clinical | Functional regression/classification task | STUDY2024-1118 | ~1,000 | DONE |
| Glorbit Global Datasets | Prospective Global | Prospective validation; epidemiology; robustness | Site-specific IRBs + DUA with UIC | Target ~1,500 across sites | Active Collection |
Individual Dataset Progress
This section contains progress tracking for each dataset being worked on. Expand the box to see the progress and next steps.
Foundational Model Pretraining Dataset
Curated from large-scale open-source face datasets (e.g., CelebA, LAION-Face, VGGFace2, FFHQ). Images are restricted to frontal views and cropped to maximize periorbital visibility. Periorbital distances are extracted using the segmentation and measurement algorithms from Georgie's PhD.
Category: Pretraining (Open Source)
IRB: Not required
Role: Primary pretraining dataset for backbone encoder
Size: ~2,000,000 expected
Personnel: Michelle, Georgie
- Curate all candidate open-source face datasets
- Transfer batch 1 datasets to workstation
- Prep required datasets (UMD, Tufts, FIML)
- Do summary stats on large dataset
- Remove duplicate images and recount
- Obtain subset 0
- Build rotation toolkit
- Create subset 1 (rotated) for all datasets
- Develop inclusion criteria for S1
- Sample images from imperfect datasets and pass to MF
- Obtain % noise from all imperfect datasets on sample
- Compute stats
- Decide which datasets need further CNN cleaning
- Repeat above for Batch 2 (VGGFace2 and FFHQ)
- Regrade negative images to create true negatives and true positives
- Create training data for these datasets
- Train/validate cleaners
- Estimate error bounds for CNN deployment and establish thresholds
- Repeat sampling and subsampling steps for Batch 2
- Deploy on datasets to create Subset 2
- Crop final images using log files to create Subset 3 for each dataset
- Create Subset 3 and do summary statistics
- Predict periorbital distances on Subset 3 to make Subset 4
- Create Subset 5 (OS and OD)
- Train a masked autoencoder (or similar; exact approach to be decided, meet with Sathya) on Subset 5 to show that good representations can be learned
- Ensure it is possible and easy to go from initial downloaded dataset to final product with software tools
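Two of the steps above, removing duplicate images and obtaining a noise percentage from a graded sample, can be sketched in code. This is a minimal illustration under stated assumptions, not the project's actual tooling: the function names `unique_by_content` and `wilson_interval` are hypothetical, duplicates are assumed to be byte-identical files (re-encoded or resized copies would need a perceptual method instead), and the Wilson score interval is one common choice for putting error bounds on a sampled proportion.

```python
import hashlib
import math
from pathlib import Path


def unique_by_content(paths):
    """Keep the first file seen for each distinct SHA-256 content hash.

    Assumption: duplicates are exact byte-level copies.
    """
    seen = set()
    kept = []
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept


def wilson_interval(noisy, sampled, z=1.96):
    """Wilson score interval for the noise fraction noisy / sampled.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    p = noisy / sampled
    denom = 1 + z**2 / sampled
    centre = (p + z**2 / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z**2 / (4 * sampled**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

For example, if a grader flags 12 of 200 sampled images as noise, `wilson_interval(12, 200)` returns lower and upper bounds around the observed 6% rate. Comparing the upper bound against a chosen tolerance is one possible way to decide whether a dataset needs further CNN cleaning.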
Historic UIC CFC Dataset
Thirty years of imaging data from the UIC craniofacial center (~17,000 images). Many patients have multiple images from different visits. Disease labels and operations are unknown. Both full-face and cropped eye images are available and preprocessed.
- Acquire from CFC
- Move in batches using TB storage
- Clean locally
UIC Retrospective Clinical Dataset — UIC Clinical
Retrospective cohort of oculoplastics and ophthalmology patients across a decade of visits, containing eye images at rest, periorbital measurements, ICD codes, procedural history, demographics, and (when available) functional testing such as GVF.
Category: UIC Clinical
IRB: STUDY2024-1118 (Amendment pending)
Role: Primary fine-tuning dataset for multiclass disease classification
Size: ~100,000
Personnel: Georgie, CCTS, IS, Dr. Setabutr, Dr. Hribar
- Establish plan with CCTS and IS for EPIC extraction and CCTS duet
- Draft IRB amendment and Data Request Agreement (DRA)
- Obtain stakeholder sign-off from clinical and data teams
- Submit IRB amendment and DRA for approval
- Initiate CCTS data pull request
- Receive CCTS data
- Submit EPIC data extraction request
- Receive finalized datasets from CCTS and EPIC
- Assemble finalized multimodal clinical dataset for training
Auxiliary Oculoplastics Dataset — UIC Clinical
Five years of Dr. Tran’s curated oculoplastic cases. Backup fine-tuning dataset in case of delays or filtering issues in the primary retrospective dataset.
Category: UIC Clinical
IRB: STUDY2024-1118
Role: Auxiliary validation + fine-tuning dataset
Size: ~1,000
Personnel: Sophia, Dr. Tran, Georgie
- Coordinate disease categories + scope with Dr. Tran and Sophia
- Sophia generates curated dataset from EPIC
- Finalize dataset transfer to Georgie
UIC CFC Dataset — UIC Clinical
Existing dataset of craniofacial syndromes from prior work. Includes multiple craniofacial subtypes and paired periorbital distances.
Category: UIC Clinical
IRB: STUDY2024-1118
Role: Fine-tuning + validation for craniofacial tasks
Size: ~1,500
Personnel: Dr. Purnell, Georgie
- Obtain access to dataset + metadata from Dr. Purnell
- Curate and structure dataset
GVF Dataset — UIC Clinical
Images paired with Goldmann Visual Field measurements, used for functional regression and classification tasks.
Category: UIC Clinical
IRB: STUDY2024-1118
Role: Functional prediction dataset
Size: ~1,000
Personnel: Georgie, Sasha, Dr. Setabutr, Dr. Tran
- Establish list of ptosis patients with GVF values
- Define extracted fields (Georgie + Sasha)
- Medical students pull and organize dataset
- Transfer finalized dataset to Georgie
- Baseline GVF classification + regression
- Draft manuscript
- Submit manuscript following completion of Sasha’s thesis
Glorbit Global Datasets — Prospective Global
Prospective, globally diverse datasets collected through the Glorbit imaging platform at sites including Ecuador and future expansions in Peru, Panama, and Africa. Used for validation, epidemiology, usability assessment, and global robustness testing.
Category: Prospective Global
IRB: Site-specific; Ecuador approved; others pending
Role: Prospective validation + global robustness
Size: Target ~1,500 across 3–5 sites
Personnel: Georgie, Sasha, Dr. Setabutr, Dr. Tran, Dr. Larrick
Progress for all international sites is tracked on the Global Deployments page.