Datasets

This page provides an overview of all datasets used in the Global Oculomics Initiative. Each dataset entry includes a brief description, its role within the project, IRB considerations, and a progress bar tracking its development. Milestones are updated continuously as the project evolves.

The summary table gives an at-a-glance view of every dataset. Detailed information and progress tracking for each individual dataset follow below.

Summary Table

Dataset	Category	Role in Pipeline	IRB	Size	Status
Foundational Model Pretraining Dataset	Pretraining (Open Source)	Large-scale pretraining for backbone encoder	Not required	~4,000,000	DONE
UIC Retrospective Clinical Dataset	UIC Clinical	Fine-tuning multiclass disease classifier; multimodal modeling	STUDY2024-1118	~100,000	Final stages of IS
Auxiliary Oculoplastics Dataset	UIC Clinical	Auxiliary fine-tuning; fallback set	STUDY2024-1118	~1,000	DONE
Historic UIC CFC Dataset	UIC Clinical (CFC)	Pretraining	STUDY2024-1118	~17,000 images	DONE
UIC CFC Dataset	UIC Clinical	Fine-tuning + validation for craniofacial tasks	STUDY2024-1118	~1,500	DONE
GVF Dataset	UIC Clinical	Functional regression/classification task	STUDY2024-1118	~1,000	DONE
Glorbit Global Datasets	Prospective Global	Prospective validation; epidemiology; robustness	Site-specific IRBs + DUA with UIC	Target ~1,500 across sites	Active Collection

Individual Dataset Progress

This section contains progress tracking for each dataset being worked on. Expand the box to see the progress and next steps.

Foundational Model Pretraining Dataset

Curated from large-scale open-source face datasets (e.g., CelebA, LAION-Face, VGGFace2, FFHQ). Images are restricted to frontal views and cropped to maximize periorbital visibility. Periorbital distances are extracted using the segmentation and measurement algorithms from Georgie's PhD.

Category: Pretraining (Open Source)
IRB: Not required
Role: Primary pretraining dataset for backbone encoder
Size: ~2,000,000 expected
Personnel: Michelle, Georgie

Curate all candidate open-source face datasets
Transfer batch 1 datasets to workstation
Prep required datasets (UMD, Tufts, FIML)
Compute summary statistics on large dataset
Remove duplicate images and recount
Obtain subset 0
Build rotation toolkit
Create subset 1 (rotated) for all datasets
Develop inclusion criteria for S1
Sample images from imperfect datasets and pass to MF
Obtain % noise from all imperfect datasets on sample
Compute stats
Decide which datasets need further CNN cleaning
Repeat above for Batch 2 (VGGFace2 and FFHQ)
Regrade negative images to create true negatives and true positives
Create training data for these datasets
Train and validate CNN cleaners
Estimate error bounds for CNN deployment and establish thresholds
Repeat sampling and subsampling steps for Batch 2
Deploy on datasets to create Subset 2
Crop final images using log files to create Subset 3 for each dataset
Create Subset 3 and compute summary statistics
Predict periorbital distances on Subset 3 to make Subset 4
Create Subset 5 (OS and OD)
Train a masked autoencoder (exact approach to be decided; meet with Sathya) on Subset 5 to show that we can learn good representations
Ensure it is possible and easy to go from initial downloaded dataset to final product with software tools
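
The sampling steps in the checklist above (grade a random sample from each imperfect dataset, obtain the % noise, then decide which datasets need CNN cleaning) can be sketched as follows. This is a minimal illustration rather than the project's actual tooling: the Wilson score interval is one common way to put bounds on a sampled proportion, and the function names, the example counts, and the 5% noise tolerance are all hypothetical.

```python
import math


def wilson_interval(noisy: int, sampled: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the noise rate of a dataset.

    noisy   -- number of sampled images graded as unusable (hypothetical input)
    sampled -- total number of images graded in the sample
    z       -- normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    if not 0 <= noisy <= sampled:
        raise ValueError("noisy must be between 0 and sampled")
    p = noisy / sampled
    denom = 1 + z**2 / sampled
    centre = (p + z**2 / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z**2 / (4 * sampled**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


def needs_cleaning(noisy: int, sampled: int, max_noise: float = 0.05) -> bool:
    """Flag a dataset when the upper bound on its noise rate exceeds the tolerance.

    max_noise is an assumed tolerance, not a value taken from the project.
    """
    _, upper = wilson_interval(noisy, sampled)
    return upper > max_noise


if __name__ == "__main__":
    # Hypothetical grading results: (noisy images, images sampled) per dataset.
    graded = {"dataset_a": (3, 500), "dataset_b": (60, 500)}
    for name, (noisy, sampled) in graded.items():
        low, high = wilson_interval(noisy, sampled)
        print(f"{name}: noise {noisy / sampled:.1%} "
              f"(95% CI {low:.1%} to {high:.1%}), "
              f"needs cleaning: {needs_cleaning(noisy, sampled)}")
```

Using the upper bound instead of the point estimate makes the decision conservative: a small sample with few noisy images still yields a wide interval, so the dataset is flagged until enough images have been graded.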

Historic UIC CFC Dataset

Thirty years of imaging data from the UIC Craniofacial Center (~17,000 images). Many patients have multiple images from different visits. Disease labels and operative history are unknown. Both full-face and cropped eye images are available and preprocessed.

Acquire from CFC
Move in batches using TB storage
Clean locally

UIC Retrospective Clinical Dataset — UIC Clinical

Retrospective cohort of oculoplastics and ophthalmology patients across a decade of visits, containing eye images at rest, periorbital measurements, ICD codes, procedural history, demographics, and (when available) functional testing such as GVF.

Category: UIC Clinical
IRB: STUDY2024-1118 (Amendment pending)
Role: Primary fine-tuning dataset for multiclass disease classification
Size: ~100,000
Personnel: Georgie, CCTS, IS, Dr. Setabutr, Dr. Hribar

Establish plan with CCTS and IS for EPIC extraction and CCTS duet
Draft IRB amendment and Data Request Agreement (DRA)
Obtain stakeholder sign-off from clinical and data teams
Submit IRB amendment and DRA for approval
Initiate CCTS data pull request
Receive CCTS data
Submit EPIC data extraction request
Receive finalized datasets from CCTS and EPIC
Assemble finalized multimodal clinical dataset for training

Auxiliary Oculoplastics Dataset — UIC Clinical

Five years of Dr. Tran’s curated oculoplastic cases. Serves as a backup fine-tuning dataset in case of delays or filtering issues in the primary retrospective dataset.

Category: UIC Clinical
IRB: STUDY2024-1118
Role: Auxiliary validation + fine-tuning dataset
Size: ~1,000
Personnel: Sophia, Dr. Tran, Georgie

Coordinate disease categories + scope with Dr. Tran and Sophia
Sophia generates curated dataset from EPIC
Finalize dataset transfer to Georgie

UIC CFC Dataset — UIC Clinical

Existing dataset of craniofacial syndromes from prior work. Includes multiple craniofacial subtypes and paired periorbital distances.

Category: UIC Clinical
IRB: STUDY2024-1118
Role: Fine-tuning + validation for craniofacial tasks
Size: ~1,500
Personnel: Dr. Purnell, Georgie

Obtain access to dataset + metadata from Dr. Purnell
Curate and structure dataset

GVF Dataset — UIC Clinical

Images paired with Goldmann Visual Field measurements, used for functional regression and classification tasks.

Category: UIC Clinical
IRB: STUDY2024-1118
Role: Functional prediction dataset
Size: ~1,000
Personnel: Georgie, Sasha, Dr. Setabutr, Dr. Tran

Establish list of ptosis patients with GVF values
Define extracted fields (Georgie + Sasha)
Medical students pull and organize dataset
Transfer finalized dataset to Georgie
Baseline GVF classification + regression
Draft manuscript
Submit manuscript following completion of Sasha’s thesis

Glorbit Global Datasets — Prospective Global

Prospective, globally diverse datasets collected through the Glorbit imaging platform. Ecuador is the current site, with future expansions planned in Peru, Panama, and Africa. Used for validation, epidemiology, usability assessment, and global robustness testing.

Category: Prospective Global
IRB: Site-specific; Ecuador approved; others pending
Role: Prospective validation + global robustness
Size: Target ~1,500 across 3–5 sites
Personnel: Georgie, Sasha, Dr. Setabutr, Dr. Tran, Dr. Larrick

Progress for all international sites is tracked on the Global Deployments page.