Hello 👋, I'm

Cyril Nana Boakye Benson

Data Scientist & Biostatistician

Get To Know More

About Me

I'm Cyril Benson, a Data Scientist and Biostatistician with an M.S. in Statistics from Oregon State University and a B.S. in Actuarial Science from Kwame Nkrumah University of Science and Technology (KNUST).

I have experience architecting end-to-end ML systems, scalable data pipelines, and advanced statistical frameworks across healthcare, insurance, and research domains. I design and deploy production-grade predictive models using Python, R, SQL, and GCP, with deep fluency in XGBoost, LightGBM, survival analysis, NLP (BERT), and Bayesian inference.

I deliver executive-ready insights through Power BI and Tableau, lead cross-functional analytics initiatives, and translate complex business problems into high-impact, data-driven solutions. My work is recognized for rigorous statistical methodology, MLOps best practices, and consistent delivery of measurable outcomes across complex, regulated environments.

Get To Know My

Experience

CRM & Data Analyst

The Brooklyn Wellness Club

June 2025 – Present

Statistical Consultant

Oregon State University

September 2024 – June 2025

Graduate Teaching Assistant

Oregon State University

September 2023 – June 2025

Data & Research Analyst

KNUST

October 2022 – August 2023

Actuarial Data Analyst

National Insurance Trust

September 2021 – December 2021

Explore My

Skills

Languages & Tools

Python

R

SAS

SQL

Git

Linux

AWS

GCP

PostgreSQL

MongoDB

Tableau

Power BI

R-Shiny

MLflow

ML & Statistics

XGBoost

LightGBM

Scikit-learn

TensorFlow

PyTorch

BERT

Bayesian Inference

Survival Analysis

RNA-seq

ggplot2

Matplotlib

CDISC SDTM

TLFs

SAPs

My

Certifications

Professional certificates spanning the IBM Data Science Professional Certificate track (Coursera) and SAS programming.
View all badges on Credly →

IBM Data Science Professional Certificate (Coursera · IBM)
Getting Started with SAS® Programming (SAS)
Machine Learning with Python (Coursera · IBM)
Databases and SQL for Data Science (Coursera · IBM)
Applied Data Science Capstone (Coursera · IBM)

Browse My Recent

Projects

Diabetes Risk Prediction: EDA, Feature Engineering, Model Comparison & Interactive Risk-Screening Dashboard

Diabetes Risk Dashboard

Built an end-to-end machine learning pipeline to predict diabetes risk from 253,680 CDC BRFSS health survey responses. Rigorous statistical testing — chi-square and Cramér's V for categorical features, Kruskal-Wallis and epsilon-squared for continuous ones — across 22 health indicators identified high blood pressure, difficulty walking, general health, high cholesterol, and heart disease history as the leading risk factors, despite a severe class imbalance (84.2% No Diabetes, 1.8% Prediabetes, 13.9% Diabetes). After engineering a WHO BMI category, a lifestyle score, and a comorbidity score, five classifiers — Logistic Regression, Random Forest, HistGradientBoosting, XGBoost, and LightGBM — were benchmarked on a stratified 70–15–15 split with class-weighted training and hyperparameter tuning via RandomizedSearchCV. The final LightGBM model achieved a Macro ROC-AUC of 0.775 and was explained with SHAP, surfacing general health, comorbidity score, and age as the top global risk drivers. The model is deployed as a live, interactive Streamlit dashboard with an overview, a filterable EDA explorer, a real-time risk predictor, and a model insights view.
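As a rough illustration of the effect-size step mentioned above, Cramér's V can be derived from the chi-square statistic of a contingency table. The sketch below is a minimal, standard-library version; the function name `cramers_v` and the toy 2x2 table are hypothetical and are not taken from the project, and no continuity correction is applied.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows.

    Assumes every row and column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalise by sample size and the smaller table dimension minus one.
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical counts: rows = risk factor (no, yes), columns = outcome (no, yes).
toy_table = [[10, 20], [20, 10]]
print(round(cramers_v(toy_table), 3))  # 0.333
```

Values near 0 indicate almost no association and values near 1 a very strong one, which is what makes the measure useful for ranking categorical features on a common scale.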

Falcon 9 First Stage Landing Prediction: Data Collection, SQL & Visual EDA, Geospatial Analysis & Classification Modeling

Falcon 9 First Stage Landing Prediction

Built an end-to-end pipeline to predict whether a SpaceX Falcon 9 booster's first stage would land successfully, using 90 launches sourced independently from SpaceX's public REST API and a Wikipedia table scraped with BeautifulSoup, then reconciled against each other before modeling. Landing labels were engineered from raw outcome strings, treating missing landing-pad data as informative rather than erroneous. The data was then explored through direct SQL queries against a Db2 table, matplotlib/seaborn analysis of flight number, payload mass, launch site, and orbit, and an interactive Folium map and Plotly Dash dashboard profiling each launch site's geography and success rate. Four classifiers (Logistic Regression, SVM, Decision Tree, and KNN) were tuned via GridSearchCV with 10-fold cross-validation on an 80/20 split; the tuned Decision Tree was the strongest performer at 94.4% held-out test accuracy, correctly calling all 12 successful landings and 5 of 6 failures. The project frames these predictions as the basis for estimating SpaceX's per-launch cost advantage from booster reuse.
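The held-out figures above imply an 18-launch test set (20% of 90) with one misclassified failure. The short sketch below reproduces the arithmetic from those counts; the helper name `classification_summary` is illustrative, and the counts are read directly from the description rather than from the project's notebook.

```python
def classification_summary(tp, tn, fp, fn):
    """Basic metrics from confusion-matrix counts (positive class = successful landing)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),        # share of real landings that were caught
        "specificity": tn / (tn + fp),   # share of real failures that were caught
        "precision": tp / (tp + fp),     # share of predicted landings that were real
    }


# 12 of 12 landings and 5 of 6 failures predicted correctly on the test set.
summary = classification_summary(tp=12, tn=5, fp=1, fn=0)
print({name: round(value, 3) for name, value in summary.items()})
# {'accuracy': 0.944, 'recall': 1.0, 'specificity': 0.833, 'precision': 0.923}
```

With a test set this small, a single launch moves accuracy by about 5.6 percentage points, so the per-class counts are more informative than the headline figure alone.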

Clinical Data Science Pipeline: EDA, Biostatistical Inference, Survival Modeling, Multi-Class ML & Patient Segmentation

Healthcare Analytics

Built a production-grade clinical analytics framework applied to 54,966 synthetic patient records across a six-phase pipeline. The process began with rigorous data provenance work (detecting and correcting 534 duplicate records and 108 erroneous billing entries, and engineering Length of Stay as a time-to-event variable), followed by multi-dimensional exploratory analysis across demographics, clinical patterns, financials, and operational performance using interactive Plotly visualisations. Biostatistical inference (chi-square tests of independence, Kruskal-Wallis ANOVA, and odds ratios with 95% confidence intervals) was used to interrogate clinical associations, while Kaplan-Meier survival functions and a multivariate Cox Proportional Hazards model (concordance = 0.506) characterised time-to-discharge. Logistic Regression, XGBoost, and Random Forest classifiers were trained and compared for tri-class test result prediction, and the best model (Random Forest, Macro F1 = 0.43, ROC-AUC = 0.629) was explained via SHAP. The pipeline concluded with K-Means clustering and UMAP dimensionality reduction for unsupervised patient segmentation.
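For readers unfamiliar with the odds-ratio step, a minimal sketch of an odds ratio with a Wald 95% confidence interval follows. It assumes a 2x2 table with no zero cells; the function name `odds_ratio_ci` and the example counts are hypothetical and do not come from the project data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a, b: exposed group with / without the outcome
    c, d: unexposed group with / without the outcome
    z:    normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 20/100 exposed and 10/100 unexposed patients with the outcome.
estimate, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

An interval that contains 1 means the data are compatible with no association at the chosen confidence level.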

Predictive Modeling in Low- and High-Dimensional Settings: A Machine Learning Approach to Idiopathic Pulmonary Fibrosis (IPF) Progression

Idiopathic Pulmonary Fibrosis Progression Modeling — ROC curve and variable importance

Applied a two-phase supervised learning framework to predict disease progression in Idiopathic Pulmonary Fibrosis (IPF) using clinical and high-dimensional proteomic data from 60 patients monitored over 80 weeks, of whom 58% progressed. In a low-dimensional setting using 14 curated covariates — including six proteomic biomarkers linked to IPF progression — Decision Tree, Random Forest, and Logistic Regression classifiers were trained on a stratified 70/30 split with 10-fold cross-validation, tuned via out-of-bag error and GVIF diagnostics; Random Forest performed best, reaching 83.3% accuracy, 100% sensitivity, and an AUC of 0.984. In a high-dimensional setting using the full 1,129-covariate proteomic profile, LASSO logistic regression and Random Forest were compared through regularisation and ensemble tuning; Random Forest again outperformed (61.1% accuracy, F1 = 0.842, AUC = 0.569), though both models' limited generalisation highlighted the trade-off between model flexibility and statistical power when predictors vastly outnumber observations.

Bayesian Linear Regression with Gibbs Sampling: Modeling Healthcare Costs Using Demographic and Lifestyle Predictors

Bayesian Linear Regression with Gibbs Sampling

Built a Bayesian linear regression model with a custom Gibbs sampler (5,000 iterations, 1,000 burn-in) to estimate the effects of age, sex, BMI, number of children, smoking status, and region on medical insurance charges for 1,338 policyholders, using conjugate normal-inverse-gamma priors for closed-form posterior updates. Posterior estimates identified smoking status as the dominant driver of cost (posterior mean β = 0.795, P(β > 0) = 1.000), followed by age (β = 0.298) and BMI (β = 0.171), with smaller but credible effects for number of children and the southeast/southwest regions. A parallel frequentist OLS model produced consistent point estimates and significance patterns (e.g., smoking added $23,849 to charges, p < 0.0001), and MCMC trace and density diagnostics confirmed good mixing and convergence — illustrating how Bayesian credible intervals and inclusion probabilities can complement traditional p-values for healthcare cost modelling.
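The sampler itself is not shown here, so the following is only a standard-library sketch of how a Gibbs sampler for this kind of model can look, not the project's implementation. It assumes a normal-inverse-gamma prior with zero prior mean, and it updates one coefficient at a time rather than drawing the whole coefficient vector in a block. The hyperparameters `tau2`, `a0`, and `b0`, the function name, and the synthetic data are all illustrative assumptions; only the default iteration and burn-in counts mirror the description above.

```python
import random


def gibbs_linear_regression(X, y, n_iter=5000, burn_in=1000,
                            tau2=100.0, a0=2.0, b0=1.0, seed=0):
    """Gibbs sampler for y = X beta + noise.

    Assumed prior: beta_j | sigma2 ~ N(0, sigma2 * tau2), sigma2 ~ Inv-Gamma(a0, b0).
    Returns a list of (beta, sigma2) draws kept after burn-in.
    """
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    beta, sigma2 = [0.0] * p, 1.0
    # Column sums of squares are fixed, so compute them once.
    sxx = [sum(row[j] ** 2 for row in X) for j in range(p)]
    draws = []
    for it in range(n_iter):
        for j in range(p):
            # Partial residual: remove every predictor's contribution except j.
            sxr = 0.0
            for row, yi in zip(X, y):
                others = sum(row[k] * beta[k] for k in range(p) if k != j)
                sxr += row[j] * (yi - others)
            denom = sxx[j] + 1.0 / tau2
            # Full conditional of beta_j is normal (closed form).
            beta[j] = rng.gauss(sxr / denom, (sigma2 / denom) ** 0.5)
        ssr = sum((yi - sum(x * b for x, b in zip(row, beta))) ** 2
                  for row, yi in zip(X, y))
        shape = a0 + (n + p) / 2.0
        rate = b0 + 0.5 * (ssr + sum(b * b for b in beta) / tau2)
        # Full conditional of sigma2 is inverse-gamma: invert a gamma draw.
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        if it >= burn_in:
            draws.append((list(beta), sigma2))
    return draws


# Synthetic check (not project data): true intercept 1.0 and slope 2.0.
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(100)]
X = [[1.0, x] for x in xs]
y = [1.0 + 2.0 * x + data_rng.gauss(0, 0.5) for x in xs]
draws = gibbs_linear_regression(X, y, n_iter=2000, burn_in=500)
posterior_mean = [sum(b[j] for b, _ in draws) / len(draws) for j in range(2)]
print(posterior_mean)
```

A quantity such as P(β > 0) can then be approximated as the share of retained draws in which the coefficient is positive.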

Survival Analysis of Cyclosporin A Treatment Efficacy in Primary Biliary Cirrhosis: Kaplan-Meier, Cox Proportional Hazards & Weibull AFT Modeling

Survival Analysis of Primary Biliary Cirrhosis

Analysed survival outcomes for 349 patients in the PBC3 multi-centre randomised trial to assess whether Cyclosporin A (CyA) reduces the risk of treatment failure — death or liver transplant — in Primary Biliary Cirrhosis, with all modelling carried out in SAS. Kaplan-Meier estimation found no significant survival difference between CyA and placebo (p = 0.78), but revealed disease stage, sex, and gastrointestinal bleeding history as strong univariate prognostic factors (p < 0.0001, p = 0.006, and p < 0.05 respectively). A multivariate Cox Proportional Hazards model (global p < 0.0001) identified sex, bilirubin, albumin, and disease stage as significant predictors of survival, while a Weibull Accelerated Failure Time model corroborated these findings and additionally flagged aspartate transaminase and treatment as significant accelerants of time-to-failure; residual diagnostics confirmed both models' assumptions were well met, aside from a proportional-hazards violation for stage 1.
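The modelling for this project was carried out in SAS; the Python sketch below is only meant to illustrate the product-limit formula behind a Kaplan-Meier curve. The function name `kaplan_meier` and the five-patient toy data are hypothetical and unrelated to the PBC3 trial.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates at each distinct failure time.

    times:  follow-up time for each subject
    events: 1 if the failure was observed at that time, 0 if censored
    Returns a list of (time, estimated survival probability) pairs.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        failures = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # failures plus censored at t
        if failures > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy data: failures at t = 1, 2, 3; censored observations at t = 2 and t = 4.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print([(t, round(s, 3)) for t, s in curve])  # [(1, 0.8), (2, 0.6), (3, 0.3)]
```

Censored subjects reduce the number at risk without producing a step in the curve, which is why the estimate only changes at observed failure times.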

Get in Touch

Contact Me