Field Journal, 2026 Scholars, Week 2

Some thoughts from week 2!

Jun 05, 2026

Matthew Charles Lombardi

Undergraduate Research Assistant, Columbia University Department of Biological Sciences

What are some of the ethical issues that you are grappling with in your research? What are some of the ways in which you are responding to these questions?

Ultimately, machine learning models used for medical applications are only as effective as the data they are trained on. If the genetic data from the biomedical studies behind these models are biased towards (or against) specific patient populations, the models will carry those flaws forward. Unfortunately, the problem often begins at the source, with how genetic data are collected. Missing information and incomplete studies disproportionately affect minority communities, so the needs and realities of many patients end up underrepresented. According to one study indexed on PubMed (Peterson et al., 2019), over 75% of all Genome-Wide Association Study (GWAS) participants are of European descent, which is not representative of the global population. This imbalance can dampen the medical benefits that AI could bring and widen the healthcare disparities that already exist. My project will attempt to lessen these effects by testing the RNA model in question on multiple distinct datasets, to keep the model’s predictions as broad and universally applicable as possible. On a more personal note, projects that investigate what a model is actually learning (interpretability) allow us to uncover these biases. If we know what a model chooses to prioritize and how, we can guide it toward the information it is structurally blind to. I am planning to engage in more interpretability research like this in the future.

As you continue your research, have you considered alternative viewpoints in your investigation? If so, how have these alternative viewpoints enriched or changed your project?

As I review related literature with my mentor, we are starting to think that we should reconsider the metrics used to judge a model’s performance. When we looked at the correlation between a model’s predictions about an shRNA sequence and its real biological potency, the correlation was around 0.45, which we considered mediocre. However, I suspect this is partly a result of the way we normalize potency from 0 to 1, as is standard convention. Biologically, when potency across a set of silencing-RNA sequences is scaled from 0 to 1, you would expect very few sequences to be potent at gene silencing, with the rest being ineffective. By forcing everything onto one flat scale, are we penalizing a model for struggling to predict essentially random, near-zero sequences even when it does well on the highly potent ones? This alternative viewpoint is enriching the project because it forces me to understand the biological question within a statistical framework and to ask whether there are better ways to prepare the data for analysis and correlation. I am not sure what the outcome will be, but I am enjoying this chance to consider things in a different light.
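
To make the question concrete, here is a minimal sketch of how one might compare a single overall correlation against alternatives that are less sensitive to a crowd of near-zero sequences: a rank-based (Spearman) correlation, and a Pearson correlation restricted to the most potent sequences. Everything here is an assumption for illustration: the data are synthetic, and the names `evaluate` and `top_fraction` are hypothetical, not part of our actual pipeline or results.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def evaluate(measured, predicted, top_fraction=0.1):
    """Compare overall correlations with one computed on the most potent subset."""
    n_top = max(2, int(len(measured) * top_fraction))
    top = sorted(range(len(measured)), key=lambda i: measured[i], reverse=True)[:n_top]
    return {
        "pearson_all": pearson(measured, predicted),
        "spearman_all": spearman(measured, predicted),
        "pearson_top": pearson([measured[i] for i in top],
                               [predicted[i] for i in top]),
    }


# Synthetic stand-in data (NOT real measurements): a skewed potency
# distribution where most sequences sit near zero, plus noisy "predictions".
random.seed(0)
raw_potency = [random.betavariate(0.5, 3.0) for _ in range(500)]
measured = min_max(raw_potency)
predicted = [m + random.gauss(0.0, 0.15) for m in measured]

report = evaluate(measured, predicted)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

If the three numbers disagree sharply on real data, that would suggest the single headline correlation is hiding where the model succeeds and where it fails. Note that restricting to a subset also narrows the range of potencies, which by itself can lower a correlation, so the subset figure should be read alongside the others and not on its own.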

Where does your research take place?

I am very fortunate to have a lab space on Columbia’s main campus where I conduct some of the computational aspects of my project. However, a lot of my reading, analysis, and planning happens outside that space in various Columbia Libraries (including my favorite, The Burke Library). I always seem to focus best there!

Referenced PubMed article:

Peterson, R. E., Kuchenbaecker, K., Walters, R. K., Chen, C. Y., Popejoy, A. B., Periyasamy, S., Lam, M., Iyegbe, C., Strawbridge, R. J., Brick, L., Carey, C. E., Martin, A. R., Meyers, J. L., Su, J., Chen, J., Edwards, A. C., Kalungi, A., Koen, N., Majara, L., Schwarz, E., … Duncan, L. E. (2019). Genome-wide Association Studies in Ancestrally Diverse Populations: Opportunities, Methods, Pitfalls, and Recommendations. Cell, 179(3), 589–603. https://doi.org/10.1016/j.cell.2019.08.051
