Carolina Hatch, a senior cybersecurity student at Messiah, presented her honors research project at the recent Harrisburg BSides conference, held on April 25 at the PA Farm Show Complex. Her session, titled “Synthetic vs. Real-World Data: A Study in Data Poisoning and its Effects,” focused on how AI models can be susceptible to false information. Dr. Bibighaus (her research advisor) and a group of Messiah students attended the event to support Carolina and experience the regional conference for cybersecurity professionals, hackers, and enthusiasts.
Congratulations to Carolina on an outstanding presentation!
SESSION ABSTRACT: As new Artificial Intelligence (AI) models are released seemingly every day, cybersecurity experts are growing increasingly concerned with the data used to train these models. As the pace of model development quickens, many are turning to AI-generated, synthetic data to increase efficiency. The use of synthetic data significantly reduces the time spent on data collection and cleaning, allowing for faster creation of new models. However, emerging research suggests that these techniques may be creating insecure, unstable models. To address this growing concern, we created two separate AI models, one trained on synthetic data and the other trained on real-world data, and conducted data poisoning campaigns against each model. Our real-world data collection process focused on movie reviews published in newspapers from 1970-1999. The use of movie reviews allowed us to establish a baseline of truth from which to develop our poisoning campaign, and the period helped ensure each article was human-generated. The synthetic data was generated using GPT-3.5. By training two models on the same topic, we were able to directly compare the results of data poisoning in synthetic and real-world data. Our findings revealed the synthetic model was more susceptible to data poisoning than the real-world model, demonstrating a need for further research in this area.
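For readers new to the topic, the short Python sketch below shows one common form of data poisoning, label flipping, applied to a toy set of movie reviews. The abstract does not say which poisoning technique the project used, so label flipping is an assumption made purely for illustration; the function name `poison_labels` and the sample reviews are hypothetical and are not taken from Carolina's research.

```python
import random


def poison_labels(dataset, fraction, seed=0):
    """Return a copy of `dataset` with `fraction` of its binary labels flipped.

    `dataset` is a list of (text, label) pairs, where label is 0 or 1.
    This is an illustrative stand-in for a poisoning campaign, not the
    method used in the research described above.
    """
    rng = random.Random(seed)
    n_poison = int(len(dataset) * fraction)
    # Pick which examples to corrupt, without repeats.
    targets = set(rng.sample(range(len(dataset)), n_poison))
    return [
        (text, (1 - label) if i in targets else label)
        for i, (text, label) in enumerate(dataset)
    ]


# Hypothetical toy reviews: 1 = positive, 0 = negative.
reviews = [
    ("a moving and beautifully shot film", 1),
    ("dull plot and wooden acting", 0),
    ("a delightful comedy", 1),
    ("tedious and overlong", 0),
]

poisoned = poison_labels(reviews, fraction=0.5)
flipped = sum(1 for (_, a), (_, b) in zip(reviews, poisoned) if a != b)
print(f"{flipped} of {len(reviews)} labels flipped")
```

In a comparison like the one the abstract describes, the same poisoning step would be applied to both the synthetic and the real-world training sets, and each model's accuracy would be measured before and after to see which one degrades more.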