I Built a Vulnerable Finance App & Let AI Hack It – Here's What Happened (and How Much It Cost)

The rise of Large Language Models (LLMs) like GPT-4, Gemini, and Claude has been nothing short of revolutionary. They’re writing code, creating content, and even… hacking? That last one kept me up at night. As someone deeply involved in fintech and application security, I wanted to understand just how dangerous these tools could be in the wrong hands. So, I decided to run a pretty extreme experiment: I built a deliberately vulnerable finance app and spent $1,500 trying to get LLMs to exploit it. This isn't about scaremongering; it's about understanding the risks and learning how to build more secure applications.

§Why Test LLMs Against a Finance App?

Financial applications are prime targets for hackers. A successful breach doesn’t just mean data loss; it can mean direct financial harm to users. The stakes are incredibly high. Furthermore, finance apps often rely on complex APIs and integrations, introducing more potential attack surfaces.

I chose to focus on LLMs because of their unique capabilities. Traditional security testing often focuses on known vulnerabilities. Given well-crafted prompts, LLMs can surface weaknesses and potentially exploit them in ways human testers might not anticipate. Their ability to generate and test code automatically adds another dimension to the threat.

High Stakes: Financial data is highly valuable.
Complex Systems: Finance apps often have multiple integrations.
Novel Attack Vectors: LLMs can uncover unknown vulnerabilities.
Automated Exploitation: LLMs can rapidly test and refine attack methods.

§Building the Vulnerable App: "PennyPincher"

I named my test application “PennyPincher.” It was intentionally simplified to focus on common vulnerabilities. It wasn’t meant to be a real-world application, but rather a controlled environment for testing.

§Here’s a simplified overview of PennyPincher’s architecture:

Frontend (React): A basic user interface for viewing account balances and transaction history.
Backend (Python/Flask): Handles user authentication, data storage, and API requests.
Database (PostgreSQL): Stores user data, account balances, and transaction information.
API Endpoints: Designed with a few key vulnerabilities (detailed below).

§I specifically introduced these vulnerabilities:

SQL Injection: A classic vulnerability where malicious SQL code could be injected through user input fields. I left the balance query particularly exposed.
Insecure Direct Object Reference (IDOR): Users could potentially access data belonging to other users by manipulating the account ID in API requests.
Weak Input Validation: The application didn't adequately sanitize user input, potentially allowing for cross-site scripting (XSS) attacks.
Exposed API Keys: I intentionally left a placeholder API key visible in the frontend code (a big no-no in real applications!).
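
To make the SQL injection flaw concrete, here is a minimal sketch of what an exposed balance query can look like next to a parameterized one. This is not PennyPincher's actual code: the table layout, the function names, and the use of an in-memory SQLite database (instead of PostgreSQL) are assumptions chosen so the example runs on its own.

```python
import sqlite3


def make_db():
    # Hypothetical schema; the real app used PostgreSQL with its own tables.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
    )
    conn.executemany(
        "INSERT INTO accounts (id, owner, balance) VALUES (?, ?, ?)",
        [(1, "alice", 120.50), (2, "bob", 9800.00)],
    )
    return conn


def get_balance_vulnerable(conn, account_id):
    # User input is pasted straight into the SQL text, so it can change
    # the structure of the query itself.
    query = f"SELECT owner, balance FROM accounts WHERE id = {account_id}"
    return conn.execute(query).fetchall()


def get_balance_safe(conn, account_id):
    # The placeholder keeps the input as a plain value; it is never parsed as SQL.
    query = "SELECT owner, balance FROM accounts WHERE id = ?"
    return conn.execute(query, (account_id,)).fetchall()


conn = make_db()
payload = "1 OR 1=1"  # classic tautology payload

leaked = get_balance_vulnerable(conn, payload)  # every row in the table
blocked = get_balance_safe(conn, payload)       # no row has that literal id

print(leaked)
print(blocked)
```

The only difference between the two functions is whether the input is concatenated into the query string or bound as a parameter, which is why parameterized queries are the standard fix for this class of bug.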

§The Experiment: Unleashing the LLMs

My goal wasn't just to see if an LLM could find vulnerabilities, but to quantify the cost and effort involved. I primarily used three LLMs:

GPT-4: The gold standard, known for its reasoning and coding abilities. I accessed it through the OpenAI API, incurring costs based on token usage.
Gemini (via Google AI Studio): Google’s powerful model, also accessed via an API with usage-based pricing.
Claude 3 Opus: Anthropic's flagship model, lauded for its strong performance and nuanced understanding. Accessed via API.

I approached the testing in stages, increasing complexity with each iteration.

§Phase 1: Reconnaissance & Vulnerability Identification (Cost: ~$300)

I started by asking the LLMs to analyze the app's documentation (which I also intentionally made slightly misleading) and identify potential vulnerabilities. I gave each LLM access to the API documentation and the frontend code (excluding the API key, initially).

The results were surprisingly good. GPT-4 quickly identified the potential for SQL injection. Gemini flagged the IDOR vulnerability, and Claude 3 pointed out the lack of robust input validation. This phase demonstrated that LLMs are excellent at passive reconnaissance. They can sift through code and documentation to identify weaknesses with minimal prompting. I used prompt engineering techniques to encourage detailed vulnerability reports, requesting explanations and potential exploitation strategies.

§Phase 2: Exploitation Attempts (Cost: ~$800)

This is where things got interesting. I tasked the LLMs with crafting exploits to leverage the vulnerabilities they had identified. I provided them with examples of API requests and encouraged them to modify them to achieve malicious goals.

SQL Injection: GPT-4 and Gemini both generated functional SQL injection payloads that successfully retrieved other users' account balances. This was deeply concerning. The payloads weren't sophisticated, but they worked.
IDOR: Gemini and Claude 3 were successful in exploiting the IDOR vulnerability, gaining access to sensitive transaction data belonging to other users.
API Key Discovery: I then revealed the placeholder API key in the frontend code. All three LLMs immediately identified it. GPT-4 even suggested how it could be used to gain unauthorized access to related services.

This phase highlighted the LLMs’ ability to translate theoretical vulnerabilities into practical exploits. It was far easier and cheaper than hiring a penetration tester, and the results were alarming. I spent a significant amount of time reviewing the LLM-generated code to ensure it didn't contain any malicious side effects beyond the intended exploits.
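
The IDOR issue exploited in this phase comes down to a missing ownership check. The sketch below illustrates the pattern with plain Python functions rather than real Flask routes. The account data, function names, and the `Forbidden` exception are all hypothetical and not taken from PennyPincher.

```python
# Hypothetical in-memory data standing in for the database.
ACCOUNTS = {
    101: {"owner": "alice", "transactions": ["coffee -3.50", "salary +2500.00"]},
    102: {"owner": "bob", "transactions": ["rent -1200.00"]},
}


class Forbidden(Exception):
    """Raised when a user asks for an account they do not own."""


def get_transactions_vulnerable(session_user, account_id):
    # The account ID from the request is trusted as-is;
    # session_user is never consulted.
    return ACCOUNTS[account_id]["transactions"]


def get_transactions_checked(session_user, account_id):
    # Same lookup, but the account must belong to the logged-in user.
    account = ACCOUNTS.get(account_id)
    if account is None or account["owner"] != session_user:
        raise Forbidden(f"{session_user} may not read account {account_id}")
    return account["transactions"]


# Alice changes the ID in her request from 101 to 102.
print(get_transactions_vulnerable("alice", 102))  # Bob's data is returned

try:
    get_transactions_checked("alice", 102)
except Forbidden as exc:
    print("blocked:", exc)
```

Returning the same error for "does not exist" and "not yours" is a deliberate choice in this sketch, since it avoids telling an attacker which account IDs are valid.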

§Phase 3: Advanced Attacks & Chaining (Cost: ~$400)

In the final phase, I challenged the LLMs to combine vulnerabilities to achieve more significant impacts. Could they use the SQL injection vulnerability to discover the API key, and then use the API key to drain accounts?

GPT-4 and Gemini were able to construct a chain of exploits that successfully simulated a complete account takeover. They started with SQL injection to find another user's API token (not the deliberately exposed one!), then used that token to make fraudulent transactions.
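
The chain described above can be reproduced in miniature against a throwaway in-memory database. This is a simplified sketch, not the payloads the models actually produced: the `api_tokens` table, the token values, and the `transfer` function are assumptions made up for illustration, and SQLite stands in for PostgreSQL.

```python
import sqlite3


def make_db():
    # Hypothetical schema with a token table sitting next to the accounts.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL);
        CREATE TABLE api_tokens (owner TEXT, token TEXT);
        INSERT INTO accounts VALUES (1, 'attacker', 10.0), (2, 'victim', 5000.0);
        INSERT INTO api_tokens VALUES ('attacker', 'tok-attacker'), ('victim', 'tok-victim');
        """
    )
    return conn


def get_balance_vulnerable(conn, account_id):
    # Injectable query: the input is concatenated into the SQL text.
    query = f"SELECT owner, balance FROM accounts WHERE id = {account_id}"
    return conn.execute(query).fetchall()


def transfer(conn, token, to_account, amount):
    # The token alone decides whose money moves, so a stolen token is enough.
    row = conn.execute(
        "SELECT owner FROM api_tokens WHERE token = ?", (token,)
    ).fetchone()
    if row is None:
        raise PermissionError("unknown token")
    conn.execute(
        "UPDATE accounts SET balance = balance - ? WHERE owner = ?", (amount, row[0])
    )
    conn.execute(
        "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_account)
    )


conn = make_db()

# Step 1: a UNION payload makes the balance endpoint return a token instead.
payload = "0 UNION SELECT owner, token FROM api_tokens WHERE owner = 'victim'"
stolen_token = get_balance_vulnerable(conn, payload)[0][1]

# Step 2: the stolen token authorizes a transfer out of the victim's account.
transfer(conn, stolen_token, to_account=1, amount=500.0)

balances = dict(conn.execute("SELECT owner, balance FROM accounts").fetchall())
print(stolen_token, balances)
```

Neither step is sophisticated on its own. The damage comes from the injectable query sharing a database with credentials that are sufficient, by themselves, to move money.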

§Key Findings & What This Means for Developers

This experiment wasn’t about proving LLMs are inherently evil. It was about understanding their capabilities and the potential risks. Here’s what I learned:

LLMs are Powerful Vulnerability Finders: They can quickly identify common vulnerabilities with minimal guidance.
LLMs Can Generate Functional Exploits: They can translate vulnerabilities into working code.
Chain Attacks are Possible: LLMs can combine multiple vulnerabilities to achieve more significant impacts.
The Cost is Low: Exploiting vulnerabilities with LLMs can be surprisingly affordable. $1,500 is less than a single day of a skilled penetration tester's time.
Prompt Engineering is Critical: The quality of the prompts heavily influences the results. Well-crafted prompts yield more detailed and accurate vulnerability reports and exploits.

§How to Protect Your Finance App

So, what can developers do to protect their applications?

Secure Coding Practices: Prioritize secure coding practices from the outset. Implement robust input validation, parameterized queries, and secure authentication mechanisms. A good resource for this is the OWASP Top Ten.
Regular Security Audits: Conduct regular security audits and penetration testing to identify and address vulnerabilities.
API Security: Protect your APIs with authentication, authorization, and rate limiting. Never expose API keys in client-side code!
Web Application Firewalls (WAFs): Implement a WAF to filter out malicious traffic and protect against common attacks.
Runtime Application Self-Protection (RASP): RASP solutions monitor application behavior and can detect and prevent attacks in real-time.
Stay Updated: Keep your software and dependencies up to date to patch known vulnerabilities.
Monitor API Usage: Implement robust monitoring and alerting to detect unusual API activity.
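
As one illustration of the rate limiting and monitoring points above, here is a minimal sliding-window limiter. It is a sketch under simple assumptions: a single process, in-memory state, and a hypothetical `client_id` such as an API token or IP address. A production setup would more likely rely on a gateway, a WAF, or a shared store.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of allowed requests

    def allow(self, client_id, now):
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject, and ideally raise an alert
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)

# Timestamps are passed in explicitly so the behavior is easy to follow.
results = [limiter.allow("client-a", t) for t in (0, 1, 2, 3, 61)]
print(results)  # [True, True, True, False, True]
```

Automated exploitation of the kind described in this post tends to produce bursts of near-identical requests, so even a simple limiter like this one, with rejected requests feeding an alert, raises the cost of an attack.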

§The Future of AI and Security

This experiment is just the beginning. As LLMs become more sophisticated, they will inevitably become more powerful hacking tools. We need to adapt our security practices to stay ahead of the curve. Investing in automated security testing tools, incorporating AI-powered security solutions, and fostering a security-conscious culture within development teams will be crucial. The cat-and-mouse game between attackers and defenders just got a whole lot more interesting.

§Disclaimer:

The purpose of this experiment was to demonstrate potential security vulnerabilities for educational purposes. Do not attempt to exploit vulnerabilities in systems you do not own or have explicit permission to test. Always act ethically and responsibly.