
# Project Context: PetClinic Microservices -> AWS

## Context and Scope

This project will migrate the Spring PetClinic Microservices demo from its local/on-premise setup to the AWS cloud. The focus is on infrastructure modernization, CI/CD automation, observability, and resilience, not on application feature development.

[**Pdf Download (click me)**](assets/week1/week1.pdf)

### Stakeholders

| Role | Responsibility |
|----|----|
| Project Sponsor | Funding, final approval |
| Project Manager | Scheduling, stakeholder coordination |
| Cloud Architect | Architecture, service selection |
| Dev Lead | App changes for cloud readiness |
| DevOps Engineer | CI/CD, IaC, deployments, monitoring |
| Security Team | IAM, encryption |
| End Users / Demo Audience | Acceptance and usability feedback |

### Expectations

- No app feature development unless necessary for cloud deployment.
- AWS is the only target cloud.

### Objectives

- Run full PetClinic microservices on AWS with CI/CD.
- Observability: logs, metrics, traces for 100% of services.
- Cost target: keep monthly infra cost under a defined limit.
- Security: secrets encrypted, least-privilege IAM, HTTPS for all endpoints.

### Deadlines

| Milestone | Date |
|----|----|
| Project approval | Oct 27, 2025 |
| CI/CD & Automation | Nov 3, 2025 |
| Infrastructure | Nov 10, 2025 |
| Data | Nov 17, 2025 |
| Observability | Nov 24, 2025 |
| Prep: Presentation, Demo, and Pre-defense | Dec 3, 2025 |

## In Scope

| Included items | Objective |
|----|----|
| Application | Only necessary changes (if applicable) to facilitate cloud integration |
| Infrastructure | Design and deploy a reproducible, cloud-native architecture |
| CI/CD Automation | Implement automated build, test, and deployment pipelines |
| Containerization | Adapt existing microservices to run as containers on AWS. |
| Monitoring & Logging | Centralized logs, metrics, and traces |
| Security & IAM | Least-privilege IAM roles, encryption, and subnet segmentation. |
| Backup & Recovery | Redundancy, failover, backup, BCP/DRP |
| Documentation | Architecture diagrams, specifications, and operational runbooks. |

## Out of Scope

| Excluded items | Reason |
|----|----|
| Application feature or UI changes | Functionality remains unchanged. |
| Multi-cloud or hybrid deployment | Focus solely on AWS environment. |
| Cost-optimization | Addressed in a later project if necessary |


## Requirements

### Functional requirements
| Stakeholder / Role | Requirement | Description |
|----|----|----|
| **Developers** | Continuous Integration | Each merge must trigger automated build, test, and image creation. |
| | Local to Cloud Parity | Development environment must mirror AWS setup. |
| **DevOps Engineers** | Automated Deployment | CI/CD pipeline must deploy microservices to Staging and Prod environments automatically. |
| | Test Automation | Integration tests must run automatically in CI/CD pipeline. |
| | Infrastructure as Code | All AWS resources defined through configuration files. |
| | Monitoring & Alerts | Centralized logging, metrics, and tracing for all microservices. Automated alerting for service downtime or threshold breaches. |
| | Scalability | Services must be horizontally scalable. |
| **Security Team** | Access Control | Roles per service with least-privilege permissions.|
| | Secrets Management | All secrets stored securely. |
| **Product / Management** | Availability & Demo Readiness | System must be reliable and presentable for client or internal demos.|
| **End Users (Demo Audience)** | Stable Access | Web UI and APIs must remain responsive under typical load. |

### Non-Functional Requirements
| Category | Requirement | Standard |
|----|----|----|
| **Development** | Local to Cloud parity | Docker Compose or local ECS simulation |
| **Maintainability** | IaC | Code stored in version control |
| **Performance** | Pipeline execution time | < 10 minutes per merge |
| | Scaling | Services can be scaled horizontally |
| | API / UI response | < 200 ms under normal demo load |
| **Reliability** | Deployment success rate | ≥ 99% successful deployments |
| | Alert response | Alerts trigger within 5 minutes of failure detection |
| | Error tolerance | < 0.1% failed requests |
| **Availability** | System uptime | ≥ 99.9% uptime |
| **Observability** | Logs, Metrics, Traces | Centralized in monitoring solution |
| **Security** | Least-privilege roles | Roles restricted per service; no default full-access policies |
| | Secret encryption | Secrets stored encrypted in AWS Secrets Manager |
| **Cost** | Budget target | Monthly AWS cost ≤ defined cap |
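
The percentage targets above are easier to reason about as absolute budgets. The sketch below converts the uptime and error-tolerance targets into minutes and request counts. The 30-day month and the request volume are illustrative assumptions, not values defined by this project.

```python
# Translate the percentage targets from the table into absolute budgets.
# The 30-day month and the request volume are illustrative assumptions.

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_target: float, days: int = 30) -> float:
    """Return the downtime budget in minutes for a given uptime target."""
    return (1.0 - uptime_target) * days * MINUTES_PER_DAY


def allowed_failed_requests(error_rate_limit: float, total_requests: int) -> int:
    """Return the request count at which the error-rate limit is reached."""
    return int(error_rate_limit * total_requests)


if __name__ == "__main__":
    # 99.9% uptime over an assumed 30-day month
    print(round(allowed_downtime_minutes(0.999), 1))
    # 0.1% error limit over an assumed 100,000 requests
    print(allowed_failed_requests(0.001, 100_000))
```

Under these assumptions, 99.9% uptime leaves roughly 43 minutes of downtime per month, and the failed-request count has to stay below 100 per 100,000 requests.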

## System Components — Spring PetClinic Microservices
| Component | Role / Function | Dependencies |
|----|----|----|
| `spring-petclinic-admin-server` | Provides admin UI and dashboards | Microservices, Config Server |
| `spring-petclinic-api-gateway` | Routes external requests to microservices | Customers, Vets, Visits, GenAI services |
| `spring-petclinic-config-server` | Centralized configuration | Git repo |
| `spring-petclinic-customers-service` | Manages customer data | RDBMS, Config Server |
| `spring-petclinic-vets-service` | Manages veterinary staff | RDBMS, Config Server |
| `spring-petclinic-visits-service` | Manages pet visit records | RDBMS, Customers Service |
| `spring-petclinic-genai-service` | Optional AI chat-bot | Microservices, RDBMS |
| `spring-petclinic-discovery-server` | Service registry / discovery | All microservices |
| RDBMS | Persistent storage | Customers, Vets, Visits |

## Architecture and Specifications
### Project
- Kanban as agile methodology
- Breakdown of work and phases:
	- Infrastructure Setup
	- Service Orchestration
	- Configuration Management
	- CI/CD Automation
	- Security
	- Resilience
	- Observability
#### Assignments:

| Role | Responsibilities |
|----|----|
| **Cloud Architect** | Design AWS target architecture, network, and IAM structure |
| **DevOps Engineer** | Build CI/CD pipelines, container orchestration, monitoring setup  |
| **Dev Lead** | Containerize services, modify configs for cloud compatibility |
| **Database Engineer** | Migrate data from local RDBMS to AWS RDS, manage schema updates |
| **Security Team** | Set up access and roles for services |
| **Everyone** | Validate deployments, pipeline runs, rollback testing |
| **Project Lead** | Manage Asana Kanban board, ensure alignment and progress tracking |

### Source Code

- Architecture Type: Microservices deployed via containers, managed by ECS, behind an AWS Application Load Balancer (ALB).
- Review via pull request process:
	- All commits merged via PRs.
	- Peer review required before merging.
- Validation: Run tests during the pipeline build.

## CI/CD

- Goal: Automate the entire software delivery process for all PetClinic microservices via Jenkins.

### Development Cycle Stages

| Stage | Description |
|----|----|
| **01. Code & Merge** | Developer writes code and merges it with the staging branch |
| **02. Build** | Compile, resolve dependencies |
| **03. Unit Test** | Run service-level tests |
| **04. Containerize** | Build Docker image for service |
| **05. Security Scan** | Security validation via Trivy |
| **06. Push to Registry** | Push validated image |
| **07. Deploy to Staging** | Deploy for validation |
| **08. Integration Tests** | Validate service communication |
| **09. Deploy to Production** | Promote validated build |
| **10. Observability Check** | Validate monitoring and alerts |
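
The stages run in order, and a failure in any stage should keep later stages (in particular the production deploy) from running. The sketch below models that fail-fast ordering in Python. The real pipeline would be defined in Jenkins, and the stage callables here are hypothetical placeholders, not the project's actual jobs.

```python
from typing import Callable, List, Tuple

# Stage names follow the table above; the callables are placeholders.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, action in stages:
        if not action():
            break  # later stages, such as the production deploy, never run
        completed.append(name)
    return completed


def example_stages(scan_passes: bool) -> List[Stage]:
    """Build a hypothetical stage list where only the scan result varies."""
    return [
        ("Build", lambda: True),
        ("Unit Test", lambda: True),
        ("Containerize", lambda: True),
        ("Security Scan", lambda: scan_passes),
        ("Push to Registry", lambda: True),
        ("Deploy to Staging", lambda: True),
        ("Integration Tests", lambda: True),
        ("Deploy to Production", lambda: True),
        ("Observability Check", lambda: True),
    ]


if __name__ == "__main__":
    # A failed security scan stops the run before the image is pushed.
    print(run_pipeline(example_stages(scan_passes=False)))
```

With a failing scan, only the build, unit test, and containerize stages complete, so an unvalidated image never reaches the registry.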

### Jobs and environments

- Each microservice has its own Jenkins pipeline per environment.

| Environment | Purpose | Infrastructure |
|----|----|----|
| **Development (Local)** | Local testing, feature validation | Docker Compose |
| **Staging (AWS)** | Integration and pre-prod testing | ECS (staging cluster), RDS (test DB) |
| **Production (AWS)** | Live system | ECS (prod cluster), RDS (prod DB) |

## Storage

| **Type** | **Service** | **Use / Description** | **IOPS / Performance** | **Volume / Size** | **Backup Strategy** |
|-----|-----|-----|-----|-----|-----|
| **1. Database (RDBMS)** | Amazon RDS (MySQL) | Structured data for each microservice schema | 3,000 baseline (gp3), provisioned higher if needed | 20 GB per schema | Automated daily snapshots (14-day retention) |
| **2. Block Storage** | Amazon EBS (gp3) | EC2-hosted Jenkins and ECS container instances | 3,000 baseline | N/A | Not necessary |
| **3. Object Storage** | Amazon S3 | Logs, backups, images | Standard or Infrequent Access tiers | N/A | Cross-region replication or versioning enabled |
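
To make the 14-day retention window concrete, the sketch below picks out snapshot dates that fall outside it. RDS applies the retention period for automated backups itself, so this is only an illustration of what the window means. The snapshot dates are made up for the example.

```python
from datetime import date, timedelta
from typing import List

RETENTION_DAYS = 14  # from the RDS row in the storage table


def expired_snapshots(
    snapshot_dates: List[date],
    today: date,
    retention_days: int = RETENTION_DAYS,
) -> List[date]:
    """Return snapshot dates that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in snapshot_dates if d < cutoff]


if __name__ == "__main__":
    today = date(2025, 11, 17)
    # Hypothetical history: one snapshot per day for the last 20 days.
    history = [today - timedelta(days=i) for i in range(20)]
    for snapshot in expired_snapshots(history, today):
        print(snapshot.isoformat())
```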


## Data

### 1. Location
- `eu-central-1` region
- Place the database (RDS) and services in the same region and AZs.

### 2. Replication / Distribution
| Data Type | Replication / Distribution Strategy |
|----|----|
| **RDS (MySQL)** | Multi-AZ synchronous replication |
| **S3 (images, artifacts)** | Automatic cross-AZ durability |

### 3. Links / Access

| Access type | Route|
|----|----|
| **Internal** | Microservices access RDS over private subnets inside the VPC. |
| | Images in S3 accessed via IAM roles or pre-signed URLs. |
| **External** | ALB routes external requests. |
| | HTTPS enforced for secure data transfer. |

## Network

### Location

- `eu-central-1` region
- Deploy services across multiple AZs for high availability.
- All microservices, databases, and supporting infrastructure live inside a single VPC.
- Isolate the network from the public internet by default.

### Network Segmentation & Filtering
- Public subnets: ALB, NAT gateway.
- Private subnets: ECS, RDS.
- Security groups: Service-specific firewall rules.
- Adjust default network ACLs if necessary.

### Addressing

- VPC: `10.0.0.0/16`
- Public subnet: `10.0.1.0/24`
- Private subnet: `10.0.2.0/24`
- Every service/database gets an internal IP in the private subnet.
- Only the load balancer and NAT gateway have public IPs.
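
The address plan can be sanity-checked before any infrastructure is provisioned. The sketch below uses Python's standard `ipaddress` module to confirm that each subnet lies inside the VPC range and that the subnets do not overlap. The function names are illustrative, and the usable-address count assumes AWS reserves five addresses in every subnet.

```python
import ipaddress
from typing import Dict

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private": ipaddress.ip_network("10.0.2.0/24"),
}

AWS_RESERVED_PER_SUBNET = 5  # assumption: AWS reserves 5 addresses per subnet


def validate_subnets(
    vpc: ipaddress.IPv4Network,
    subnets: Dict[str, ipaddress.IPv4Network],
) -> None:
    """Raise ValueError if a subnet is outside the VPC or overlaps another."""
    items = list(subnets.items())
    for name, net in items:
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} subnet {net} is outside VPC {vpc}")
    for i, (name_a, net_a) in enumerate(items):
        for name_b, net_b in items[i + 1:]:
            if net_a.overlaps(net_b):
                raise ValueError(f"{name_a} and {name_b} subnets overlap")


def usable_addresses(subnet: ipaddress.IPv4Network) -> int:
    """Return the number of addresses left after the AWS-reserved ones."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    validate_subnets(VPC_CIDR, SUBNETS)
    for name, net in SUBNETS.items():
        print(name, net, usable_addresses(net))
```

The same check extends to additional subnets, for example one public and one private subnet per Availability Zone, by adding entries to `SUBNETS`.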

## Compute

### Nodes

| Environment | Nodes | Notes |
|----|----|----|
| **Staging** | 3 ECS container instances (EC2) | Handles staging microservices, mirrors production setup |
| **Production / Live** | 3 ECS container instances (EC2) | Fixed-size cluster, no autoscaling to reduce costs |
| **Scalability** | N/A for autoscaling | Fixed node count to reduce cost but still allow horizontal scaling via ECS task count or manual node addition. |

### Container Management

#### Container Registry:
- Amazon ECR for all microservice Docker images.
- Each microservice image tagged by Git commit SHA.
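
Tagging by commit SHA means every image reference is fully determined by the repository and the commit. The sketch below builds such a reference in the standard ECR URI form. The account ID is the usual AWS documentation placeholder, and the SHA and the one-repository-per-service naming are assumptions made for illustration.

```python
import re

# Accept abbreviated (7+) or full (40) lowercase hex Git commit SHAs.
_SHA_PATTERN = re.compile(r"^[0-9a-f]{7,40}$")


def ecr_image_uri(account_id: str, region: str, repository: str, commit_sha: str) -> str:
    """Build an ECR image reference tagged with a Git commit SHA."""
    if not _SHA_PATTERN.match(commit_sha):
        raise ValueError(f"not a Git commit SHA: {commit_sha!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{commit_sha}"


if __name__ == "__main__":
    # Placeholder account ID and a made-up SHA, for illustration only.
    print(
        ecr_image_uri(
            account_id="123456789012",
            region="eu-central-1",
            repository="spring-petclinic-vets-service",
            commit_sha="3f2a9c1",
        )
    )
```

Rejecting tags such as `latest` keeps every deployed image traceable to a single commit.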

#### Deployment Strategy:

- ECS tasks run one or more containers per node.
- Service definitions ensure each microservice has the desired number of tasks.
- Jenkins updates ECS service definition after build.

### ECS Orchestration

#### Cluster Setup:
- One ECS cluster per environment (staging and production).
- EC2 launch type for fixed nodes.

#### Service Definitions:
- Each microservice has an ECS service with a desired task count.
- Each service is linked to the ALB.

## Security

| **Area** | **Implementation / Notes** |
|----|----|
| **1. Authentication, Authorization, Auditing (AAA)** | Spring Security |
| | IAM roles restrict AWS access per service |
| | Auditing: Not relevant since we don't handle sensitive data|
| | CloudWatch for app/service logs |
| **2. Code Security** | Static analysis via SonarQube|
| | No hardcoded credentials |
| | Secrets in AWS Secrets Manager |
| | Dependency scanning  via Dependabot |
| **3. Traffic Security** | HTTPS enforced via ALB |
| | Internal TLS optional for microservices |
| | Security groups restrict inbound/outbound ports |
| | Private subnets for internal services and databases |
| **4. Instance / Container Security** | Use minimal and updated AMIs |
| | Regular patching, no direct SSH (bastion-only) |
| | Containers run as non-root users |
| | Vulnerability scanning before deploy |
| | Secrets passed via IAM roles or ECS environment vars |
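
The least-privilege rule (no default full-access policies) can be checked mechanically. The sketch below scans an IAM policy document, represented as a Python dictionary, for `Allow` statements that grant every action on every resource. The example policies and the secret ARN are hypothetical.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """IAM allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def full_access_statements(policy: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return Allow statements that grant every action on every resource."""
    flagged = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        resources = _as_list(statement.get("Resource"))
        if "*" in actions and "*" in resources:
            flagged.append(statement)
    return flagged


# Hypothetical per-service policy: read one secret only.
SCOPED_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:eu-central-1:123456789012:secret:petclinic/customers-db-AbCdEf",
        }
    ],
}

# Full-access policy of the kind the requirements rule out.
FULL_ACCESS_POLICY = {
    "Version": "2012-10-17",
    "Statement": {"Effect": "Allow", "Action": "*", "Resource": "*"},
}

if __name__ == "__main__":
    print(len(full_access_statements(SCOPED_POLICY)))
    print(len(full_access_statements(FULL_ACCESS_POLICY)))
```

This only catches the bare `*` case. Broad service-level wildcards such as `s3:*` would need a stricter rule.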

## Observability
| **Aspect**  | **Tools** | **Notes** |
|----|----|----|
| **Metrics** | **Prometheus** | Collect CPU, memory, and ECS task metrics from node exporters |
| | | If microservices expose `prometheus-metrics`, integrate directly. |
| | **Grafana** | Dashboards for system and service health |
| **Logs** | **AWS CloudWatch Logs** | ECS task logs streamed to CloudWatch|
| | |Structured JSON logging for easy filtering and search.|
| **Traces**  | **AWS X-Ray** | Trace API calls across microservices. |
| **Alerts**  | **CloudWatch Alarms** | CloudWatch for infrastructure-level alerts (CPU, memory, ECS health). |
| | **Grafana Alerts** | Grafana alert rules for application metrics from Prometheus. |
| | | Alerts via email or Slack webhook.|
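
Structured JSON logging means each log line is one JSON object with stable field names, so CloudWatch queries can filter on fields instead of parsing free text. The PetClinic services are Spring Boot applications and would use a Java logging encoder in practice. The Python sketch below only illustrates the shape of such a log line, and the field names are assumptions.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            # "service" is supplied through the `extra` argument when logging.
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("petclinic")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("visit created", extra={"service": "visits-service"})
```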

## Continuity & Recovery

| **Aspect** | **Approach / Tooling** | **Notes** |
|----|----|----|
| **Redundancy** | Multi-AZ deployment | RDS and ECS nodes deployed across multiple Availability Zones for high availability.|
| | | Load balancer automatically routes traffic to healthy tasks. |
| **Failover** | AWS-managed failover | RDS Multi-AZ provides automatic database failover. |
| | | ECS services automatically restart failed tasks on healthy nodes. |
| | | Manual intervention only needed for regional failures. |
| **Backup** | AWS Backup / RDS Snapshots | Automated RDS daily backups with retention policy. |
| | S3 Versioning | S3 bucket versioning for uploaded images and configs.|
| **Business Continuity Plan** | Operate from secondary region if needed | Documented procedure to restore environment in another AWS region using IaC templates (Terraform). |
| | | Prioritize restoring RDS, Config Server, and API Gateway. |
| **Disaster Recovery Plan** | Cold standby in alternate region | No live duplication to save cost.|
| | | Periodic replication of backups and images to secondary region. |

## Architecture Diagram
<p align="center">
  <img src="assets/week1/aws-architecture-diagram.png" alt="AWS architecture diagram"/>
</p>

## Solution Stack

| **Layer** | **Technologies / Services** |
|----|----|
| **Application Layer** | Spring Boot microservices |
| **Runtime / Platform Layer** | Docker, Amazon ECS, Amazon ECR |
| **CI/CD Layer** | Jenkins, Gitea |
| **Infrastructure Layer** | Terraform, Ansible, Amazon EC2, VPC, subnets, security groups |
| **Database / Storage Layer** | Amazon RDS (MySQL), Amazon S3, Amazon EBS |
| **Observability Layer** | Prometheus, Grafana, CloudWatch, AWS X-Ray |
| **Security Layer** | AWS IAM, Security Groups, HTTPS via ALB, Secrets Manager |
| **Continuity & Recovery Layer** | RDS automated snapshots, S3 versioning/replication, multi-AZ RDS, Terraform for redeploy |
| **Network & Delivery Layer** | Application Load Balancer (ALB), Route 53, NAT Gateway, Internet Gateway |