Cleanlytics AI

An AI-powered enterprise data quality platform built with Python and Streamlit to automate data profiling, cleaning, schema inference, anomaly detection, and reporting workflows.

Python, Streamlit, Pandas, NumPy, Scikit-learn, Plotly, ReportLab, Machine Learning, Data Cleaning

Key Metrics & Results

  • Platform Type: AI-Powered. Built an intelligent data quality automation platform for profiling, cleaning, validation, and reporting.
  • Outlier Detection Methods: 3. Implemented IQR, Z-Score, and Isolation Forest for statistical and ML-based anomaly detection.
  • Dashboard Architecture: Multi-Page. Designed a scalable Streamlit application with modular pages for profiling, cleaning, reports, and insights.
  • Reporting System: Automated. Added PDF report generation, dataset export, backup recovery, and cleaning audit logs.

Project Overview

Cleanlytics AI is designed to reduce manual data cleaning effort. Through an interactive dashboard, users can upload datasets, analyze data quality issues, detect missing values and outliers, infer suitable data types, apply cleaning actions, generate reports, and export cleaned datasets.

The Problem

Real-world datasets are often messy, incomplete, inconsistent, and difficult to prepare manually. Data analysts spend significant time identifying missing values, incorrect data types, duplicate records, outliers, and formatting issues before analysis or machine learning can begin.

The Solution

I built Cleanlytics AI to automate major parts of the data preparation workflow. The platform profiles each uploaded dataset, recommends suitable data types using schema inference logic, detects anomalies using statistical and machine learning methods, applies automated cleaning actions, maintains audit logs, and generates downloadable reports.

Data Visualizations & Insights

Intelligent Data Profiling

Insight

Automatically analyzes dataset structure, missing values, column types, unique values, and data quality issues.
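
A minimal sketch of what this kind of per-column profiling could look like with Pandas. The function name `profile_dataframe` and the sample data are illustrative assumptions, not the project's actual code.

```python
import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with basic data quality statistics."""
    n_rows = len(df)
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(dropna=True),
    })
    # Share of missing values per column, as a percentage of all rows.
    summary["missing_pct"] = (summary["missing"] / max(n_rows, 1) * 100).round(2)
    return summary


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({
    "age": [25, None, 31, 31],
    "city": ["Pune", None, "Delhi", "Delhi"],
})
report = profile_dataframe(sample)
duplicate_rows = int(sample.duplicated().sum())  # count of repeated full rows
print(report)
print("duplicate rows:", duplicate_rows)
```

A summary table like this maps naturally onto a dashboard view, since each row describes one column of the uploaded dataset.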

AI Schema Inference

Insight

Recommends suitable data types such as integer, float, datetime, category, and object based on value patterns and conversion ratios.

Outlier Detection

Insight

Detects abnormal values using IQR, Z-Score, and Isolation Forest with options to fix or handle detected anomalies.
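
The two statistical rules can be sketched as follows. This is a simplified illustration under assumed defaults (a 1.5 IQR multiplier and a Z-score threshold of 3); the function names are hypothetical, and capping to the IQR fences is shown as just one possible fixing strategy.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside the IQR fences."""
    lower, upper = iqr_bounds(series, k)
    return (series < lower) | (series > upper)


def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask of values whose absolute Z-score exceeds the threshold."""
    std = series.std(ddof=0)
    if pd.isna(std) or std == 0:
        # Constant or empty column: nothing can be flagged.
        return pd.Series(False, index=series.index)
    z = (series - series.mean()) / std
    return z.abs() > threshold


# Hypothetical sample with one obvious outlier (95).
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95], dtype=float)
iqr_mask = iqr_outliers(values)
z_mask = zscore_outliers(values)

# One possible fix: cap extreme values to the IQR fences.
lower, upper = iqr_bounds(values)
capped = values.clip(lower=lower, upper=upper)
```

In small samples a single extreme value inflates the standard deviation, so the Z-score rule can miss outliers that the IQR rule catches. That is one reason to offer both.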

Automated Reporting

Insight

Generates professional PDF reports and exports cleaned datasets for future analysis.

Business Impact

  • Reduces manual data cleaning time for analysts.
  • Improves dataset quality before analytics or machine learning.
  • Helps users detect hidden issues such as wrong data types, missing values, and anomalies.
  • Provides automated reporting and audit logs for transparency.
  • Demonstrates practical data analytics, machine learning, and dashboard development skills.

Technologies & Tools

Python

Used as the core programming language for data processing, cleaning logic, and backend workflows.

Streamlit

Used to build the interactive multi-page dashboard and responsive UI.

Pandas & NumPy

Used for dataset manipulation, missing value handling, profiling, and numerical operations.

Scikit-learn

Used for machine learning-based anomaly detection with Isolation Forest.

Plotly

Used for interactive visualizations and dashboard charts.

ReportLab

Used for automated PDF report generation.

Key Features

  • Dataset upload and preview
  • Automated data profiling
  • Missing value detection
  • Duplicate record detection
  • AI-based data type recommendation
  • Schema inference with confidence logic
  • IQR outlier detection
  • Z-Score outlier detection
  • Isolation Forest anomaly detection
  • Automated fixing strategies
  • Cleaning audit logs
  • Backup and recovery system
  • PDF report generation
  • Cleaned dataset export
  • Responsive SaaS-style dashboard UI
  • Modular multi-page architecture

Challenges & Solutions

Handling Messy Real-World Data

Solution Applied:

Built automated profiling and cleaning workflows to detect and handle common data quality issues including missing values, incorrect formats, wrong data types, duplicate records, and inconsistent values.
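
A simplified sketch of a cleaning step that keeps a backup and records an audit log. The function name, the log fields, and the fill strategies (median for numeric columns, most frequent value otherwise) are assumptions chosen for illustration, not the platform's actual rules.

```python
from datetime import datetime, timezone

import pandas as pd


def clean_with_audit(df: pd.DataFrame):
    """Drop duplicate rows and fill missing values, logging each action."""
    backup = df.copy()   # untouched copy, usable for recovery
    cleaned = df.copy()
    log = []

    def record(action, column, rows_affected):
        log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "column": column,
            "rows_affected": int(rows_affected),
        })

    n_dupes = int(cleaned.duplicated().sum())
    if n_dupes:
        cleaned = cleaned.drop_duplicates().reset_index(drop=True)
        record("drop_duplicates", None, n_dupes)

    for col in cleaned.columns:
        n_missing = int(cleaned[col].isna().sum())
        if n_missing == 0:
            continue
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            fill_value, action = cleaned[col].median(), "fill_median"
        else:
            modes = cleaned[col].mode(dropna=True)
            if modes.empty:
                continue  # column is entirely missing; leave it as-is
            fill_value, action = modes.iloc[0], "fill_mode"
        cleaned[col] = cleaned[col].fillna(fill_value)
        record(action, col, n_missing)

    return cleaned, backup, log


# Hypothetical sample data, for illustration only.
raw = pd.DataFrame({
    "price": [10.0, 10.0, None, 30.0],
    "category": ["a", "a", "b", None],
})
cleaned, backup, log = clean_with_audit(raw)
```

Returning the backup and the log alongside the cleaned frame keeps every change traceable and reversible.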

Data Type Recommendation

Solution Applied:

Created schema inference logic using numeric conversion ratio, datetime parsing success, uniqueness ratio, and pattern-based checks.
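
The ratio-based part of this logic could be sketched as below. The thresholds (`min_ratio`, `max_category_ratio`), the order of the checks, and the returned score are assumptions for illustration; pattern-based checks are omitted.

```python
import warnings

import pandas as pd


def infer_column_type(series: pd.Series, min_ratio: float = 0.9,
                      max_category_ratio: float = 0.5):
    """Return (recommended_type, ratio that drove the decision)."""
    values = series.dropna().astype(str).str.strip()
    if values.empty:
        return "object", 0.0

    # 1. Numeric conversion ratio: share of values that parse as numbers.
    numeric = pd.to_numeric(values, errors="coerce")
    numeric_ratio = float(numeric.notna().mean())
    if numeric_ratio >= min_ratio:
        is_whole = bool((numeric.dropna() % 1 == 0).all())
        return ("integer" if is_whole else "float"), numeric_ratio

    # 2. Datetime parsing success: share of values that parse as dates.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence format-inference warnings
        dates = pd.to_datetime(values, errors="coerce")
    datetime_ratio = float(dates.notna().mean())
    if datetime_ratio >= min_ratio:
        return "datetime", datetime_ratio

    # 3. Uniqueness ratio: few distinct values suggests a category.
    unique_ratio = values.nunique() / len(values)
    if unique_ratio <= max_category_ratio:
        return "category", unique_ratio
    return "object", unique_ratio


# Hypothetical columns, for illustration only.
int_type = infer_column_type(pd.Series(["1", "2", "3", None]))
float_type = infer_column_type(pd.Series(["1.5", "2.25", "3"]))
date_type = infer_column_type(pd.Series(["2024-01-05", "2024-02-10", "2024-03-15"]))
cat_type = infer_column_type(pd.Series(["red", "blue", "red", "red"]))
text_type = infer_column_type(pd.Series(["alice", "bob", "carol"]))
```

Running the numeric check first avoids digit-only strings being treated as dates, and the returned ratio can be shown to the user as a rough confidence indicator.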

Outlier Detection

Solution Applied:

Implemented IQR, Z-Score, and Isolation Forest so users can compare statistical and ML-based detection methods.
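
For the ML-based side, a minimal sketch using scikit-learn's `IsolationForest` is shown below. The wrapper name, the `contamination` value, and the synthetic data are assumptions for illustration. Unlike the per-column statistical rules, this approach scores rows using several numeric columns at once.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def isolation_forest_outliers(df: pd.DataFrame, columns: list,
                              contamination: float = 0.05,
                              random_state: int = 42) -> pd.Series:
    """Boolean mask over df rows; True marks rows the model labels anomalous."""
    features = df[columns].dropna()  # the model cannot handle missing values
    model = IsolationForest(contamination=contamination,
                            random_state=random_state)
    labels = model.fit_predict(features)  # -1 = anomaly, 1 = normal
    mask = pd.Series(False, index=df.index)
    mask.loc[features.index] = labels == -1
    return mask


# Synthetic data: 100 typical rows plus one extreme row at the end.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "amount": rng.normal(50, 5, size=100),
    "quantity": rng.normal(100, 10, size=100),
})
data.loc[len(data)] = [500.0, 1000.0]

mask = isolation_forest_outliers(data, ["amount", "quantity"])
print("rows flagged:", int(mask.sum()))
```

Note that `contamination` fixes the approximate share of rows flagged, so some borderline rows are labeled anomalous even in clean data. Comparing the result against the IQR and Z-Score masks helps separate clear outliers from borderline cases.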

Project Usability

Solution Applied:

Designed a modern Streamlit dashboard with a clean layout, modular navigation, action buttons, cards, and report downloads.

Future Enhancements

  • Add NLP-based column meaning detection
  • Add smart cleaning recommendation engine
  • Add automated data quality scoring
  • Add database connection support
  • Add user authentication
  • Add cloud deployment option
  • Add advanced ML-based data validation
  • Add dashboard export and sharing feature
