Cleanlytics AI

An AI-powered enterprise data quality platform built with Python and Streamlit to automate data profiling, cleaning, schema inference, anomaly detection, and reporting workflows.

Python, Streamlit, Pandas, NumPy, Scikit-learn, Plotly, ReportLab, Machine Learning, Data Cleaning


Key Metrics & Results

AI-Powered

Platform Type

Built an intelligent data quality automation platform for profiling, cleaning, validation, and reporting.

Outlier Detection Methods

Implemented IQR, Z-Score, and Isolation Forest for statistical and ML-based anomaly detection.

Multi-Page

Dashboard Architecture

Designed a scalable Streamlit application with modular pages for profiling, cleaning, reports, and insights.

Automated

Reporting System

Added PDF report generation, dataset export, backup recovery, and cleaning audit logs.

Project Overview

Cleanlytics AI is an AI-powered enterprise data quality platform designed to reduce manual data cleaning effort. The platform allows users to upload datasets, analyze data quality issues, detect missing values and outliers, infer correct data types, apply cleaning actions, generate reports, and export cleaned datasets through an interactive dashboard interface.

The Problem

Real-world datasets are often messy, incomplete, inconsistent, and difficult to prepare manually. Data analysts spend significant time identifying missing values, incorrect data types, duplicate records, outliers, and formatting issues before analysis or machine learning can begin.

The Solution

I built Cleanlytics AI to automate major parts of the data preparation workflow. The platform profiles each dataset, recommends suitable data types using schema inference logic, detects anomalies using statistical and machine learning methods, applies automated cleaning actions, maintains audit logs, and generates downloadable reports.

Data Visualizations & Insights

Intelligent Data Profiling

Insight

Automatically analyzes dataset structure, missing values, column types, unique values, and data quality issues.
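
As a rough illustration of this kind of profiling, the sketch below computes per-column statistics with pandas. The function name `profile_dataframe` and the sample data are hypothetical and are not taken from the Cleanlytics AI codebase.

```python
import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of basic quality statistics per column."""
    n_rows = len(df)
    rows = []
    for name in df.columns:
        col = df[name]
        missing = int(col.isna().sum())
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "missing": missing,
            # Guard against division by zero on an empty dataset.
            "missing_pct": round(100 * missing / n_rows, 1) if n_rows else 0.0,
            "unique": int(col.nunique(dropna=True)),
        })
    return pd.DataFrame(rows).set_index("column")


# Hypothetical sample data for illustration only.
sample = pd.DataFrame({
    "age": [25, 31, None, 31],
    "city": ["Pune", "Delhi", "Delhi", "Delhi"],
})
profile = profile_dataframe(sample)
duplicate_rows = int(sample.duplicated().sum())  # fully repeated rows
print(profile)
print("duplicate rows:", duplicate_rows)
```

A table like this can feed a dashboard page directly, with the duplicate count reported alongside it.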

AI Schema Inference

Insight

Recommends suitable data types such as integer, float, datetime, category, and object based on value patterns and conversion ratios.

Outlier Detection

Insight

Detects abnormal values using IQR, Z-Score, and Isolation Forest, with options for handling the detected anomalies.
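
The two statistical methods can be sketched as follows. This is a minimal example under assumed defaults (an IQR multiplier of 1.5 and a Z-Score threshold of 3.0, both common conventions); the function names are hypothetical, and capping to the IQR bounds is shown as one possible handling strategy, not necessarily the one the platform uses.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return (values < lower) | (values > upper)


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where the absolute z-score exceeds the threshold."""
    std = values.std(ddof=0)
    if pd.isna(std) or std == 0:
        # A constant or empty column has no z-score outliers.
        return pd.Series(False, index=values.index)
    z = (values - values.mean()) / std
    return z.abs() > threshold


def cap_to_iqr_bounds(values: pd.Series, k: float = 1.5) -> pd.Series:
    """One possible fix: clip values to the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return values.astype(float).clip(lower=lower, upper=upper)


# Hypothetical sample with one obviously extreme value.
amounts = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])
iqr_mask = iqr_outliers(amounts)
z_mask = zscore_outliers(amounts)
capped = cap_to_iqr_bounds(amounts)
print(amounts[iqr_mask].tolist(), amounts[z_mask].tolist(), capped.max())
```

On small samples the Z-Score method can miss outliers that IQR catches, because a single extreme value inflates the standard deviation; that is one reason to offer both.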

Automated Reporting

Insight

Generates professional PDF reports and exports cleaned datasets for future analysis.

Business Impact

Reduces manual data cleaning time for analysts.
Improves dataset quality before analytics or machine learning.
Helps users detect hidden issues such as wrong data types, missing values, and anomalies.
Provides automated reporting and audit logs for transparency.
Demonstrates practical data analytics, machine learning, and dashboard development skills.

Technologies & Tools

Python

Used as the core programming language for data processing, cleaning logic, and backend workflows.

Streamlit

Used to build the interactive multi-page dashboard and responsive UI.

Pandas & NumPy

Used for dataset manipulation, missing value handling, profiling, and numerical operations.

Scikit-learn

Used for machine learning-based anomaly detection with Isolation Forest.
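
A minimal sketch of Isolation Forest detection with scikit-learn is shown below. The helper name `isolation_forest_flags`, the `contamination` value, and the synthetic data are assumptions made for illustration; the project's actual parameters are not documented here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def isolation_forest_flags(df: pd.DataFrame, columns: list,
                           contamination: float = 0.05,
                           seed: int = 0) -> pd.Series:
    """Return a boolean Series marking rows the model labels as anomalies."""
    # Isolation Forest cannot handle NaN, so fit only on complete rows.
    features = df[columns].dropna()
    model = IsolationForest(n_estimators=100,
                            contamination=contamination,
                            random_state=seed)
    labels = model.fit_predict(features)  # -1 = anomaly, 1 = normal
    flags = pd.Series(False, index=df.index)
    flags.loc[features.index] = labels == -1
    return flags


# Synthetic data: 100 ordinary rows plus one extreme row at the end.
rng = np.random.default_rng(0)
normal_rows = rng.normal(loc=50, scale=5, size=(100, 2))
data = pd.DataFrame(np.vstack([normal_rows, [[500.0, 500.0]]]),
                    columns=["amount", "quantity"])
flags = isolation_forest_flags(data, ["amount", "quantity"])
print("flagged rows:", int(flags.sum()))
```

Note that `contamination` sets the expected share of anomalies, so the model flags roughly that fraction of rows even in clean data; exposing it as a user setting is a reasonable design choice.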

Plotly

Used for interactive visualizations and dashboard charts.

ReportLab

Used for automated PDF report generation.

Key Features

Dataset upload and preview
Automated data profiling
Missing value detection
Duplicate record detection
AI-based datatype recommendation
Schema inference with confidence logic
IQR outlier detection
Z-Score outlier detection
Isolation Forest anomaly detection
Automated fixing strategies
Cleaning audit logs
Backup and recovery system
PDF report generation
Cleaned dataset export
Responsive SaaS-style dashboard UI
Modular multi-page architecture

Challenges & Solutions

Handling Messy Real-World Data

Solution Applied:

Built automated profiling and cleaning workflows to detect and handle common data quality issues including missing values, incorrect formats, wrong data types, duplicate records, and inconsistent values.
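
One way such a workflow could pair cleaning actions with a backup and an audit log is sketched below. The function `clean_with_audit`, the log fields, and the choice of median imputation are assumptions for illustration, not the platform's documented behavior.

```python
from datetime import datetime, timezone

import pandas as pd


def clean_with_audit(df: pd.DataFrame):
    """Drop duplicate rows and fill numeric gaps, logging every action.

    Returns (cleaned, backup, log); the backup allows the original data
    to be restored if a cleaning step turns out to be unwanted.
    """
    backup = df.copy()
    log = []

    def record(action: str, detail: str) -> None:
        log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "detail": detail,
        })

    cleaned = df.drop_duplicates()
    record("drop_duplicates", f"removed {len(df) - len(cleaned)} rows")

    for name in cleaned.select_dtypes(include="number").columns:
        n_missing = int(cleaned[name].isna().sum())
        if n_missing:
            median = cleaned[name].median()
            cleaned = cleaned.assign(**{name: cleaned[name].fillna(median)})
            record("fill_missing",
                   f"{name}: filled {n_missing} values with median {median}")

    return cleaned, backup, log


# Hypothetical sample: one duplicated row and one missing score.
raw = pd.DataFrame({"id": [1, 2, 2, 3],
                    "score": [10.0, None, None, 30.0]})
cleaned, backup, audit_log = clean_with_audit(raw)
for entry in audit_log:
    print(entry["action"], "-", entry["detail"])
```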

Datatype Recommendation

Solution Applied:

Created schema inference logic using numeric conversion ratio, datetime parsing success, uniqueness ratio, and pattern-based checks.
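
The ratio-based checks described above could look roughly like the following sketch. The function name `recommend_dtype` and both thresholds (`min_ratio`, `max_category_ratio`) are assumed values chosen for illustration; the project's real thresholds and pattern checks are not stated here.

```python
import warnings

import pandas as pd


def recommend_dtype(col: pd.Series, min_ratio: float = 0.9,
                    max_category_ratio: float = 0.5) -> str:
    """Suggest integer, float, datetime, category, or object for a column."""
    values = col.dropna()
    if values.empty:
        return "object"
    text = values.astype(str).str.strip()

    # Numeric conversion ratio: share of values that parse as numbers.
    numeric = pd.to_numeric(text, errors="coerce")
    if numeric.notna().mean() >= min_ratio:
        parsed = numeric.dropna()
        return "integer" if (parsed % 1 == 0).all() else "float"

    # Datetime parsing success: share of values that parse as dates.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence format-inference warnings
        dates = pd.to_datetime(text, errors="coerce")
    if dates.notna().mean() >= min_ratio:
        return "datetime"

    # Uniqueness ratio: few distinct values suggests a categorical column.
    if text.nunique() / len(text) <= max_category_ratio:
        return "category"
    return "object"


# Hypothetical columns for illustration only.
ids = pd.Series(["1", "2", "3", None])
prices = pd.Series(["1.5"] + [str(i) for i in range(10)] + ["n/a"])
joined = pd.Series(["2024-01-05", "2024-02-10", "2024-03-15"])
colors_col = pd.Series(["red", "blue", "red", "blue", "red", "red"])
names = pd.Series(["alice", "bob", "carol", "dave"])
for label, column in [("ids", ids), ("prices", prices), ("joined", joined),
                      ("colors", colors_col), ("names", names)]:
    print(label, "->", recommend_dtype(column))
```

Checking numeric conversion before datetime parsing matters, because some date parsers will accept bare numbers as dates.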

Outlier Detection

Solution Applied:

Implemented IQR, Z-Score, and Isolation Forest so users can compare statistical and ML-based detection methods.

Project Usability

Solution Applied:

Designed a modern Streamlit dashboard with a clean layout, modular navigation, action buttons, cards, and report downloads.

Future Enhancements

Add NLP-based column meaning detection
Add smart cleaning recommendation engine
Add automated data quality scoring
Add database connection support
Add user authentication
Add cloud deployment option
Add advanced ML-based data validation
Add dashboard export and sharing feature

Interested in this project?

I'd love to discuss the technical details, methodology, and learnings from this project. Feel free to reach out to learn more!
