Hey! I'm an experienced Data Engineer with years of expertise in building fault-tolerant systems for high-volume data. My skill set spans ETL pipelines, streaming data, and the fascinating world of distributed computing.
☁️ At home in Azure and AWS, I orchestrate data symphonies with PySpark and Airflow. Fluent in Python, I move data seamlessly through Kafka streams.
📊 I'm not just crunching numbers; I tell stories with SQL, shaping raw data into real-time insights. As a web scraping specialist, I extract gems from the digital landscape.
📈 Bringing data to life with interactive dashboards in Plotly Dash, I turn the web into a captivating stage for insightful performances! 🎭✨
A clickstream ETL (Extract, Transform, Load) pipeline that consumes real-time data from a Kafka and Zookeeper cluster. Python handles initial processing and enrichment before the raw clickstream data is stored in Apache Cassandra. That raw data is then picked up by PySpark on Databricks for the heavier transformations, and the processed results are stored in Elasticsearch for efficient querying and analysis.
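The enrichment step in such a pipeline might look like the sketch below. The event shape (`user_id`, `url`, `ts`) and the derived fields are assumptions for illustration, not the project's actual schema; the Kafka consumer loop and Cassandra write are indicated only in comments, since they require a running cluster.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlparse

def enrich_event(raw: bytes) -> dict:
    """Parse one raw clickstream message and add derived fields.

    Assumed (hypothetical) event shape:
    {"user_id": ..., "url": "...", "ts": <epoch seconds>}
    """
    event = json.loads(raw)
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    # A date string is a common Cassandra partition-key candidate
    # for time-bucketed clickstream tables.
    event["event_date"] = ts.date().isoformat()
    event["domain"] = urlparse(event["url"]).netloc
    return event

# In the real pipeline, a consumer loop would wrap this function, e.g.:
#   for msg in kafka_consumer:             # kafka-python / confluent-kafka
#       session.execute(insert_stmt, enrich_event(msg.value))  # Cassandra driver
```

Keeping the enrichment pure (bytes in, dict out) makes it trivial to unit-test independently of Kafka or Cassandra.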
A distributed system that detects and analyzes the technologies used across 200M+ domains. It uses Playwright to fetch HTML through proxies, then applies predefined patterns and regex matching to identify web technologies. Built with Python and RabbitMQ for distributed processing across multiple servers.
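The pattern-matching core of a detector like this can be sketched as follows. The fingerprints here are toy examples I made up for illustration; a real rule set (Wappalyzer-style) covers headers, cookies, meta tags, and script URLs, not just body text.

```python
import re

# Hypothetical fingerprints: technology name -> compiled pattern
# matched against the fetched HTML.
TECH_PATTERNS = {
    "WordPress": re.compile(r"wp-content|wp-includes", re.I),
    "React": re.compile(r"data-reactroot|__NEXT_DATA__", re.I),
    "jQuery": re.compile(r"jquery[.\-]", re.I),
}

def detect_technologies(html: str) -> set:
    """Return the names of all technologies whose pattern matches the HTML."""
    return {name for name, pattern in TECH_PATTERNS.items()
            if pattern.search(html)}
```

In the distributed setup described above, each worker would pull a domain from a RabbitMQ queue, fetch its HTML with Playwright through a proxy, and run a function like this over the result before publishing the findings.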