- 🔭 Currently, my main interest is developing data-related projects, such as ETL (Extract, Transform, Load) pipelines in the cloud, web scraping with Python, and data warehouses for analytics.
- ⚙️ Besides data-related projects, I am also highly interested in low-level programming using C and C++. I have developed several projects in this domain, including embedded systems on an ESP-32 and interactive games using threads in these languages.
- 🌱 I am learning how to use various technologies, including Python, Pandas, Airflow, AWS, SQL, Postgres, Selenium, LangChain, C++, and C.
- 📫 You can reach me at the email [email protected].
Data Warehouse and automated ETL pipeline for extracting and analyzing public Brazilian government data with interactive dashboards
This project aims to develop a Data Warehouse (DW) that consolidates multiple public government data points over several years, focusing on socio-economic indicators. The DW will support analytical queries and time-series analysis, providing decision-makers with deeper insights into areas such as Economic Activity, Environmental Policies and Damage, and Public Health. Additionally, the project features an ETL pipeline to automate the collection, transformation, and loading of data from public sources into the DW. The end goal is to use the DW to serve interactive dashboards that make this data easier to analyze.
The project builds upon the educational capabilities of Large Language Models (e.g., GPT-3.5 and GPT-4) while mitigating weaknesses such as hallucination and lack of knowledge about certain subjects and tests in the Brazilian standardized university admission exam (ENEM).
To achieve these results, an LLM application was developed using OpenAI models (GPT-3.5 Turbo or GPT-4) along with additional modules, such as internet search and retrieval-augmented generation (RAG), for extra functionality.
According to user feedback, over 60% of users said our solution gives better and more accurate answers than ChatGPT.
Helpful prompts and data extracted from official sources about the ENEM test were used for better results.
For RAG over ENEM test questions, a GPT action and its associated API were used. The API is hosted on AWS API Gateway and uses a Lambda function that takes the user input, embeds it with OpenAI embeddings, and then queries the Qdrant vector database for the N questions most similar to that input, where N is the number of questions the user requested.
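The retrieval step described above can be illustrated with a minimal sketch. It is not the project's actual code: the embedding call and the Qdrant query are replaced by an in-memory cosine-similarity search, and the names `top_n_questions` and `stored`, as well as the toy 3-dimensional vectors, are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_questions(query_vector, stored_questions, n):
    """Return the n stored questions most similar to the query vector.

    In-memory stand-in for a vector database search (hypothetical helper).
    """
    scored = [
        (cosine_similarity(query_vector, q["vector"]), q)
        for q in stored_questions
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:n]]


# Toy vectors standing in for real embeddings (made-up values).
stored = [
    {"id": 1, "text": "Question about photosynthesis", "vector": [0.9, 0.1, 0.0]},
    {"id": 2, "text": "Question about the French Revolution", "vector": [0.0, 0.2, 0.9]},
    {"id": 3, "text": "Question about cellular respiration", "vector": [0.8, 0.3, 0.1]},
]

query = [1.0, 0.2, 0.0]  # would come from embedding the user's input
result = top_n_questions(query, stored, n=2)
print([q["id"] for q in result])
```

In a deployed setup, the similarity search would be delegated to the vector database rather than computed in the function itself; the sketch only shows the "embed, score, keep the top N" logic.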
For the educational chatbots, both the website and the custom GPT version, I needed a large dataset of ENEM questions and their correct answers to support RAG and reduce LLM hallucinations (such as giving the wrong answer to a question), but no such large-scale dataset was available online.
In this context, I created this project, which combines PDF/data mining through libraries like PyMuPDF2 to transform the ENEM PDFs into either textual data or JSON files (the Extract and Transform parts) with a Qdrant vector database loader that loads the data into the vector store (the Load part). This combination can process either single test PDFs (and their associated answer PDFs) or entire folders with multiple tests, loading hundreds of questions at once, while writing metadata and stats about the extraction process (number of extracted questions per year and subject) to a CSV file through a Pandas DataFrame.
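As a rough illustration of the stats step mentioned above (number of extracted questions per year and subject, written to CSV), here is a minimal sketch. It uses only the standard library (`collections.Counter` and `csv`) instead of a Pandas DataFrame so that it runs without dependencies; the record layout and the names `extraction_stats` and `stats_to_csv` are assumptions, not the project's actual code.

```python
import csv
import io
from collections import Counter


def extraction_stats(questions):
    """Count extracted questions per (year, subject) pair."""
    counts = Counter((q["year"], q["subject"]) for q in questions)
    return [
        {"year": year, "subject": subject, "extracted_questions": count}
        for (year, subject), count in sorted(counts.items())
    ]


def stats_to_csv(rows):
    """Serialize the stats rows to CSV text (a real run would write a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["year", "subject", "extracted_questions"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical extraction output; the text is a placeholder.
questions = [
    {"year": 2022, "subject": "math", "text": "placeholder"},
    {"year": 2022, "subject": "math", "text": "placeholder"},
    {"year": 2022, "subject": "languages", "text": "placeholder"},
    {"year": 2023, "subject": "math", "text": "placeholder"},
]

rows = extraction_stats(questions)
print(stats_to_csv(rows))
```

With Pandas, the same aggregation would typically be a `groupby` on year and subject followed by `to_csv`.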
This project aims to collect and update data on cryptocurrencies like Bitcoin and Ethereum, storing the information in CSV files. These files cover extensive periods of trading data collected from the Binance US API.
The main technologies used are AWS (Lambda, API Gateway, EC2, and S3), Apache Airflow for data pipeline orchestration, Python and Pandas for data manipulation, and Plotly for plotting the data.
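The "collect and update" step of such a pipeline could look roughly like the sketch below. It assumes the candle (kline) layout returned by Binance-style REST APIs, where each entry is a list starting with the open time in milliseconds followed by open, high, low, close, and volume as strings. No network call is made, the sample values are made up, and `kline_to_row` and `append_new_rows` are hypothetical helpers, not the project's actual code.

```python
from datetime import datetime, timezone


def kline_to_row(kline):
    """Convert one raw kline list into a flat row suitable for a CSV file."""
    open_time = datetime.fromtimestamp(kline[0] / 1000, tz=timezone.utc)
    return {
        "open_time": open_time.isoformat(),
        "open": kline[1],
        "high": kline[2],
        "low": kline[3],
        "close": kline[4],
        "volume": kline[5],
    }


def append_new_rows(existing_rows, klines):
    """Merge freshly fetched klines into existing rows, skipping duplicates."""
    seen = {row["open_time"] for row in existing_rows}
    merged = list(existing_rows)
    for kline in klines:
        row = kline_to_row(kline)
        if row["open_time"] not in seen:
            merged.append(row)
            seen.add(row["open_time"])
    # ISO timestamps in the same timezone sort chronologically as strings.
    merged.sort(key=lambda r: r["open_time"])
    return merged


# Made-up sample data: one candle already stored, two candles "fetched".
existing = [kline_to_row([1672531200000, "100.0", "110.0", "95.0", "105.0", "12.5"])]
fetched = [
    [1672531200000, "100.0", "110.0", "95.0", "105.0", "12.5"],  # duplicate
    [1672617600000, "105.0", "120.0", "101.0", "118.0", "20.0"],  # new day
]

merged = append_new_rows(existing, fetched)
print([row["open_time"] for row in merged])
```

Keying on the candle's open time makes the update idempotent, so a scheduled task can be re-run without duplicating rows in the CSV.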
Here's the architecture of the project/pipeline:
Here is an example plot generated by the project, taken from one of my LinkedIn posts:
This project implements a Python script for data analysis and visualization with plots based on data from IPEA (Institute for Applied Economic Research) and its "Map of Violence" database. It allows analyzing data such as Homicide and Suicide Rates by state and year. The dataset also includes gender-separated data, enabling a historical series analysis of violence against women.
The program automatically generates plots, one for each specified year, based on the retrieved data. The plots will be saved in the current directory.
This project was developed as a group for an electronics class at university and presented at the Undergraduate Research Symposium of the University of São Paulo (SIICUSP 2023).
The goal of this effort was to integrate machine learning models, such as computer vision and text classification, with a robot powered by a microcontroller (ESP-32).
My main contribution was with software development for the ESP-32 embedded systems, using C++ and modules such as Wi-Fi HTTP request handlers.
Here's the certificate for the Symposium:
This project was developed as part of an undergrad course in Operating Systems, with the main goal being to create an interactive game displayed on the terminal using threads and mutexes to allow for concurrent operations, such as rendering the game board, getting user input, and moving the game elements, among others.
One of the main benefits of this project was becoming more familiar with the C++ standard library functions, classes, and structures for working in a multithreaded environment.
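The concurrency pattern described above (separate threads for input, game updates, and rendering, with a mutex protecting shared state) can be sketched as follows. The original project was written in C++; this sketch uses Python's `threading` module purely to illustrate the pattern, and the `GameState` class, the worker functions, and the scripted `"ddda"` input are all assumptions.

```python
import queue
import threading


class GameState:
    """Shared game state; every access goes through the lock (the mutex)."""

    def __init__(self, width):
        self.width = width
        self.player_x = 0
        self.lock = threading.Lock()

    def move(self, dx):
        with self.lock:
            # Clamp the position so the player stays on the board.
            self.player_x = max(0, min(self.width - 1, self.player_x + dx))

    def render(self):
        with self.lock:
            x = self.player_x
        return "".join("@" if i == x else "." for i in range(self.width))


def input_worker(commands, inputs):
    """Stands in for reading keys from the terminal."""
    for key in commands:
        inputs.put(key)
    inputs.put(None)  # sentinel: no more input


def update_worker(state, inputs):
    """Applies each key press to the shared state."""
    while True:
        key = inputs.get()
        if key is None:
            break
        state.move(1 if key == "d" else -1 if key == "a" else 0)


def render_worker(state, frames, count):
    """Renders a fixed number of frames while the other threads run."""
    for _ in range(count):
        frames.append(state.render())


state = GameState(width=8)
inputs = queue.Queue()
frames = []
threads = [
    threading.Thread(target=input_worker, args=("ddda", inputs)),
    threading.Thread(target=update_worker, args=(state, inputs)),
    threading.Thread(target=render_worker, args=(state, frames, 5)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_frame = state.render()
print(final_frame)
```

Which intermediate frames the render thread captures depends on scheduling, but the lock ensures each frame shows a consistent position, and the final frame is deterministic because a single thread applies the moves in order.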