Data Cleaning with OpenRefine (Non-Coding) (Pre-hackathon Workshop)
CUHK Data Hack 2025 Pre-hackathon Workshop
* This workshop is open ONLY to CUHK community members who have registered for CUHK Data Hack 2025*
Instructor: Dr. Michael YU, Department of Computer Science and Engineering, CUHK
Description:
Data accessibility does not always translate to immediate usability. Common data issues include incompleteness, inaccuracy, and incompatible schemas. Incompatible schemas may involve differing column definitions or data types; in short, formats that are not suitable for the intended use.
This is understandable, as data collection and processing often involve different individuals. When using data sources from other people, these are issues that we will always have to address. They are particularly prevalent in larger and richer datasets, as well as when integrating data from diverse sources.
There are several approaches to addressing incompatible schemas. One method involves manually editing the data to ensure compatibility, which is time-consuming and error-prone, especially when dealing with large datasets. A more efficient alternative is to use data transformation tools that automate schema conversion, improving accuracy and saving time. OpenRefine is a valuable open-source tool of this kind that helps resolve these prerequisite issues, paving the way for subsequent analysis.
Our three-hour workshop aims to equip students with the ability to identify schema differences and explore common data incompatibilities. Through hands-on exercises, participants will import data from diverse sources, rectify errors, standardize formats, generate new columns and calculations, and ultimately export the processed data in desired formats. Time permitting, we will delve into real-world datasets, applying data cleaning techniques to enhance their usability.
Registration: CUHK Data Hack 2025 participants ONLY. Join CUHK Data Hack 2025 now!