Research Data Management Development Fund (2022) – The Chinese University of Hong Kong Library

The Research Data Management Development Fund was first established in 2022 by the Committee on Research Data Management of the University’s Research Committee. It aims to encourage CUHK researchers and students to embed research data management (RDM) practices throughout the entire research life cycle, improve data management processes, and promote the use of research data facilities and services at CUHK. All full-time CUHK staff members on professoriate or research academic ranks (i.e. from “Research Assistant Professor” to “Professor”) were invited to apply for the funding as principal investigator.

The Fund supported six projects in 2022 with a total of HKD 600,000 being awarded. Details of the projects are as follows:

Faculty of Education

Educational Opportunities and Social Mobility in Greater Bay Area (GBA): A Comparative Analysis in Hong Kong and Guangdong

Faculty of Engineering

A Multi-level Knowledge-driven Customized Dialogue Dataset for General Use
Fake News Detection via Graph Neural Networks

Faculty of Science

Facilitation of Simulation Force Field Development with Enhanced Research Data Management
The Greater Bay Area and Global Climate Data Repository: Analysis, Visualization, Education

Faculty of Social Science

Tracking Journalistic Sources in Chinese Media (2022-23)

Faculty of Education
Title	Educational Opportunities and Social Mobility in Greater Bay Area (GBA): A Comparative Analysis in Hong Kong and Guangdong
Principal Investigator/ Co-investigator(s)	Name	Affiliation
Principal Investigator	OU Dongshu	Department of Educational Administration and Policy
Co-investigator	WONG Kenneth K.	Education Department, Brown University
Abstract	Research is scant on educational opportunities and social mobility in the Greater Bay Area (GBA), including Hong Kong and Guangdong province. While both regions have witnessed greater educational expansion and prosperous economic growth in the past two decades, they also face challenges in intergenerational inequalities and assimilation. Using Hong Kong Census and By-Census data and China Family Panel Studies, we analyze three topics: (1) educational mismatch and its labor market consequences for male workers, (2) returns to education and economic assimilation of high-skilled migrant workers, and (3) regional disparity in access to and quality of schools across the GBA. Our study provides an updated picture of the trend of social inequality in Hong Kong and Guangdong province as well as a comparison of the two education systems and labor markets. Research results have implications for the utilization and reward of human capital in the labor market of the GBA and for the understanding of potential barriers that create social inequality in the two societies.
Start Date	1-Sep-2022
End Date	31-Aug-2023
Project Report Summary	We analyse the economic development in the 11 cities in Guangdong province, Hong Kong and Macau between 2018 and 2020. Absolute gross domestic product and gross domestic product per capita are compared, and significant differences are observed across cities and over time. This implies a differential negative impact of the COVID-19 pandemic on the regional economy. On the other hand, we investigate the difference in the share of economic sectors, which can be used to identify the key drivers of regional economic growth and explain the variation in responses to the health shock during the period. To explain the difference in economic growth in terms of human capital accumulation, we also examine the pupil-teacher ratio as a measure of school quality at different levels of education. Next, we look at the trends in population to see if this can explain the disequilibrium in the regional labour market. We summarise the results from the policy brief articles in charts and figures, which are shared with the public in the CUHK Research Data Repository. These graphical outputs give readers an update on the shift of economic and educational inequalities in the main GBA cities. They also provide implications for urbanisation and returns to education, which help in understanding the challenges and barriers in the current GBA labour market. The funding was used to hire a student helper and to support part of the salary of research staff working on this project. They were trained to use the CUHK library resources to collect data and manage the online data. From this project, they also learnt how to handle sensitive data and deal with data privacy issues.
DMPID/URL	https://dmphub.uc3prd.cdlib.net/dmps/10.48321/D10M2X
Dataset DOI	https://doi.org/10.48668/YTVMXE
Related Publication	Policy Brief RECEPD No. 2023 – 001, Policy Brief RECEPD No. 2023 – 002, Policy Brief RECEPD No. 2023 – 003, Policy Brief RECEPD No. 2023 – 004

Faculty of Engineering
Title	A Multi-level Knowledge-driven Customized Dialogue Dataset for General Use
Principal Investigator/ Co-investigator(s)	Name	Affiliation
Principal Investigator	WONG Kam Fai	Department of Systems Engineering and Engineering Management
Co-investigator	HO Hon Shan	Department of Systems Engineering and Engineering Management
Abstract	Dialogue systems have been widely applied in applications such as Apple’s Siri and Google Voice. They help us to control our smart devices and to complete tasks, such as making an order in a restaurant or booking a flight ticket with our mobile phone. Unfortunately, there is no dialogue system available for the Small and Medium Enterprises (SMEs) in Hong Kong due to the limited Cantonese dialogue dataset information. For this reason, we propose the first Cantonese Knowledge-Driven Dialogue Dataset for General Use (CK-DDD) in Hong Kong, which collects the information in multi-turn conversations from various sources. Our corpus now contains over 800,000 conversational datasets obtained from over a dozen sources in different business sectors. These datasets will be processed and then made accessible to the public via our online repository. CK-DDD will escalate the utilization of Cantonese dialogue systems in various commercial applications, for example, a live AI chatbot for sales inquiries. The corpus will be constantly updated and expanded if the usage of Cantonese becomes increasingly popular, and the information provided will be more sentimental and reliable. Eventually, we will organize this dialogue dataset as self-sustainable and freely accessible information to the public for research and development purposes.
Start Date	1-Oct-2022
End Date	30-Sep-2023
Project Report Summary	This project is all about bringing innovative technology to restaurants in Hong Kong. Think of it like having a digital helper, similar to Siri or Alexa, but designed specifically for Cantonese-speaking restaurants. This is a big deal because small and medium-sized restaurants in Hong Kong don’t have access to this kind of technology in their native language. Our team is working on a system that can hold conversations in Cantonese. We’ve created a special database called the Cantonese Knowledge-driven Dialogue Dataset for REStaurants (KddRES), which will be available online for everyone. This database contains 800 conversations from 10 different restaurants, each with its own unique style. We’re committed to regularly updating this dataset to ensure it remains a valuable resource for all. Our project has three main goals. First, we’ll share the KddRES dataset with the public, making it a valuable resource. Second, we’re developing a dialogue system that can be customized for any restaurant, which is great news for small and medium-sized eateries. Third, we’ll publish a research paper to tell the academic community about our innovative approach. To make sure the data we collect is top-notch, we’ve hosted and organized a series of workshops and meetings. We’re keeping a close eye on data quality and keeping thorough records. This project is backed by CUHK research data management tools, ensuring our data is well-managed and accessible for future use. We’re using GitHub to share our project’s structure and the CUHK data center for downloading the dataset. So, it’s not just about creating a unique dialogue system for restaurants – it’s also about promoting good data management practices.
DMPID/URL	https://dmphub.uc3prd.cdlib.net/dmps/10.48321/D11C83
Dataset DOI	https://doi.org/10.48668/JNCJ8P
Related Publication	Wang, H., Li, M., Zhou, Z., Fung, G.P., & Wong, K.-F. (2020). KddRES: A Multi-level Knowledge-driven Dialogue Dataset for Restaurant Towards Customized Dialogue System. ArXiv. https://doi.org/10.48550/arxiv.2011.08772.

Faculty of Engineering
Title	Fake News Detection via Graph Neural Networks
Principal Investigator/ Co-investigator(s)	Name	Affiliation
Principal Investigator	WANG Sibo	Department of Systems Engineering and Engineering Management
Abstract	Online Social Networks, the public platform where everyone can express and communicate, not only facilitate users to read and discuss news, but also lower the threshold of “we media” so that everyone can become the source of news, which inevitably makes social network a hotbed for fake news. From the 2016 US election to the COVID-19 epidemic and the outbreak of the Russia-Ukraine war, lots of fake news has swept online social networks. Fake news such as disinfection water killing coronavirus has brought threats to our psychological and physical health. For the task of fake news detection, compared with traditional news websites, social networks have a unique feature of news transmission topology. How to use powerful methods like GNN to model the content feature and topology of news is an important direction in this research field. We aim to detect fake news by generating critical diffusion structures that determine the veracity of the news. Besides, datasets used in the existing work are old and do not contain the latest hot events in recent years. Therefore, we will collect an up-to-date fake news dataset on social networks to ensure that detection methods remain viable in the new era.
Start Date	1-Oct-2022
End Date	31-Mar-2024
Project Report Summary	The proliferation of rumors on social media platforms during significant events, such as the US elections and the COVID-19 pandemic, has a profound impact on social stability and public health. Our project focuses on enhancing the detection of rumors on social media based on the characteristics of propagation patterns. Existing methods struggle when the data contains irrelevant or misleading information, especially when rumors are still emerging and lack detailed propagation paths. To address these challenges, we developed the Key Propagation Graph Generator (KPG), a sophisticated tool that uses reinforcement learning to analyze and generate useful social media data structures for rumor detection. The KPG framework has two main components: the Candidate Response Generator (CRG) and the Ending Node Selector (ENS). The CRG filters out irrelevant data and creates new data points that help in identifying the spread of information, while the ENS pinpoints key patterns in the data that are indicative of misinformation or trustworthy information. Extensive experiments conducted on four datasets demonstrate the superiority of our KPG compared to the state-of-the-art approaches. Furthermore, our project has constructed a new dataset using the Weibo platform, focusing on the spread of COVID-19 information from November 2019 to March 2022. This dataset includes detailed records of how each piece of news was shared across social media, including who shared it and the network connections among the shares. The dataset contains 2,087 rumors and 2,087 non-rumors. The rumors come from the Weibo Community Management Center, an official service where users can report a microblog that contains false information, and the non-rumors come from official accounts and from the rumor-refuting microblogs provided by the Weibo Community Management Center and the China Internet Joint Rumor Debunking Platform.
The results and methodologies developed from this project are expected to impact both academic research and practical applications in social media analytics, especially in strengthening the public’s resilience against misinformation.
DMPID/URL	https://dmphub.uc3prd.cdlib.net/dmps/10.48321/D15D2FCDEC
Dataset DOI	https://doi.org/10.48668/IPJLUK
Related Publication	A conference paper entitled ‘Rumor Detection on Social Media with Reinforcement Learning-based Key Propagation Graph Generator’ by Yusong Zhang et al. directly emerged from this project. Yusong Zhang, Kun Xie, Xingyi Zhang, Xiangyu Dong, Sibo Wang: KPG: Key Propagation Graph Generator for Rumor Detection based on Reinforcement Learning. CoRR abs/2405.13094 (2024)

Faculty of Science
Title	Facilitation of Simulation Force Field Development with Enhanced Research Data Management
Principal Investigator/ Co-investigator(s)	Name	Affiliation
Principal Investigator	TSE Ying Lung	Department of Chemistry
Abstract	At the heart of a molecular dynamics (MD) simulation is the force field that describes the molecular interactions. As the computer algorithms become more sophisticated, the development becomes more complicated, and it has become increasingly difficult to describe all the details of a force field in journal papers or writing in general. Manually inputting the numerical data is prone to typographical errors, but even an apparently small difference in the force field could lead to drastically different numerical results. Fortunately, such issues can be readily resolved by good research data management by sharing the simulation files and the instructions on a reliable/reputable platform. CUHK DMP paired with CUHK Research Data Repository recently introduced would be an ideal platform for hosting and sharing the project description, simulation files, and instructions with the other researchers.
Start Date	1-Dec-2022
End Date	31-May-2024
Project Report Summary	Thanks to the Research Data Management (RDM) Development Fund, our project has gained a deeper understanding of how RDM can facilitate the sharing of valuable data for molecular dynamics simulations that utilize the DeepMD kit. The fund enabled us to hire a student helper and acquire additional computer storage, crucial for handling the large volumes of training data needed for our neural network models. Our research involves creating neural network models with the DeepMD kit, a machine-learning toolkit designed for training artificial neural networks using computationally intensive reference data. The models we develop are stored as graph files on researchdata.cuhk.edu.hk. Despite the large amount of data required for training, the final model files are compact, consisting mainly of the network weights. This compactness is similar to that of large language models like ChatGPT, which, although trained on extensive datasets, are ultimately defined by their weights. These models are easily accessible to other researchers via our data repository, simplifying the process for those interested in applying our methods to their own molecular dynamics studies. This accessibility not only fosters collaboration but also accelerates advancements in the field by providing a ready-to-use resource that can be seamlessly integrated into diverse projects.
DMPID/URL	https://dmphub.uc3prd.cdlib.net/dmps/10.48321/D1W035
Dataset DOI	1) DeepMD model of the air-water interface: https://doi.org/10.48668/OX2ZCD 2) DeepMD model of water-oil interfaces: https://doi.org/10.48668/WVQDAK 3) DeepMD model of water-oil interfaces with catalyst: https://doi.org/10.48668/OCLLON

Faculty of Science
Title	The Greater Bay Area and Global Climate Data Repository: Analysis, Visualization, Education
Principal Investigator/ Co-investigator(s)	Name	Affiliation
Principal Investigator	TAM Chi Yung Francis	Earth System Science Programme
Co-investigator	AU-YEUNG YEE Man Andie	Earth System Science Programme
Co-investigator	LI Kwan Kit Ronald	Earth System Science Programme
Co-investigator	YEUNG Kam Lam Paul	Earth System Science Programme
Abstract	The proposed project plans to launch a climate data repository for simulated data from regional and global climate models from our research group. The uploaded data provide historical and future climate information. They are useful for projections of the future climate, atmospheric environment, hydrology, agriculture, wind power, etc., both globally and in the Greater Bay Area. These data will be accessible to all interested parties, especially those working in related fields. However, the output data from the research models are usually in NetCDF format, which can be challenging to analyze and visualize directly. Therefore, we also propose to develop a set of codes and programmes to convert these data into easily analyzable figures and graphs for those who might not be familiar with working with these output data files. Together with the codes, the climate data repository provides an excellent tool for users who might not have a strong coding and climate background, such as undergraduates in different majors, the media, and the general public, to visualize and analyze the atmospheric circulation for the present as well as the future climate.
Start Date	1-Jan-2023
End Date	31-Dec-2023
Project Report Summary	A climate data repository for simulated data from regional and global climate models from our research group, namely the bias-corrected CMIP6 global dataset for dynamical downscaling for the historical and future climate (1979–2100), has been launched at the CUHK Research Data Repository and is open to public access. The data provide global historical and future climate information, useful for dynamical downscaling projections of the Earth’s future climate, atmospheric environment, hydrology, agriculture, wind power, etc. README files describing the raw data files, as well as R sample scripts for easier data handling and visualization, were developed and included in the repository. This enables the general public, including those who may not have a strong coding and climate background, to easily access and utilize our data for downstream input. The improvement in access to our data may enhance public knowledge of climate change and of the visualization of similar kinds of data. During the development of this project, we have also gained experience in making our data more user-friendly to those unfamiliar with climate data processing. The platform has been useful for sharing our data, and our experience may also be helpful for some courses from the Earth and Environmental Sciences Programme, as well as in other educational projects, such as e-learning material development.
DMPID/URL	https://doi.org/10.48321/D1S605

Faculty of Social Science
Title	Tracking Journalistic Sources in Chinese Media (2022-23)
Principal Investigator/ Co-investigator(s)	Name	Affiliation
Principal Investigator	FANG Kecheng	School of Journalism and Communication
Abstract	This project builds a database to track who has been interviewed and quoted in major Chinese media outlets. The analysis of those individuals—known as “sources”—could help us understand Chinese media and society, including propaganda strategies, the power of elites, gender and ethnic relations, and the ideological divide in society. More specifically, the project focuses on collecting and processing sourcing data from late 2022 to 2023/24. The resulting dataset can be used in conducting further academic research and writing short reports for the public.
Start Date	1-Jul-2022
End Date	30-Jun-2024
Related Publication	Guo, J., Huang, X., & Fang, K. (2023). Authoritarian environmentalism as reflected in the journalistic sourcing of climate change reporting in China. Environmental Communication, 17(5), 502-517. https://doi.org/10.1080/17524032.2023.2223774
Project Report Summary	This project is building a comprehensive dataset of sources quoted in major Chinese media outlets. By analyzing who gets a voice in the news, we gain valuable insights into various aspects of Chinese media and society. The dataset analyzes hundreds of thousands of articles from dozens of media outlets. Each source is meticulously categorized with information like their profession, gender, and area of expertise. This data allows for in-depth analysis of sourcing patterns and trends over time. The dataset is a valuable resource for researchers studying Chinese media, politics, and society. We have already used this dataset to produce journal articles. The following article was published in 2023, and more articles are under review: Guo, J., Huang, X., & Fang, K. (2023). Authoritarian environmentalism as reflected in the journalistic sourcing of climate change reporting in China. Environmental Communication, 17(5), 502-517. I also presented the dataset and relevant research at major international conferences including the International Communication Association conference and the Association for Asian Studies conference. Scholars studying China’s media and society show great interest in this dataset and we are developing potential collaborations in using this dataset for future research. The RDM Development Fund has helped secure raw data of this project, and a robust system was developed to ensure data integrity and accessibility for future use. It also paid for both human coders (student helpers) and generative AI tools to analyze and clean the data. Student helpers gain valuable experience in data processing and analysis, contributing to their professional development. It also supported the attendance of international conferences, broadening the reach and impact of the project.
DMPID/URL	https://dmptool.org/plans/79459