Research Data Management Development Fund (2022) – The Chinese University of Hong Kong Library

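The Research Data Management Development Fund was first established in 2022 by the Research Data Management Committee under the University's Research Committee. It aims to encourage CUHK researchers and students to integrate research data management practices into the entire research lifecycle, improve data management workflows, and promote the use of CUHK research data facilities and services. All full-time CUHK staff at professorial or research academic ranks (i.e. from Research Assistant Professor to Professor) may apply for funding as Principal Investigator.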

The Fund supported six projects in 2022, with total funding of HK$600,000. Project details are as follows:

Faculty of Education

Educational Opportunities and Social Mobility in Greater Bay Area (GBA): A Comparative Analysis in Hong Kong and Guangdong

Faculty of Engineering

A Multi-level Knowledge-driven Customized Dialogue Dataset for General Use
Fake News Detection via Graph Neural Networks

Faculty of Science

Facilitation of Simulation Force Field Development with Enhanced Research Data Management
The Greater Bay Area and Global Climate Data Repository: Analysis, Visualization, Education

Faculty of Social Science

Tracking Journalistic Sources in Chinese Media (2022-23)

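Faculty of Education
Project Title	Educational Opportunities and Social Mobility in Greater Bay Area (GBA): A Comparative Analysis in Hong Kong and Guangdong
Principal Investigator / Co-Investigator	Name	Faculty / Department
Principal Investigator	欧冬舒	Department of Educational Administration and Policy
Co-Investigator	WONG Kenneth K.	Brown University, Education Department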
Abstract	Research is scant on educational opportunities and social mobility in the Greater Bay Area (GBA), including Hong Kong and Guangdong province. While both regions have witnessed greater educational expansion and prosperous economic growth in the past two decades, they also face challenges in intergenerational inequalities and assimilation. Using Hong Kong Census and By-Census data and the China Family Panel Studies, we analyze three topics: (1) educational mismatch and its labor market consequences for male workers, (2) returns to education and economic assimilation of high-skilled migrant workers, and (3) regional disparities in access to and quality of schooling across the GBA. Our study provides an updated picture of the trend in social inequality in Hong Kong and Guangdong province, as well as a comparison of the two education systems and labor markets. The research results have implications for the utilization and reward of human capital in the GBA labor market and for understanding the potential barriers that create social inequality in the two societies.
Start Date	1 September 2022
End Date	31 August 2023
Project Report Summary	We analyse the economic development in the 11 cities in Guangdong province, Hong Kong and Macau between 2018 and 2020. Absolute gross domestic product and gross domestic product per capita are compared, and significant differences are observed across cities and over time. This implies that the COVID-19 pandemic had a differential negative impact on the regional economy. We also investigate differences in the shares of economic sectors, which can be used to identify the key drivers of regional economic growth and to explain the variation in responses to the health shock during the period. To explain the difference in economic growth in terms of human capital accumulation, we examine the pupil-teacher ratio as a measure of school quality at different levels of education. Next, we look at population trends to see whether they can explain the disequilibrium in the regional labour market. We summarise the results of the policy brief articles in charts and figures, which are shared with the public through the CUHK Research Data Repository. These graphical outputs give readers an update on the shift in economic and educational inequalities in the main GBA cities. They also have implications for urbanisation and returns to education, which help in understanding the challenges and barriers in the current GBA labour market. The funding was used to hire a student helper and to support part of the salary of the research staff working on this project. They were trained to use CUHK Library resources to collect data and manage the online data. Through this project, they also learnt how to handle sensitive data and deal with data privacy issues.
Data Management Plan ID / URL	https://dmphub.uc3prd.cdlib.net/dmps/10.48321/D10M2X
Dataset Digital Object Identifier (DOI)	https://doi.org/10.48668/YTVMXE
Related Publications	Policy Brief RECEPD No. 2023 – 001, Policy Brief RECEPD No. 2023 – 002, Policy Brief RECEPD No. 2023 – 003, Policy Brief RECEPD No. 2023 – 004

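Faculty of Engineering
Project Title	A Multi-level Knowledge-driven Customized Dialogue Dataset for General Use
Principal Investigator / Co-Investigator	Name	Faculty / Department
Principal Investigator	黄锦辉	Department of Systems Engineering and Engineering Management
Co-Investigator	何汉山	Department of Systems Engineering and Engineering Management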
Abstract	Dialogue systems have been widely applied in applications such as Apple’s Siri and Google Voice. They help us control our smart devices and complete tasks, such as placing an order in a restaurant or booking a flight ticket with our mobile phone. Unfortunately, there is no dialogue system available for Small and Medium Enterprises (SMEs) in Hong Kong because Cantonese dialogue dataset information is limited. For this reason, we propose the first Cantonese Knowledge-Driven Dialogue Dataset for General Use (CK-DDD) in Hong Kong, which collects information from multi-turn conversations drawn from various sources. Our corpus currently contains over 800,000 conversational datasets obtained from over a dozen sources in different business sectors. These datasets will be processed and then made accessible to the public via our online repository. CK-DDD will expand the use of Cantonese dialogue systems in various commercial applications, for example, live AI chatbots for sales inquiries. The corpus will be constantly updated and expanded if the usage of Cantonese becomes increasingly popular, and the information provided will be more sentimental and reliable. Eventually, we will organize this dialogue dataset as self-sustainable and freely accessible information for the public, for research and development purposes.
Start Date	1 October 2022
End Date	30 September 2023
Project Report Summary	This project is all about bringing innovative technology to restaurants in Hong Kong. Think of it like having a digital helper, similar to Siri or Alexa, but designed specifically for Cantonese-speaking restaurants. This is a big deal because small and medium-sized restaurants in Hong Kong don’t have access to this kind of technology in their native language. Our team is working on a system that can hold conversations in Cantonese. We’ve created a special database called the Cantonese Knowledge-driven Dialogue Dataset for REStaurants (KddRES), which will be available online for everyone. This database contains 800 conversations from 10 different restaurants, each with its own unique style. We’re committed to regularly updating this dataset to ensure it remains a valuable resource for all. Our project has three main goals. First, we’ll share the KddRES dataset with the public, making it a valuable resource. Second, we’re developing a dialogue system that can be customized for any restaurant, which is great news for small and medium-sized eateries. Third, we’ll publish a research paper to tell the academic community about our innovative approach. To make sure the data we collect is top-notch, we’ve hosted and organized a series of workshops and meetings. We’re keeping a close eye on data quality and keeping thorough records. This project is backed by CUHK research data management tools, ensuring our data is well-managed and accessible for future use. We’re using GitHub to share our project’s structure and the CUHK data center for downloading the dataset. So, it’s not just about creating a unique dialogue system for restaurants – it’s also about promoting good data management practices.
Data Management Plan ID / URL	https://dmphub.uc3prd.cdlib.net/dmps/10.48321/D11C83
Dataset Digital Object Identifier (DOI)	https://doi.org/10.48668/JNCJ8P
Related Publications	Wang, H., Li, M., Zhou, Z., Fung, G.P., & Wong, K.-F. (2020). KddRES: A Multi-level Knowledge-driven Dialogue Dataset for Restaurant Towards Customized Dialogue System. arXiv. https://doi.org/10.48550/arxiv.2011.08772.

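Faculty of Engineering
Project Title	Fake News Detection via Graph Neural Networks
Principal Investigator / Co-Investigator	Name	Faculty / Department
Principal Investigator	王思博	Department of Systems Engineering and Engineering Management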
Abstract	Online social networks, public platforms where everyone can express themselves and communicate, not only make it easier for users to read and discuss news but also lower the threshold of “we media” so that anyone can become a source of news, which inevitably makes social networks a hotbed for fake news. From the 2016 US election to the COVID-19 epidemic and the outbreak of the Russia-Ukraine war, a great deal of fake news has swept online social networks. Fake news, such as the claim that disinfectant water kills the coronavirus, has threatened our psychological and physical health. For the task of fake news detection, social networks differ from traditional news websites in having a unique feature: the topology of news transmission. How to use powerful methods such as graph neural networks (GNNs) to model the content features and topology of news is an important direction in this research field. We aim to detect fake news by generating the critical diffusion structures that determine the veracity of the news. In addition, the datasets used in existing work are old and do not contain the major events of recent years. Therefore, we will collect an up-to-date fake news dataset on social networks to ensure that detection methods remain viable in the new era.
Start Date	1 October 2022
End Date	31 March 2024
Project Report Summary	The proliferation of rumors on social media platforms during significant events, such as the US elections and the COVID-19 pandemic, has a profound impact on social stability and public health. Our project focuses on enhancing the detection of rumors on social media based on the characteristics of propagation patterns. Existing methods struggle when the data contains irrelevant or misleading information, especially when rumors are still emerging and lack detailed propagation paths. To address these challenges, we developed the Key Propagation Graph Generator (KPG), a tool that uses reinforcement learning to analyze and generate useful social media data structures for rumor detection. The KPG framework has two main components: the Candidate Response Generator (CRG) and the Ending Node Selector (ENS). The CRG filters out irrelevant data and creates new data points that help identify the spread of information, while the ENS pinpoints key patterns in the data that are indicative of misinformation or trustworthy information. Extensive experiments conducted on four datasets demonstrate the superiority of KPG over state-of-the-art approaches. Furthermore, our project has constructed a new dataset using the Weibo platform, focusing on the spread of COVID-19 information from November 2019 to March 2022. This dataset includes detailed records of how each piece of news was shared across social media, including who shared it and the network connections among the shares. The dataset contains 2,087 rumors and 2,087 non-rumors. The rumors come from the Weibo Community Management Center, an official service where users can report microblogs that contain false information. The non-rumors come from official accounts and from the rumor-refuting microblogs provided by the Weibo Community Management Center and the China Internet Joint Rumor Debunking Platform.
The results and methodologies developed from this project are expected to impact both academic research and practical applications in social media analytics, especially in strengthening the public’s resilience against misinformation.
Data Management Plan ID / URL	https://dmphub.uc3prd.cdlib.net/dmps/10.48321/D15D2FCDEC
Dataset Digital Object Identifier (DOI)	https://doi.org/10.48668/IPJLUK
Related Publications	A conference paper entitled ‘Rumor Detection on Social Media with Reinforcement Learning-based Key Propagation Graph Generator’ by Yusong Zhang et al. emerged directly from this project. Yusong Zhang, Kun Xie, Xingyi Zhang, Xiangyu Dong, Sibo Wang: KPG: Key Propagation Graph Generator for Rumor Detection based on Reinforcement Learning. CoRR abs/2405.13094 (2024).

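Faculty of Science
Project Title	Facilitation of Simulation Force Field Development with Enhanced Research Data Management
Principal Investigator / Co-Investigator	Name	Faculty / Department
Principal Investigator	谢应龙	Department of Chemistry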
Abstract	At the heart of a molecular dynamics (MD) simulation is the force field that describes the molecular interactions. As computer algorithms become more sophisticated, force field development becomes more complicated, and it has become increasingly difficult to describe all the details of a force field in journal papers, or in writing in general. Manually inputting the numerical data is prone to typographical errors, and even an apparently small difference in the force field can lead to drastically different numerical results. Fortunately, such issues can be readily resolved through good research data management, by sharing the simulation files and instructions on a reliable and reputable platform. The recently introduced CUHK DMP, paired with the CUHK Research Data Repository, would be an ideal platform for hosting and sharing the project description, simulation files, and instructions with other researchers.
Start Date	1 December 2022
End Date	31 May 2024
Project Report Summary	Thanks to the Research Data Management (RDM) Development Fund, our project has gained a deeper understanding of how RDM can facilitate the sharing of valuable data for molecular dynamics simulations that utilize the DeepMD kit. The fund enabled us to hire a student helper and acquire additional computer storage, which was crucial for handling the large volumes of training data needed for our neural network models. Our research involves creating neural network models with the DeepMD kit, a machine-learning toolkit designed for training artificial neural networks using computationally intensive reference data. The models we develop are stored as graph files on researchdata.cuhk.edu.hk. Despite the large amount of data required for training, the final model files are compact, consisting mainly of the network weights. This compactness is similar to that of large language models like ChatGPT, which, although trained on extensive datasets, are ultimately defined by their weights. These models are easily accessible to other researchers via our data repository, simplifying the process for those interested in applying our methods to their own molecular dynamics studies. This accessibility not only fosters collaboration but also accelerates advancements in the field by providing a ready-to-use resource that can be integrated into diverse projects.
Data Management Plan ID / URL	https://dmphub.uc3prd.cdlib.net/dmps/10.48321/D1W035
Dataset Digital Object Identifier (DOI)	1) DeepMD model of the air-water interface: https://doi.org/10.48668/OX2ZCD 2) DeepMD model of water-oil interfaces: https://doi.org/10.48668/WVQDAK 3) DeepMD model of water-oil interfaces with catalyst: https://doi.org/10.48668/OCLLON

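Faculty of Science
Project Title	The Greater Bay Area and Global Climate Data Repository: Analysis, Visualization, Education
Principal Investigator / Co-Investigator	Name	Faculty / Department
Principal Investigator	谭志勇	Earth System Science Programme
Co-Investigator	欧阳绮雯	Earth System Science Programme
Co-Investigator	李钧杰	Earth System Science Programme
Co-Investigator	杨锦霖	Earth System Science Programme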
Abstract	The proposed project plans to launch a climate data repository for simulated data from the regional and global climate models of our research group. The uploaded data provide historical and future climate information. They are useful for projections of the future climate, atmospheric environment, hydrology, agriculture, wind power, etc., both globally and in the Greater Bay Area. These data will be accessible to all interested parties, especially those working in related fields. However, the output data from the research models are usually in NetCDF format, which can be challenging to analyze and visualize directly. Therefore, we also propose to develop a set of codes and programmes to convert these data into easily analyzable figures and graphs for those who may not be familiar with working with these output data files. Together with the codes, the climate data repository provides an excellent tool for users who may not have a strong coding and climate background, such as undergraduates in different majors, the media, and the general public, to visualize and analyze the atmospheric circulation for the present as well as the future climate.
Start Date	1 January 2023
End Date	31 December 2023
Project Report Summary	A climate data repository for simulated data from the regional and global climate models of our research group, namely the bias-corrected CMIP6 global dataset for dynamical downscaling for the historical and future climate (1979–2100), has been launched at the CUHK Research Data Repository and is open to public access. The data provide global historical and future climate information, useful for dynamical downscaling projections of the Earth’s future climate, atmospheric environment, hydrology, agriculture, wind power, etc. README files describing the raw data files, as well as R sample scripts for easier data handling and visualization, were developed and included in the repository. This enables the general public, including those who may not have a strong coding and climate background, to easily access and utilize our data for downstream input. The improved access to our data may enhance public knowledge of climate change and of the visualization of similar kinds of data. During the development of this project, we have also gained experience in making our data more user-friendly for those unfamiliar with climate data processing. The platform has been useful for sharing our data, and our experience may also be helpful for some courses in the Earth and Environmental Sciences Programme, as well as in other educational projects, such as e-learning material development.
Data Management Plan ID / URL	https://doi.org/10.48321/D1S605

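Faculty of Social Science
Project Title	Tracking Journalistic Sources in Chinese Media (2022-23)
Principal Investigator / Co-Investigator	Name	Faculty / Department
Principal Investigator	方可成	School of Journalism and Communication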
Abstract	This project builds a database to track who has been interviewed and quoted in major Chinese media outlets. The analysis of those individuals, known as “sources”, could help us understand Chinese media and society, including propaganda strategies, the power of elites, gender and ethnic relations, and the ideological divide in society. More specifically, the project focuses on collecting and processing sourcing data from late 2022 to 2023/24. The resulting dataset can be used for conducting further academic research and writing short reports for the public.
Start Date	1 July 2022
End Date	30 June 2024
Related Publications	Guo, J., Huang, X., & Fang, K. (2023). Authoritarian environmentalism as reflected in the journalistic sourcing of climate change reporting in China. Environmental Communication, 17(5), 502-517. https://doi.org/10.1080/17524032.2023.2223774
Project Report Summary	This project is building a comprehensive dataset of sources quoted in major Chinese media outlets. By analyzing who gets a voice in the news, we gain valuable insights into various aspects of Chinese media and society. The dataset analyzes hundreds of thousands of articles from dozens of media outlets. Each source is meticulously categorized with information such as their profession, gender, and area of expertise. This data allows for in-depth analysis of sourcing patterns and trends over time. The dataset is a valuable resource for researchers studying Chinese media, politics, and society. We have already used this dataset to produce journal articles. The following article was published in 2023, and more articles are under review: Guo, J., Huang, X., & Fang, K. (2023). Authoritarian environmentalism as reflected in the journalistic sourcing of climate change reporting in China. Environmental Communication, 17(5), 502-517. I also presented the dataset and relevant research at major international conferences, including the International Communication Association conference and the Association for Asian Studies conference. Scholars studying China’s media and society have shown great interest in this dataset, and we are developing potential collaborations to use it in future research. The RDM Development Fund has helped secure the raw data of this project, and a robust system was developed to ensure data integrity and accessibility for future use. The Fund also paid for both human coders (student helpers) and generative AI tools to analyze and clean the data. Student helpers gained valuable experience in data processing and analysis, contributing to their professional development. The Fund also supported attendance at international conferences, broadening the reach and impact of the project.
Data Management Plan ID / URL	https://dmptool.org/plans/79459