The China Biographical Database User’s Guide

Michael A. Fuller

Revised Version 3.6

July 26, 2024

Preface to the User’s Guide

Peter K. Bol

The China Biographical Database, as a relational database, can generate biographical data in response to simple queries (who came from a certain place?) and to far more complex queries (what were the social and kinship connections among all those who entered government through the civil service examination from a certain place within a certain span of years?). Users can query CBDB through an online database (follow the links on the CBDB website, https://cbdb.hsites.harvard.edu/). Users can also download the entire database, together with query forms and utilities for exporting data for network and spatial analysis, from the CBDB website and explore the database on any computer with Microsoft Access. We also offer a SQLite format database for quantitative researchers and Mac users. This User’s Guide explains the structure and application of the downloadable, stand-alone database.

CBDB is a relational database. It categorizes and codes many different aspects of the life histories of men and women in China’s past. There are several considerations to bear in mind when reading the User’s Guide’s presentation of the specific details of the database, its design, and its use.

A way of thinking about people in context. CBDB is a way of modeling life histories; it is also a way of thinking about how to organize information. The subject of the database is people in society, but we treat people as entities that have relationships to their kin and their social associations, to places where they resided and worked, to times when they lived and moments when they acted, to names they were given and adopted, to books they wrote, to ways in which they entered government or other institutions, and to the modes in which they distinguished themselves from others. In contrast to the narrative of a life, CBDB sees people as entities defined by webs of relationships that can be quantified and analyzed.
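To make the idea of people as entities defined by relationships concrete, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The table and column names (`person`, `place`, `person_place`) and the helper `people_from` are hypothetical and chosen only for illustration; they are not CBDB's actual schema, which the later sections of this guide describe. The rows are a few illustrative examples, not an extract from CBDB.

```python
import sqlite3

# Toy schema: all table and column names are hypothetical, not CBDB's own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE place  (place_id  INTEGER PRIMARY KEY, name TEXT);
-- A person is linked to a place through a typed relationship.
CREATE TABLE person_place (
    person_id INTEGER REFERENCES person(person_id),
    place_id  INTEGER REFERENCES place(place_id),
    relation  TEXT
);
""")

# Illustrative rows only.
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Su Shi"), (2, "Su Zhe"), (3, "Ouyang Xiu")])
conn.executemany("INSERT INTO place VALUES (?, ?)",
                 [(10, "Meishan"), (20, "Luling")])
conn.executemany("INSERT INTO person_place VALUES (?, ?, ?)",
                 [(1, 10, "native place"),
                  (2, 10, "native place"),
                  (3, 20, "native place")])


def people_from(connection, place_name):
    """Answer the simple query 'who came from a certain place?'."""
    rows = connection.execute(
        """
        SELECT p.name
        FROM person AS p
        JOIN person_place AS pp ON pp.person_id = p.person_id
        JOIN place AS pl ON pl.place_id = pp.place_id
        WHERE pl.name = ? AND pp.relation = 'native place'
        ORDER BY p.name
        """,
        (place_name,),
    ).fetchall()
    return [name for (name,) in rows]


print(people_from(conn, "Meishan"))
```

Because the link between a person and a place is stored as its own row, the same pattern extends to kinship, offices, writings, and the other relationships listed above: each becomes another linking table that can be joined and counted.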

Temporal scope. Over ninety percent of CBDB data pertains to the period from the Tang dynasty (618-907) into the early twentieth century. As of 2019 it had data on about 472,000 figures, with well over 100,000 more in preparation; further data on figures already in the database are frequently added. Tables and trees of place names and official titles will need to be expanded as we incorporate figures from earlier periods.

Factoids versus facts. Like prosopographical databases for other parts of the world, CBDB for the most part deals in “factoids,” the assertions of a fact (such as “Su Shi was a person from Meishan”) found in the historical sources it references. It relates these assertions, including contradictory assertions when they appear, rather than judging their reliability. However, it does not treat all sources as equal.

Principal sources. CBDB began with research conducted by the late Robert Hartwell focused on the middle period of China’s history. Since then, it has been comprehensively incorporating data from published indices, such as Wang Deyi’s revised Index to Biographical Sources for Song Figures and similar works; from online databases, such as the Name Authority Database of the Ming Qing Archive at Academia Sinica, the Tang Knowledge Base at Kyoto University and the Ming Qing Women’s Writings Database directed by Grace Fong at McGill University; from studies of text sources such as collections of epitaphs (墓誌銘); from listings of local officials in local gazetteers and records of appointments; and from biographies in formal dynastic sources. Although CBDB editors at Harvard and Peking University are experimenting with mining data from other sources, it will take some time before the principal sources are exhausted.

Text-mining. The most efficient way to populate CBDB has been through the use of computational text-mining techniques to cull factoids from searchable digital texts that have been provided by the Institute of History and Philology at Academia Sinica or generated by the CBDB project itself. This began in collaboration with computer scientists on a US National Endowment for the Humanities grant. The Harvard editorial team, led first by Professor Song Chen, then by Dr. Shih-pei Chen, and currently by Mr. Hongsu Wang (who has had the assistance of Dr. Lik Hang Tsui, Mr. Merrick Lex Berman, and Ms. Edith Enright), has overseen the development of “regular expressions” appropriate to Chinese sources and the process of incorporating new data. The Peking University editorial team reviews the marked-up text, and the managers then oversee the final coding of the data for inclusion in CBDB. This process does not guarantee that all possible factoids are discovered, only that those included will accurately reflect the sources being mined.
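As a rough illustration of what a regular expression for Chinese sources can look like, the sketch below pulls a name, courtesy name, and native place out of the formulaic opening of a biography. The pattern `OPENING` and the function `extract_factoids` are hypothetical and written for this example; they are not the expressions the CBDB editorial teams actually use, and real sources need many more patterns plus human review.

```python
import re

# Hypothetical pattern, not one used by the CBDB project. It matches the
# common biography opening "NAME，字COURTESY_NAME，PLACE人".
HAN = r"[\u4e00-\u9fff]"  # one character from the basic CJK block
OPENING = re.compile(
    rf"(?P<name>{HAN}{{2,4}})，字(?P<courtesy>{HAN}{{1,3}})，"
    rf"(?P<place>{HAN}{{2,6}})人"
)


def extract_factoids(text):
    """Return one dict per matched opening, keyed by the named groups."""
    return [match.groupdict() for match in OPENING.finditer(text)]


sample = "蘇軾，字子瞻，眉州眉山人。歐陽修，字永叔，廬陵人。"
for factoid in extract_factoids(sample):
    print(factoid)
```

A sentence that does not follow the formula is simply skipped, which is why pattern-based mining of this kind cannot promise to find every factoid in a text.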

Margin of error. Machines are more reliable than humans in sifting through large quantities of data but are incapable of interpretation and scholarly judgment. Errors can enter the database. The historical sources themselves can be incorrect. Editors may miss mistakes in tagging. Encoders may fail to properly disambiguate two entities with the same name. A user must always ask if the query to the database produces enough examples to ensure that the margin of error will not undermine confidence in the conclusions that are drawn. The discrepancies between the sources and the original CBDB data were significant, and considerable time was spent correcting the received data; with the adoption of computational techniques the discrepancies appear to be less than one percent. To put this in perspective: an argument based on 1000 examples of which ten are faulty is better than a finding based on ten examples of which one is erroneous.
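The closing comparison can be checked with simple arithmetic. The helper `error_rate` below is a hypothetical name introduced only for this sketch; it computes the share of faulty examples in each scenario.

```python
def error_rate(faulty, total):
    """Return the fraction of examples that are faulty."""
    if total <= 0:
        raise ValueError("total must be positive")
    return faulty / total


# 1000 examples with ten faulty versus ten examples with one faulty.
large_sample = error_rate(10, 1000)
small_sample = error_rate(1, 10)

print(f"large sample: {large_sample:.0%} faulty, {1000 - 10} sound examples")
print(f"small sample: {small_sample:.0%} faulty, {10 - 1} sound examples")
```

The larger query has one tenth the error rate of the smaller one and rests on far more sound examples.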

A database is not a dictionary. CBDB can be used as a guide to biographical factoids about an individual, and it can provide more data about some aspects of a person’s connections than would be found in a biographical dictionary. However, the standard for a dictionary is complete accuracy in all aspects, whereas the expectation for a database is that the cases discovered will be useful because they are extensive in range and number.

CBDB is a joint project of the Center for Research on Ancient Chinese History at Peking University, the Institute of History and Philology at Academia Sinica, and the Fairbank Center for Chinese Studies at Harvard University. At Harvard it is housed in the Institute for Quantitative Social Sciences, which provides administrative support. It is guided by a steering committee that includes scholars and collaborators from across the globe. Michael A. Fuller, the author of this User’s Guide, has designed all iterations of the database.

Since 2005 CBDB has been supported by grants from the Harvard University Faculty of Arts and Sciences and the Harvard University Asia Center, the Institute of History and Philology at Academia Sinica, the Center for Research on Ancient Chinese History at Peking University, the National Endowment for the Humanities, the Tang Research Foundation, the Tang Studies Society, the Henry Luce Foundation, the Chiang Ching-kuo Foundation, the Canadian Social Sciences and Humanities Research Council, the bequest of the late Robert Hartwell to the Harvard-Yenching Institute, and significant support from a licensing arrangement with ChineseAll.com. In China, CBDB data, supplemented with extensive biographical data on twentieth-century figures, is available through subscription to the Yinde System (https://www.inindex.cn) provided by ChineseAll.com. Over the years many scholars have visited Harvard and contributed to the project; all participants are recognized on the CBDB website.

This User’s Guide explains the logic of CBDB as a relational database, the structure of its contents, the primary query interfaces for getting data from the database, and installation procedures for different operating systems. Please also consult Appendix E of the User’s Guide for a summary of the most recent changes to the database and to the user interface.