Is GitHub OK as a repository for FAIR data? A lot of language data intended for natural language processing technologies is stored on GitHub, together with the code that is used for data processing. It’s a very useful resource.
Thanks in advance.
Hi Diana - I think this is a useful question. There is some information on here regarding trusted repositories (Repository certification) and some further information on the FAIRsFAIR site about the work done on the FAIRification of repositories (Support programme for Data Repositories | FAIRsFAIR), but it would be good to hear from others on here specifically on your question.
GitHub is convenient for making data accessible. However, one important FAIR principle, F1, is that “(meta)data are assigned a globally unique and persistent identifier”. This matters because if someone uses and cites the data, and the data then moves from GitHub to another service a few years from now, a reader following the citation should still be able to find it. The cited link should therefore not be a github.com link. It should be, for example, a DOI link, which in turn leads to GitHub for now.
Currently, one needs to use a service external to GitHub to obtain and assign such identifiers. GitHub documents this at https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content, using Zenodo as an example of how to obtain DOIs.
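To make the difference between a hosting link and a persistent link concrete, here is a minimal Python sketch. It only formats and inspects links and does not contact GitHub, Zenodo, or doi.org. The function names `doi_to_url` and `is_github_link` are made up for illustration, and the DOI shown is a placeholder, not a real record.

```python
from urllib.parse import urlparse

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI such as 'doi:10.1234/abc' into a resolver link (hypothetical helper)."""
    doi = doi.strip()
    # Drop common prefixes so that only the bare DOI remains.
    for prefix in ("doi:", "https://doi.org/", "http://doi.org/"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with '10.' and has a '/' separating prefix and suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + doi


def is_github_link(url: str) -> bool:
    """Return True if the link points at github.com, i.e. at the current host."""
    host = (urlparse(url).hostname or "").lower()
    return host == "github.com" or host.endswith(".github.com")


# Placeholder DOI for illustration only; use the one your archive assigns.
citation_link = doi_to_url("doi:10.5281/zenodo.0000000")
print(citation_link)
print(is_github_link(citation_link))
print(is_github_link("https://github.com/some-user/some-corpus"))
```

A check like `is_github_link` could be run over the links in a paper or README before publication, to flag citations that would break if the data moved to another host.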