PLOS collection data dump

Hi there!

As some of you might have seen or heard, PLOS has decided to discontinue their Open Source Toolkit collection, which is a pity: it means we no longer have an ongoing curated list of academic articles about open science hardware.

@jcm80, Tom Baden and I were the editors of the channel, and we got a little “data dump” from the PLOS people in the form of an Excel file :sad: containing all of the articles that were in the collection.

The three of us thought it would be good to turn that into a similar resource, that is, a community-driven website where people can “deposit” their own papers.

The system also has:

  • a search tool
  • filtering by tags

It is based on Hugo and GitHub Pages, with an auto-deploy action that runs whenever changes are pushed to the main branch.

This was supposed to be done a while ago, but as always time was short and I only managed to get so far… At this point something like this was also added to the wish list for the GOSH website, I think?

So here is the “state of the art” for what I have done so far:

1 - I created a template website that lives on Open Source Toolkit
2 - The data dump is now a CSV file
3 - Everything is in an open repo here: GitHub - amchagas/open-source-toolkit: An open implementation of the Open Source Toolkit (a curated list of articles describing open source tools for research)
4 - There is Python code that converts each entry in the CSV into an .md file with a YAML header and regular Markdown text, and stores each file in its own folder with a “featured” image
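To give a feel for step 4, here is a minimal sketch of a CSV-row-to-Markdown converter with a YAML front-matter header. This is not the code from the repo; the column names (`title`, `tags`, `summary`, `image`) are assumptions for illustration, since the actual headers in the PLOS dump may differ.

```python
import csv
import io

def row_to_markdown(row):
    """Turn one CSV row (as a dict) into a Markdown page with a YAML
    front-matter header, roughly the shape Hugo expects."""
    # Tags in the dump are assumed to be ';'-separated; adjust as needed.
    tags = ", ".join(t.strip() for t in row.get("tags", "").split(";") if t.strip())
    front_matter = (
        "---\n"
        f'title: "{row["title"]}"\n'
        f"tags: [{tags}]\n"
        f'featured_image: "{row.get("image", "featured.png")}"\n'
        "---\n\n"
    )
    return front_matter + row.get("summary", "")

# Tiny in-memory example:
sample_csv = "title,tags,summary\nOpenFlexure,microscopy; 3D printing,A 3D-printed microscope stage.\n"
page = row_to_markdown(next(csv.DictReader(io.StringIO(sample_csv))))
print(page)
```

In the real pipeline each page would be written to its own folder (e.g. `content/<slug>/index.md`) next to its featured image, so Hugo picks it up as a page bundle.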

Work needed:
1 - Split the software papers from the hardware papers, as currently they are all mixed up
2 - Work on the tag filtering system, as the current one is still from Open Neuroscience
3 - Work on the tags in each paper, as right now they still use some sort of PLOS numbering code (listed in the CSV file)

If anyone finds this useful and would/could contribute, that would be wonderful!


Hey André,
sad and great to read this, and happy that you have plan B already lined up.
In addition to your approach, we could co-curate a collection on Zenodo, ScienceOpen and/or Figshare on the same topic, “open source toolkit collection”. We could bulk upload the whole current set in one go, and thus preserve the collection as is while allowing ourselves and others to continue curating it on our preferred OA channel(s) :slight_smile:
What do you think?


Hi Jo,

Thanks for your thoughts!

Having this little “paper database” curated on one of those platforms is of course fine (would you mind sharing which advantages you see in having it there?). At the same time (and I might be wrong here), they do not offer the opportunity to filter deposited content by keywords in the papers, correct? That makes them a bit harder to use in terms of sifting through the data. This CSV file alone has something like 450+ entries, I think…

Another point is that if we start running parallel instances of the same thing, they might start diverging, which would be quite sad, I think.

What about making snapshots of the repository using Zenodo (the only one I know of that integrates with GitHub?), for instance? This would contain all the information about the papers and make sure we have a unified source.

Another route could be to keep the list of papers on Zenodo and see how/if we could pull that data and parse it into a website.

In any case, whichever platform and whichever route we choose, the most important thing at the moment (as far as I can tell) is to clean up this CSV file, since right now things are mixed up and not very useful from an end-user point of view.


I would say there are 2 big advantages to Zenodo:

  1. You get a DOI which guarantees some level of permanence
  2. It is a recognised organisation backed by CERN, OpenAIRE, and the EU

Zenodo does have keyword filtering.

As you say, it is possible to use the Zenodo API from GitHub. I am not sure if you can make a new item with its own DOI for different things inside the repo; it would be nice to get each article a DOI.
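For concreteness, here is a rough Python sketch of talking to the Zenodo deposit API. The field names follow Zenodo's documented REST schema, but treat this as an assumption to verify against the current API docs; the token, names and keywords are placeholders.

```python
import json
import urllib.request

# A sandbox instance exists at sandbox.zenodo.org for safe testing.
ZENODO_API = "https://zenodo.org/api/deposit/depositions"

def build_metadata(title, description, creators, keywords):
    """Assemble deposition metadata in the shape the Zenodo REST API takes."""
    return {
        "metadata": {
            "title": title,
            "upload_type": "publication",
            "publication_type": "article",
            "description": description,
            "creators": [{"name": name} for name in creators],
            "keywords": keywords,
        }
    }

def create_deposition(token, metadata):
    """POST a new (unpublished) deposition; a DOI is minted on publish."""
    req = urllib.request.Request(
        f"{ZENODO_API}?access_token={token}",
        data=json.dumps(metadata).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_metadata(
    "Example open hardware paper",
    "Editor's summary goes here.",
    ["Chagas, A. M."],
    ["open hardware", "open science"],
)
print(json.dumps(payload, indent=2))
```

Driving this from a GitHub Action would just mean looping over the cleaned CSV and calling `create_deposition` once per entry.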

Of course @access2perspectives is the best person to talk to about this, as AfricArXiv runs on Zenodo. I suppose any curation features that Open Source Toolkit needs would probably also be of interest to AfricArXiv. So if Open Source Toolkit uses Zenodo, this can be implemented once and shared :slight_smile:


@amchagas that is a real pity!

The good thing is that we have our own Journal of Open Hardware, which has been in need of more substantial help from our community for the past three years! It could be a great venue for future publications that would otherwise have gone to PLOS.

I second the suggestion of using Zenodo: it gives you all the features you will need for long-term preservation. Plus, it is so convenient to use.


Hi @unixjazz !

Sorry, maybe I was too brief in my explanation of the Open Source Toolkit. It was not a venue for publishing papers, but simply a place where we curated a list of papers that came out and were related to open hardware. This included many sources: JOH, PLOS journals, HardwareX, IEEE, etc.

As far as I know, PLOS is not stopping publishing papers related to open hardware…


Hey there @amchagas!
Totally, I understand.

I am the one who did not explain myself, so I apologize: I meant “publication” in the broadest sense possible, so a blog post, a research report, or an evaluation of the state of the art in a certain domain is totally within the scope of JOH!

We also have pieces that are “conference reports” that can be useful to keep a minimal record of our collective memories. This is what I was thinking of!


Ni! Dude… am I delirious or ain’t this a textbook scenario for using Wikidata as backend?! .~´


@unixjazz , ah ok! thanks for clarifying!

@access2perspectives and @julianstirling: I had a closer look at Zenodo, created a community and I am now under the impression that we would have to add each paper by hand?

Or is there a tool to batch process info based on DOI/link? If my impression is correct, this is just silly, sorry. The data we have from PLOS is too sparse, and the job would be excruciating for 450+ entries.
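On the batch-processing question: since most entries have a DOI or link, one option is to pull fuller metadata from the public Crossref API rather than typing it by hand. A minimal sketch (the sample record below is invented, but it follows the shape Crossref actually returns):

```python
import json
import urllib.request

def fetch_crossref(doi):
    """Fetch article metadata from the public Crossref REST API (no auth)."""
    with urllib.request.urlopen(f"https://api.crossref.org/works/{doi}") as resp:
        return json.load(resp)["message"]

def to_row(record):
    """Flatten a Crossref record into the handful of fields our CSV needs."""
    return {
        "title": (record.get("title") or [""])[0],
        "journal": (record.get("container-title") or [""])[0],
        "doi": record.get("DOI", ""),
        "year": record.get("issued", {}).get("date-parts", [[None]])[0][0],
    }

# Offline example using the shape Crossref returns (values invented):
sample = {
    "title": ["An open source toolkit for research"],
    "container-title": ["PLOS Biology"],
    "DOI": "10.1371/example",
    "issued": {"date-parts": [[2020, 5, 1]]},
}
row = to_row(sample)
print(row)
```

Looping `fetch_crossref` over the 450+ DOIs (politely rate-limited) would enrich the sparse PLOS dump without any manual data entry.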

@solstag I know very little about Wikidata… Care to explain how we could plug this data there?


@amchagas: look into the API:


One more thing, this is the raw data PLOS managed to send us:

Hi André! I’d be happy to help clean up the CSV file. Perhaps it would be best to organize a date/time where we can set aside a few hours and have a small group of people work together to split up the papers and tag them? I don’t mind organizing a session for this if so :slightly_smiling_face:. Since this is something added to the GOSH website wish list, we could also discuss the possibility of getting it running on a domain.


I found this bulk upload explainer: GitHub - darvasd/upload-to-zenodo: Simple script to batch upload submissions to @zenodo (personally I have no experience with it).
Zenodo has a super-responsive support team, so we could just email them the challenge and ask them to help us.
The other place mentioned that is certainly capable of handling this is ScienceOpen. If you agree, we could pass it on to Stephanie Dawson to consult her tech team. Otherwise, their bulk intake into collections is easy and straightforward: just add all the DOIs and their scraper bots will do the job.
See ScienceOpen Collections - About ScienceOpen


Ni! Let us get this straight, meaning correct me if I’m mistaken:

The collection was a storefront for articles published in diverse publications. Meaning it is about showcasing, not hosting, the articles. The material we want to publish and expand is, therefore, article METADATA, not full articles.

About Wikidata:

  • Hosting facts about diverse entities, and not the entities themselves, is the goal of Wikidata.
  • Since it is a semantic universal database, you get for free that your entities can be queried in connection with tons of other information.
  • You can add meaningful properties in connection with millions of already structured entities, including other indexed literature.
  • As a nice treat, you can ask which entities in the collection are also used as sources in Wikipedia.
  • Since it is a wiki, it should be easy to collaborate and curate it even for people with little technical skills.
  • Articles can be inserted automatically as well, even enriched with fact extraction — @jcm80 can tell you about this :wink:
  • Since it is hosted by Wikimedia and has a powerful API endpoint, we don’t have to worry about the database infrastructure or how to share it across sites.
  • GOSH can build its interface entirely independently of any other hypothetical ones, and yet they’d all immediately share contributions and updates.

To give you a concrete idea, this is an example of a Wikidata entry that represents a scientific article:

This is, interestingly, an article that talks about (the much more complex task of) producing and organizing general information about Covid-19 on Wikidata.

There’s plenty of bots and software libraries to work with Wikidata. There’s also a whole community dedicated to hosting article metadata there, called WikiCite.
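To make the “powerful API endpoint” point concrete, here is a minimal stdlib-only sketch of querying article metadata from Wikidata's public SPARQL endpoint. P31 (instance of), P921 (main subject) and Q13442814 (scholarly article) are real Wikidata identifiers; the topic Q-id passed in below is a hypothetical placeholder that would need to be looked up on wikidata.org.

```python
import json
import urllib.parse
import urllib.request

def build_query(topic_qid, limit=10):
    """SPARQL for scholarly articles (Q13442814) whose main subject (P921)
    is the given topic. topic_qid must be a real Wikidata item ID."""
    return f"""
    SELECT ?article ?articleLabel WHERE {{
      ?article wdt:P31 wd:Q13442814 ;
               wdt:P921 wd:{topic_qid} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """

def run_query(query):
    """Send the query to the public Wikidata SPARQL endpoint."""
    url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    req = urllib.request.Request(url, headers={"User-Agent": "ost-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

query = build_query("Q123456")  # hypothetical topic Q-id, replace with the real one
print(query)
```

A static site could call `run_query` at build time (or from the browser), which is exactly the “interface independent of the backend” property described above.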

EDIT: btw, if you’ve never played with Wikidata, perhaps first try a simple tutorial! It will help you understand the rest.



@amchagas I think I misunderstood a little when I was suggesting Zenodo. I didn’t realise that each entry was just a short summary and a link to another paper.

The back end seems pretty simple, whether it is a CSV file on GitHub or Wikidata. The issue is the front end, which makes it look pretty and searchable. Perhaps someone who understands the magic of JavaScript better than I do could suggest something. @kaspar ?

Hi all,

thanks for all of your inputs! I have already learned some new things :slight_smile:

@solstag : You are correct that the PLOS website held just some metadata about the articles collected there. In fact, in most cases it was really just a link to the original article. In other cases there was an “editor’s summary”, but that was pretty much it.

@briannaljohns Thanks for the offer! I think this is going to be quite helpful. Before diving into the clean-up, we should define what we want to clean/add/remove from the file? Maybe continuing this here as an open discussion is good, once we decide whether, as a community, curating these papers is something we still want to do, and if so, how?

@julianstirling: The website template I am using already has search and filtering capabilities. It is not that pretty, but it works as a minimal viable thing. Before spending time on developing something prettier, I would define what we want to store and how.

@access2perspectives: If I am understanding everything correctly, adding things to a collection where the actual papers are archived could come in as an add-on? What I mean is: if we (or someone) keep curating these papers and storing their metadata somewhere, it should be easy to then archive the papers using one of the tools you mentioned.

So in line with the individual points above, if I am getting all of this correctly, what we mostly need to worry about is not how to store the current data, as there seem to be many different tool options for that, but rather whether we as a community want to keep doing the work of curating papers, and to create a small pipeline for that.


Yes, that could work. ScienceOpen allows for automated curation, so we would just need to keep an eye on it from time to time to keep it clean. Each time a new article gets added by their bot, they send an email notification for verification. See e.g. COVID-19 Research in and about Africa – ScienceOpen
Via Zenodo, we can bulk upload once in a while (weekly/monthly/…).


For wikidata:

Daniel Mietchen from Wikidata / Scholia helped me upload the collection of papers @jarancio and I curated on Zotero:

If you want to get your papers in, @amchagas, there is a chance they are already registered!

If they are not, you will help us get all the literature on open hardware onto Wikidata / Scholia (which is not an ambitious task… the literature is not big yet :slight_smile:)

Best wishes,