PLOS collection data dump

Hi there!

As some of you might have seen or heard, PLOS has decided to discontinue their Open Source Toolkit collection, which is a pity: it means we no longer have an ongoing curated list of academic articles about open science hardware.

@jcm80, Tom Baden and I were the editors of the channel, and we got a little “data dump” from the PLOS people in the form of an Excel file :sad: containing all of the articles that were in the collection.

The three of us thought it would be good to turn that into a similar resource, that is, a community-driven website where people can “deposit” their own papers.

The system also has:

  • a search tool
  • filtering by tags

It is based on Hugo and GitHub Pages, with an auto-deploy action that runs whenever changes are pushed to the main branch.

This was supposed to be done a while ago, but as always time was short and I only managed to get so far… At this point something like this was also added to the wish list for the GOSH website, I think?

So here is the “state of the art” for what I have done so far:

1 - I created a template website that lives on Open Source Toolkit
2 - The data dump is now a CSV file
3 - Everything is in an open repo here: GitHub - amchagas/open-source-toolkit: An open implementation of the Open Source Toolkit (a curated list of articles describing open source tools for research)
4 - There is Python code that converts each entry in the CSV into an .md file with a YAML header and regular Markdown text, and stores each file in its own folder with a “featured” image
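To give a feel for step 4, here is a minimal sketch of a CSV-row-to-Markdown converter with a YAML front-matter header. This is not the code from the repo; the column names (`title`, `tags`, `summary`, `image`) are assumptions for illustration, since the actual headers in the PLOS dump may differ.

```python
import csv
import io

def row_to_markdown(row):
    """Turn one CSV row (as a dict) into a Markdown page with a YAML
    front-matter header, roughly the shape Hugo expects."""
    # Tags in the dump are assumed to be ';'-separated; adjust as needed.
    tags = ", ".join(t.strip() for t in row.get("tags", "").split(";") if t.strip())
    front_matter = (
        "---\n"
        f'title: "{row["title"]}"\n'
        f"tags: [{tags}]\n"
        f'featured_image: "{row.get("image", "featured.png")}"\n'
        "---\n\n"
    )
    return front_matter + row.get("summary", "")

# Tiny in-memory example:
sample_csv = "title,tags,summary\nOpenFlexure,microscopy; 3D printing,A 3D-printed microscope stage.\n"
page = row_to_markdown(next(csv.DictReader(io.StringIO(sample_csv))))
print(page)
```

In the real pipeline each page would be written to its own folder (e.g. `content/<slug>/index.md`) next to its featured image, so Hugo picks it up as a page bundle.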

Work needed:
1 - Split the software papers from the hardware papers, as currently they are all mixed up
2 - Work on the tag filtering system, as the current one is still from Open Neuroscience
3 - Work on the tags in each paper, as right now they still use some sort of PLOS numbering code (listed in the CSV file)

If anyone finds this useful and would/could contribute, that would be wonderful!


Hey André,
sad and great to read this, and happy that you have plan B already lined up.
In addition to your approach, we could co-curate a collection on Zenodo, ScienceOpen and/or Figshare on the same topic, “open source toolkit collection”. We could bulk upload the whole current set in one go, and thus preserve the collection as is while allowing ourselves and others to continue curating it on our preferred OA channel(s) :slight_smile:
What do you think?


Hi Jo,

Thanks for your thoughts!

Having this little “paper database” curated on one of those platforms is of course fine (would you mind sharing which advantages you see in having it there?). At the same time (and I might be wrong here), they do not offer the opportunity to filter deposited content by keywords in the papers, correct? That makes them a bit harder to use in terms of sifting through the data. This CSV file alone has something like 450+ entries, I think…

Another point is that if we start running parallel instances of the same thing, they might start diverging, which would be quite sad, I think.

What about making snapshots of the repository using Zenodo (the only one I know of that integrates with GitHub?), for instance? This would contain all the information about the papers and make sure we have a unified source.

Another route could be to keep the list of papers on Zenodo and see how/if we could pull that data and parse it into a website.

In any case, whichever platform and whichever route we choose, the most important thing at the moment (as far as I can tell) is to clean up this CSV file, since right now things are mixed up and not very useful from an end-user point of view.


I would say there are 2 big advantages to Zenodo:

  1. You get a DOI which guarantees some level of permanence
  2. It is a recognised organisation backed by CERN, OpenAIRE, and the EU

Zenodo does have keyword filtering.

As you say, it is possible to use the Zenodo API from GitHub. I am not sure if you can make a new item with its own DOI for different things inside the repo; it would be nice to get each article a DOI.
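For concreteness, here is a rough Python sketch of talking to the Zenodo deposit API. The field names follow Zenodo's documented REST schema, but treat this as an assumption to verify against the current API docs; the token, names and keywords are placeholders.

```python
import json
import urllib.request

# A sandbox instance exists at sandbox.zenodo.org for safe testing.
ZENODO_API = "https://zenodo.org/api/deposit/depositions"

def build_metadata(title, description, creators, keywords):
    """Assemble deposition metadata in the shape the Zenodo REST API takes."""
    return {
        "metadata": {
            "title": title,
            "upload_type": "publication",
            "publication_type": "article",
            "description": description,
            "creators": [{"name": name} for name in creators],
            "keywords": keywords,
        }
    }

def create_deposition(token, metadata):
    """POST a new (unpublished) deposition; a DOI is minted on publish."""
    req = urllib.request.Request(
        f"{ZENODO_API}?access_token={token}",
        data=json.dumps(metadata).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_metadata(
    "Example open hardware paper",
    "Editor's summary goes here.",
    ["Chagas, A. M."],
    ["open hardware", "open science"],
)
print(json.dumps(payload, indent=2))
```

Driving this from a GitHub Action would just mean looping over the cleaned CSV and calling `create_deposition` once per entry.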

Of course @access2perspectives is the best person to talk to about this, as AfricArXiv runs on Zenodo. I suppose any curation features that Open Source Toolkit needs would probably also be of interest to AfricArXiv. So if Open Source Toolkit uses Zenodo, this can be implemented once and shared :slight_smile:


@amchagas that is a real pity!

The good thing is that we have our own Journal of Open Hardware, which has been in need of more substantial help from our community for the past three years! It could be a great venue for future publications that would otherwise have gone to PLOS.

I second the suggestion of using Zenodo: it gives you all the features you will need for long-term preservation. Plus, it is so convenient to use.


Hi @unixjazz !

Sorry, maybe I was too brief in my explanation of the Open Source Toolkit. It was not a venue for publishing papers, but simply a place where we curated a list of papers that came out and were related to open hardware. This included many sources: JOH, PLOS journals, HardwareX, IEEE, etc.

As far as I know, PLOS is not stopping publishing papers related to open hardware…


Hey there @amchagas!
Totally, I understand.

I am the one who did not explain myself, so I apologize: I meant “publication” in the broadest sense possible, so a blog post, a research report, or an evaluation of the state of the art in a certain domain is totally within the scope of JOH!

We also have pieces that are “conference reports” that can be useful to keep a minimal record of our collective memories. This is what I was thinking of!


Ni! Dude… am I delirious or ain’t this a textbook scenario for using Wikidata as backend?! .~´


@unixjazz , ah ok! thanks for clarifying!

@access2perspectives and @julianstirling: I had a closer look at Zenodo, created a community and I am now under the impression that we would have to add each paper by hand?

Or is there a tool to batch process info based on DOI/link? If my impression is correct, this is just silly, sorry. The data we have from PLOS is too sparse, and the job would be excruciating for 450+ entries.
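On the batch-processing question: since most entries have a DOI or link, one option is to pull fuller metadata from the public Crossref API rather than typing it by hand. A minimal sketch (the sample record below is invented, but it follows the shape Crossref actually returns):

```python
import json
import urllib.request

def fetch_crossref(doi):
    """Fetch article metadata from the public Crossref REST API (no auth)."""
    with urllib.request.urlopen(f"https://api.crossref.org/works/{doi}") as resp:
        return json.load(resp)["message"]

def to_row(record):
    """Flatten a Crossref record into the handful of fields our CSV needs."""
    return {
        "title": (record.get("title") or [""])[0],
        "journal": (record.get("container-title") or [""])[0],
        "doi": record.get("DOI", ""),
        "year": record.get("issued", {}).get("date-parts", [[None]])[0][0],
    }

# Offline example using the shape Crossref returns (values invented):
sample = {
    "title": ["An open source toolkit for research"],
    "container-title": ["PLOS Biology"],
    "DOI": "10.1371/example",
    "issued": {"date-parts": [[2020, 5, 1]]},
}
row = to_row(sample)
print(row)
```

Looping `fetch_crossref` over the 450+ DOIs (politely rate-limited) would enrich the sparse PLOS dump without any manual data entry.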

@solstag I know very little about Wikidata… Care to explain how we could plug this data there?


@amchagas: look into the API:


One more thing, this is the raw data PLOS managed to send us:

Hi André! I’d be happy to help clean up the CSV file. Perhaps it would be best to organize a date/time where we can set aside a few hours and have a small group of people work together to split up the papers and tag them? I don’t mind organizing a session for this if so :slightly_smiling_face:. Since this is something added to the GOSH website wish list, we could also discuss the possibility of getting it running on a domain.


I found this bulk upload explainer: GitHub - darvasd/upload-to-zenodo: Simple script to batch upload submissions to @zenodo (personally I have no experience with it).
Zenodo has a super-responsive support team, so we could just email them the challenge and ask them to help us.
The other place mentioned that is certainly capable of handling this is ScienceOpen. If you agree, we could pass it on to Stephanie Dawson to consult her tech team. Otherwise, their bulk intake into collections is easy and straightforward: just add all the DOIs and their scraper bots will do the job.
See ScienceOpen Collections - About ScienceOpen


Ni! Let us get this straight, meaning correct me if I’m mistaken:

The collection was a storefront for articles published in diverse publications. Meaning it is about showcasing, not hosting, the articles. The material we want to publish and expand is, therefore, article METADATA, not full articles.

About Wikidata:

  • Hosting facts about diverse entities, and not the entities themselves, is the goal of Wikidata.
  • Since it is a semantic universal database, you get for free that your entities can be queried in connection with tons of other information.
  • You can add meaningful properties in connection with millions of already structured entities, including other indexed literature.
  • As a nice treat, you can ask which entities in the collection are also used as sources in Wikipedia.
  • Since it is a wiki, it should be easy to collaborate and curate it even for people with little technical skills.
  • Articles can be inserted automatically as well, even enriched with fact extraction — @jcm80 can tell you about this :wink:
  • Since it is hosted by Wikimedia and has a powerful API endpoint, we don’t have to worry about the database infrastructure or how to share it across sites.
  • GOSH can build its interface entirely independently of any other hypothetical ones, and yet they’d all immediately share contributions and updates.

To give you a concrete idea, this is an example of a Wikidata entry that represents a scientific article:

This is, interestingly, an article that talks about (the much more complex task of) producing and organizing general information about Covid-19 on Wikidata.

There’s plenty of bots and software libraries to work with Wikidata. There’s also a whole community dedicated to hosting article metadata there, called WikiCite.
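To make the “powerful API endpoint” point concrete, here is a minimal stdlib-only sketch of querying article metadata from Wikidata's public SPARQL endpoint. P31 (instance of), P921 (main subject) and Q13442814 (scholarly article) are real Wikidata identifiers; the topic Q-id passed in below is a hypothetical placeholder that would need to be looked up on wikidata.org.

```python
import json
import urllib.parse
import urllib.request

def build_query(topic_qid, limit=10):
    """SPARQL for scholarly articles (Q13442814) whose main subject (P921)
    is the given topic. topic_qid must be a real Wikidata item ID."""
    return f"""
    SELECT ?article ?articleLabel WHERE {{
      ?article wdt:P31 wd:Q13442814 ;
               wdt:P921 wd:{topic_qid} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """

def run_query(query):
    """Send the query to the public Wikidata SPARQL endpoint."""
    url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    req = urllib.request.Request(url, headers={"User-Agent": "ost-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

query = build_query("Q123456")  # hypothetical topic Q-id, replace with the real one
print(query)
```

A static site could call `run_query` at build time (or from the browser), which is exactly the “interface independent of the backend” property described above.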

EDIT: btw, if you’ve never played with Wikidata, perhaps first try a simple tutorial! It will help you understand the rest.



@amchagas I think I misunderstood a little when I was suggesting Zenodo. I didn’t realise that each entry was just a short summary and a link to another paper.

The back end seems pretty simple, whether it is a CSV file on GitHub or Wikidata. The issue is the front end, which makes it look pretty and searchable. Perhaps someone who understands the magic of JavaScript better than I do could suggest something. @kaspar ?

Hi all,

thanks for all of your inputs! I have already learned some new things :slight_smile:

@solstag : You are correct that the PLOS website held just some metadata about the articles collected there. In fact, in most cases it was really just a link to the original article. In other cases there was an “editor’s summary”, but that was pretty much it.

@briannaljohns Thanks for the offer! I think this is going to be quite helpful. Before diving into the clean-up, we should define what we want to clean/add/remove from the file? Maybe continuing this here as an open discussion is good, once we decide whether, as a community, curating these papers is something we still want to do, and if so, how?

@julianstirling: The website template I am using already has search and filtering capabilities. It is not that pretty, but it works as a minimal viable thing. Before spending time on developing something prettier, I would define what we want to store and how.

@access2perspectives: If I am understanding everything correctly, adding things to a collection where the actual papers are archived could come in as an add-on? What I mean is: if we (or someone) keep curating these papers and storing their metadata somewhere, it should be easy to then archive the papers using one of the tools you mentioned.

So in line with the individual points above, if I am getting all of this correctly, what we mostly need to worry about is not how to store the current data, as there seem to be many different tool options for that, but rather whether we as a community want to keep doing the work of curating papers, and to create a small pipeline for that.


Yes, that could work. ScienceOpen allows for automated curation, so we would just need to keep an eye on it from time to time to keep it clean. Each time a new article gets added by their bot, they send an email notification for verification. See e.g. COVID-19 Research in and about Africa – ScienceOpen
Via Zenodo, we can bulk upload once in a while (weekly/monthly/…).


For wikidata:

Daniel Mietchen from Wikidata / Scholia helped me upload the collection of papers @jarancio and I curated on Zotero:

If you want to get your papers in, @amchagas, there is a chance they are already registered!

If they are not, you will help us get all the literature on open hardware onto Wikidata / Scholia (which is not an ambitious task… the literature is not big yet :slight_smile:)

Best wishes,