The Internet is full of half-truths and outright lies. Search engines do not deliver results based on the truth value of sites, but on popularity, page ranking and such. Suppose that, 10 years ago, you were arrested for child porn, with headlines in the newspapers. Three months later, charges were dropped, everyone apologized profusely to you for the mistake, the government paid a ton of money for your troubles and the prosecutor who got you arrested lost his job.
That sounds nice in theory, but your stance makes a few assumptions: 1) there is a perfect objective view of what the truth is and 2) the internet is not a dynamic, adaptive source of information. For point 1, people may have two different perspectives on what should and should not be public knowledge. For example, if a politician is caught embezzling money, they may want to be forgotten to avoid further persecution and move on with their life. Voters in other regions may want to know and remember that fact to avoid putting a historically dishonest person in power. I think both sides have some merit here.
For the second point, the internet is a highly adaptive and dynamic source of information. If you attempt to take information down, someone else may put it up anonymously somewhere else. How does one filter the good information from the bad? Or should we just remove any mention of the person by brute force? What if the person in question has a similar name to yours? This approach may censor potentially damaging information, but it may also censor potentially useful information, like your resume or personal website. The right to be forgotten takes a naive and sometimes despotic approach to controlling information. And it fails because it ignores the technical constraints on implementing such an idea.
Finally, you didn't need to invoke a variant of Godwin's law to discuss this topic. It's rich and complicated enough without bringing child pornography into it.
I'm not sure how to interpret the results, as the study does not explain what the effect size is, or how impactful it is to general health. If there are any biologists in the crowd who can explain this, that would be super helpful.
This naturally brings us to a bunch of controversial solutions: apprenticeships, subsidized colleges, increased minimum wage, loan forgiveness programs, etc. I'm personally in favor of any option that enables citizens to get better paying jobs, regardless of whether the debt is paid back or not. Most of the time, the government will easily make back its money through increased taxes on higher paying jobs, and society benefits from having more people available to take on the advanced jobs.
It would also be useful to have clearly demarcated sections for the abstract, results, references, etc. Again, you could set BIO (Begin-Inside-Outside) tags based on the section title and formatting style, but you may run into a few false positives if those words are used elsewhere in the text, and the two-column issue mentioned earlier may dump in text from other sections. Finally, there's little distinction between the body of the manuscript and the header/footer information.
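To make the BIO idea concrete, here is a rough sketch of line-level tagging driven by section titles. The title list, the heading pattern and the function name `bio_tag_lines` are all my own assumptions for illustration, not anything the articles or a particular library define. It only treats a line as a heading when the whole line is a (possibly numbered) title, which cuts down on the false positives from words like "results" showing up mid-sentence.

```python
import re

# Hypothetical list of section titles; real articles will vary.
SECTION_TITLES = ("abstract", "introduction", "methods", "results",
                  "discussion", "references")

# A heading is a line made up of an optional section number plus a known
# title and nothing else, so "results" inside a sentence does not match.
HEADING_RE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?(" + "|".join(SECTION_TITLES) + r")\s*:?\s*$",
    re.IGNORECASE,
)


def bio_tag_lines(lines):
    """Return (tag, line) pairs using line-level BIO tags.

    B-<SECTION> marks a heading line, I-<SECTION> marks every line that
    follows it until the next heading, and O marks lines seen before any
    heading (for example, a journal banner).
    """
    tagged = []
    current = None  # name of the section we are currently inside, if any
    for line in lines:
        match = HEADING_RE.match(line)
        if match:
            current = match.group(1).upper()
            tagged.append((f"B-{current}", line))
        elif current is not None:
            tagged.append((f"I-{current}", line))
        else:
            tagged.append(("O", line))
    return tagged


if __name__ == "__main__":
    sample = [
        "Journal of Examples, page 1",
        "Abstract",
        "We report results of a survey.",
        "2. Results",
        "The results were mixed.",
    ]
    for tag, text in bio_tag_lines(sample):
        print(f"{tag:12} {text}")
```

Note that this does nothing about the header/footer problem: a running page header that lands in the middle of a section would still get an I- tag, so it would need separate filtering.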
Overall, the text is a bit messy. If you're just looking for keywords, then it's not a big deal. If you are trying to extract more complicated syntactic structures within the document, then it becomes a problem.
Also, while we're on the topic of text mining, would it be possible to get text-only or XML-based articles, with figures attached and cross-references as needed? It's quite annoying to manually convert a PDF when trying to set up an automated analysis over several documents. I know one could set up a shell script to dump it out using the pdftoxml converter, but the output is a bit messy to parse.
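For anyone who wants the batch-dump workaround in the meantime, something along these lines might do. This is only a sketch, and it swaps in Poppler's `pdftotext` (invoked as `pdftotext input.pdf output.txt`) instead of pdftoxml, since I don't want to guess at pdftoxml's options. The function names and directory layout are made up for the example.

```python
from pathlib import Path
import shutil
import subprocess


def build_commands(pdf_dir, out_dir, converter="pdftotext"):
    """Build one converter command per PDF in pdf_dir, without running it.

    Assumes the converter is called as `<converter> input.pdf output.txt`,
    which is how Poppler's pdftotext works.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    commands = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        target = out_dir / (pdf.stem + ".txt")
        commands.append([converter, str(pdf), str(target)])
    return commands


def convert_all(pdf_dir, out_dir, converter="pdftotext"):
    """Run the converter on every PDF; stop at the first failure."""
    if shutil.which(converter) is None:
        raise FileNotFoundError(f"{converter} was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_commands(pdf_dir, out_dir, converter):
        subprocess.run(command, check=True)
```

Calling `convert_all("papers", "papers_txt")` would write one `.txt` file per PDF. The output will still have the same messiness described above, so this only saves the manual conversion step, not the cleanup.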