Generic extraction of main text content from HTML files, with removal of ads, sidebars and headers, using the boilerpipe <https://github.com/kohlschutter/boilerpipe> Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
Version: | 1.3.2 |
Imports: | rJava |
Suggests: | RCurl |
Published: | 2021-05-19 |
Author: | See AUTHORS file. boilerpipeR author details |
Maintainer: | Mario Annau <mario.annau at gmail.com> |
BugReports: | https://github.com/mannau/boilerpipeR/issues |
License: | Apache License (== 2.0) |
URL: | https://github.com/mannau/boilerpipeR |
NeedsCompilation: | no |
Materials: | NEWS |
In views: | NaturalLanguageProcessing, WebTechnologies |
CRAN checks: | boilerpipeR results |
Reference manual: | boilerpipeR.pdf |
Vignettes: | Introduction to the tm.plugin.webmining Package |
Package source: | boilerpipeR_1.3.2.tar.gz |
Windows binaries: | r-devel: boilerpipeR_1.3.2.zip, r-release: boilerpipeR_1.3.2.zip, r-oldrel: boilerpipeR_1.3.2.zip |
macOS binaries: | r-release (arm64): boilerpipeR_1.3.2.tgz, r-oldrel (arm64): boilerpipeR_1.3.2.tgz, r-release (x86_64): boilerpipeR_1.3.2.tgz, r-oldrel (x86_64): boilerpipeR_1.3.2.tgz |
Old sources: | boilerpipeR archive |
Please use the canonical form https://CRAN.R-project.org/package=boilerpipeR to link to this page.