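audubon is a set of Japanese text processing tools for filling iteration marks, converting between character classes (hiragana, katakana, and romaji), segmenting text into phrases, and normalizing text.
Some of the features above are not implemented in ICU (i.e., the stringi package); the goal of the audubon package is to provide these additional features.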
remotes::install_github("paithiov909/audubon")
strj_fill_iter_mark repeats the previous character and replaces the iteration marks if the element has more than 5 characters.
You can use this feature with strj_normalize or strj_rewrite_as_def.
strj_fill_iter_mark(c(
  "あいうゝ〃かき",
  "金子みすゞ",
  "のたり〳〵かな",
  "しろ/″\\とした"
))
#> [1] "あいうううかき" "金子みすず" "のたりたりかな" "しろじろとした"
strj_fill_iter_mark("いすゞエルフトラック") |>
strj_normalize()
#> [1] "いすずエルフトラック"
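To make the idea of "repeating the previous character" concrete, here is a minimal base R sketch. It is not audubon's implementation: `fill_simple_iter_mark` is a hypothetical helper that only handles the unvoiced single-character marks (ゝ, ヽ, 々, 〃) and ignores voiced marks such as ゞ and multi-character marks such as 〳〵.

```r
# Hypothetical helper (not part of audubon): replace each simple iteration
# mark with the character that precedes it.
fill_simple_iter_mark <- function(x) {
  # U+309D (hiragana mark), U+30FD (katakana mark),
  # U+3005 (kanji mark), U+3003 (ditto mark)
  marks <- c("\u309d", "\u30fd", "\u3005", "\u3003")
  vapply(x, function(s) {
    chars <- strsplit(s, "")[[1]]
    for (i in seq_along(chars)) {
      # A mark at position 1 has nothing to repeat, so it is left as is.
      if (i > 1 && chars[i] %in% marks) {
        chars[i] <- chars[i - 1]
      }
    }
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

fill_simple_iter_mark("あいうゝ〃かき")
#> [1] "あいうううかき"
```

Because the loop writes back into `chars`, consecutive marks resolve left to right, which is why both marks in the example become う.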
Character class conversion uses hakatashi/japanese.js.
strj_hiraganize("あのイーハトーヴォのすきとおった風")
#> [1] "あのいーはとーゔぉのすきとおった風"
strj_katakanize("あのイーハトーヴォのすきとおった風")
#> [1] "アノイーハトーヴォノスキトオッタ風"
strj_romanize("あのイーハトーヴォのすきとおった風")
#> [1] "anoīhatōvonosukitōtta"
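As a rough illustration of what hiragana-to-katakana conversion involves, the sketch below shifts code points in base R. It relies on the fact that the Unicode hiragana letters U+3041 to U+3096 sit exactly 0x60 below their katakana counterparts. `to_katakana_sketch` is a hypothetical name, and this is not necessarily how audubon or japanese.js performs the conversion.

```r
# Hypothetical helper (not part of audubon): convert hiragana to katakana
# by shifting code points; all other characters pass through unchanged.
to_katakana_sketch <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)
    is_hira <- cp >= 0x3041 & cp <= 0x3096
    cp[is_hira] <- cp[is_hira] + 0x60
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

to_katakana_sketch("あのイーハトーヴォのすきとおった風")
#> [1] "アノイーハトーヴォノスキトオッタ風"
```

Romanization cannot be done with a fixed offset like this, since it needs a mapping table and rules for long vowels and geminate consonants.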
strj_segment splits Japanese text into phrases using google/budoux and TinySegmenter.
strj_segment("あのイーハトーヴォのすきとおった風")
#> $`1`
#> [1] "あのイーハトーヴォの" "すきと" "おった"
#> [4] "風"
strj_normalize normalizes text following rules based on the NEologd style.
strj_normalize("――南アルプスの 天然水- Sparking* Lemon+ レモン一絞り")
#> [1] "ー南アルプスの天然水-Sparking* Lemon+レモン一絞り"
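For a sense of what NEologd-style rules look like, the following base R sketch covers just two of them: collapsing runs of characters that resemble the long vowel mark into a single ー, and collapsing runs of hyphen-like characters into an ASCII hyphen. `normalize_dashes_sketch` is a hypothetical helper, the character sets are a subset chosen for illustration, and strj_normalize itself applies more rules than this (for example, the whitespace handling visible in the output above).

```r
# Hypothetical helper (not part of audubon): two dash-related rules only.
normalize_dashes_sketch <- function(x) {
  # Characters resembling the long vowel mark, including U+30FC itself,
  # so that repeated marks also collapse to one.
  long_vowel_like <- "[\ufe63\uff0d\uff70\u2014\u2015\u2500\u2501\u30fc]+"
  # Characters resembling a hyphen-minus.
  hyphen_like <- "[\u02d7\u058a\u2010\u2011\u2012\u2013\u2043\u207b\u208b\u2212]+"
  x <- gsub(long_vowel_like, "\u30fc", x)
  gsub(hyphen_like, "-", x)
}

# "\u2015\u2015" is two HORIZONTAL BAR characters.
normalize_dashes_sketch("\u2015\u2015南アルプス")
#> [1] "ー南アルプス"
```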
strj_rewrite_as_def is an R port of SudachiCharNormalizer that normalizes characters according to the rules in a '*.def' file.
The audubon package contains several '*.def' files, so you can use them or write your own 'rewrite.def' file as follows.
# a character listed alone on a line will **never** be normalized.
…
# if two characters are separated with a tab,
# the left-side form is always rewritten to the right-side form
# before normalization.
斎	斉
齋	斉
齊	斉
# only single-character to single-character rewriting is supported,
# i.e., the following rule does not work.
アッ	ア
This feature is more powerful than stringi::stri_trans_*
because it allows users to control which characters are normalized. For
instance, this function can be used to convert kyuji-tai
characters to shinji-tai characters.
stringi::stri_trans_nfkc("Ⅹⅳ")
#> [1] "Xiv"
strj_rewrite_as_def("Ⅹⅳ")
#> [1] "Ⅹⅳ"
strj_rewrite_as_def("惡と假面のルール", read_rewrite_def(system.file("def/kyuji.def", package = "audubon")))
#> [1] "悪と仮面のルール"
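The tab-separated rule format described above can be sketched in base R as follows. This only illustrates the single-character rewrite rules and is not audubon's implementation: `parse_rules_sketch` and `apply_rules_sketch` are hypothetical helpers, and the sketch skips both the "never normalize" entries and the normalization step that strj_rewrite_as_def performs.

```r
# Hypothetical helpers (not part of audubon).
# Parse lines of the form "<one character><TAB><one character>".
parse_rules_sketch <- function(lines) {
  # Drop blank lines and comments.
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  parts <- strsplit(lines, "\t", fixed = TRUE)
  # Keep only single-character to single-character rules; lines without
  # a tab and multi-character rules are ignored.
  ok <- vapply(
    parts,
    function(p) length(p) == 2L && all(nchar(p) == 1L),
    logical(1)
  )
  parts <- parts[ok]
  list(
    from = paste(vapply(parts, `[`, character(1), 1L), collapse = ""),
    to = paste(vapply(parts, `[`, character(1), 2L), collapse = "")
  )
}

# Apply the parsed rules character by character.
apply_rules_sketch <- function(x, rules) {
  if (!nzchar(rules$from)) {
    return(x)
  }
  chartr(rules$from, rules$to, x)
}

rules <- parse_rules_sketch(c(
  "# comment lines are skipped",
  "斎\t斉",
  "齋\t斉",
  "アッ\tア"
))
apply_rules_sketch("齋藤", rules)
#> [1] "斉藤"
```

Using `chartr` keeps the sketch short because it maps characters position by position. It also shows why a rule such as アッ to ア cannot be expressed in this format.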
© 2022 Akiru Kato
Licensed under the Apache License, Version 2.0.
Icons made by iconixar from www.flaticon.com.