Describe and understand the world through data.
Data collection and data comparison are the foundations of scientific research. Mathematics provides the abstract framework to describe patterns we observe in nature, and statistics provides the framework to quantify the uncertainty of these patterns. In statistics, natural patterns are described in the form of probability distributions, which follow either a fixed pattern (parametric distributions) or more dynamic patterns (non-parametric distributions).
The philentropy package implements fundamental distance and similarity measures to quantify distances between probability density functions as well as traditional information theory measures. In this regard, it aims to provide a framework for comparing natural patterns in a statistical notation.
This project is born out of my passion for statistics and I hope that it will be useful to the people who share it with me.
I am developing philentropy in my spare time and would be very grateful if you would consider citing the following paper in case philentropy was useful for your own research. I plan on maintaining and extending the philentropy functionality and usability in the coming years and require citations to back up these efforts. Many thanks in advance :)
HG Drost, (2018). Philentropy: Information Theory and Distance Quantification with R. Journal of Open Source Software, 3(26), 765. https://doi.org/10.21105/joss.00765
The 46 distance and similarity measures currently available, as returned by getDistMethods():
 [1] "euclidean"         "manhattan"         "minkowski"
 [4] "chebyshev"         "sorensen"          "gower"
 [7] "soergel"           "kulczynski_d"      "canberra"
[10] "lorentzian"        "intersection"      "non-intersection"
[13] "wavehedges"        "czekanowski"       "motyka"
[16] "kulczynski_s"      "tanimoto"          "ruzicka"
[19] "inner_product"     "harmonic_mean"     "cosine"
[22] "hassebrook"        "jaccard"           "dice"
[25] "fidelity"          "bhattacharyya"     "hellinger"
[28] "matusita"          "squared_chord"     "squared_euclidean"
[31] "pearson"           "neyman"            "squared_chi"
[34] "prob_symm"         "divergence"        "clark"
[37] "additive_symm"     "kullback-leibler"  "jeffreys"
[40] "k_divergence"      "topsoe"            "jensen-shannon"
[43] "jensen_difference" "taneja"            "kumar-johnson"
[46] "avg"
# define a probability density function P
P <- 1:10/sum(1:10)
# define a probability density function Q
Q <- 20:29/sum(20:29)
# combine P and Q as matrix object
x <- rbind(P,Q)
# compute the jensen-shannon distance between
# probability density functions P and Q
philentropy::distance(x, method = "jensen-shannon")
jensen-shannon using unit 'log'.
jensen-shannon
0.02628933
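To make this number concrete, here is a minimal base-R sketch of the Jensen-Shannon divergence. It assumes the standard definition (the average Kullback-Leibler divergence of P and Q to their mixture) and the natural logarithm, in line with the 'log' unit reported above. The helper names `kl_div` and `js_div` are illustrative and are not philentropy functions.

```r
# Illustrative helpers (hypothetical names, not part of philentropy).
# Both assume p and q are strictly positive and each sums to 1.
kl_div <- function(p, q) {
  sum(p * log(p / q))                  # natural logarithm, result in nats
}

js_div <- function(p, q) {
  m <- (p + q) / 2                     # mixture of the two distributions
  0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
}

# same probability vectors as in the example above
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

jsd_nats <- js_div(P, Q)
print(jsd_nats)
```

Under these assumptions the result should agree with the value reported by distance() up to rounding.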
Alternatively, users can retrieve the values of all available distance/similarity measures at once using philentropy::dist.diversity():
euclidean manhattan
0.12807130 0.35250464
minkowski chebyshev
0.12807130 0.06345083
sorensen gower
0.17625232 0.03525046
soergel kulczynski_d
0.29968454 0.42792793
canberra lorentzian
2.09927095 0.49712136
intersection non-intersection
0.82374768 0.17625232
wavehedges czekanowski
3.16657887 0.17625232
motyka kulczynski_s
0.58812616 2.33684211
tanimoto ruzicka
0.29968454 0.70031546
inner_product harmonic_mean
0.10612245 0.94948528
cosine hassebrook
0.93427641 0.86613103
jaccard dice
0.13386897 0.07173611
fidelity bhattacharyya
0.97312397 0.03930448
hellinger matusita
0.32787819 0.23184489
squared_chord squared_euclidean
0.05375205 0.01640226
pearson neyman
0.16814418 0.36742465
squared_chi prob_symm
0.10102943 0.20205886
divergence clark
1.49843905 0.86557468
additive_symm kullback-leibler
0.53556883 0.13926288
jeffreys k_divergence
0.31761069 0.04216273
topsoe jensen-shannon
0.07585498 0.03792749
jensen_difference taneja
0.03792749 0.04147518
kumar-johnson avg
0.62779644 0.20797774
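As a cross-check on how to read this listing, a few entries can be reproduced in base R without calling philentropy. The sketch below assumes the standard textbook definitions and base-2 logarithms for the information-theoretic entries. The base-2 assumption is inferred rather than stated in the listing: the jensen-shannon value here (0.03792749) equals the earlier natural-log result (0.02628933) divided by ln 2. The variable names are illustrative.

```r
# same probability vectors as in the example above
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
M <- (P + Q) / 2                       # mixture used by Jensen-Shannon

# geometric distances (independent of the logarithm base)
euclidean <- sqrt(sum((P - Q)^2))
manhattan <- sum(abs(P - Q))
chebyshev <- max(abs(P - Q))

# information-theoretic measures in bits (assumed base-2 logarithm)
kl_bits  <- sum(P * log2(P / Q))
jsd_bits <- 0.5 * sum(P * log2(P / M)) + 0.5 * sum(Q * log2(Q / M))

# converting from bits to nats multiplies by log(2)
jsd_nats <- jsd_bits * log(2)

print(c(euclidean = euclidean, manhattan = manhattan, chebyshev = chebyshev,
        kl_bits = kl_bits, jsd_bits = jsd_bits, jsd_nats = jsd_nats))
```

If the assumptions hold, the first five values match the euclidean, manhattan, chebyshev, kullback-leibler and jensen-shannon entries in the listing up to rounding, and `jsd_nats` matches the natural-log result from distance().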
# install.packages("devtools")
# install the current version of philentropy on your system
library(devtools)
install_github("HajkD/philentropy", build_vignettes = TRUE, dependencies = TRUE)
The current status of the package as well as a detailed history of the functionality of each version of philentropy can be found in the NEWS section.
distance(): Implements 46 fundamental probability distance (or similarity) measures
getDistMethods(): Get available method names for ‘distance’
dist.diversity(): Distance Diversity between Probability Density Functions
estimate.probability(): Estimate Probability Vectors From Count Vectors
H(): Shannon’s Entropy H(X)
JE(): Joint-Entropy H(X,Y)
CE(): Conditional-Entropy H(X | Y)
MI(): Shannon’s Mutual Information I(X,Y)
KL(): Kullback–Leibler Divergence
JSD(): Jensen-Shannon Divergence
gJSD(): Generalized Jensen-Shannon Divergence

Studies that applied the philentropy package:
An atlas of gene regulatory elements in adult mouse cerebrum YE Li, S Preissl, X Hou, Z Zhang, K Zhang et al. - Nature, 2021
Convergent somatic mutations in metabolism genes in chronic liver disease S Ng, F Rouhani, S Brunner, N Brzozowska et al. - Nature, 2021
Antigen dominance hierarchies shape TCF1+ progenitor CD8 T cell phenotypes in tumors ML Burger, AM Cruz, GE Crossland et al. - Cell, 2021
High-content single-cell combinatorial indexing R Mulqueen et al. - Nature Biotechnology, 2021
Extinction at the end-Cretaceous and the origin of modern Neotropical rainforests MR Carvalho, C Jaramillo et al. - Science, 2021
HERMES: a molecular-formula-oriented method to target the metabolome R Giné, J Capellades, JM Badia et al. - Nature Methods, 2021
The genetic architecture of temperature adaptation is shaped by population ancestry and not by selection regime KA Otte, V Nolte, F Mallard et al. - Genome Biology, 2021
Gut microbiome-mediated metabolism effects on immunity in rural and urban African populations M Stražar, GS Temba, H Vlamakis et al. - Nature Communications, 2021
Aging, inflammation and DNA damage in the somatic testicular niche with idiopathic germ cell aplasia M Alfano, AS Tascini, F Pederzoli et al. - Nature Communications, 2021
Single cell census of human kidney organoids shows reproducibility and diminished off-target cells after transplantation A Subramanian et al. - Nature Communications, 2019
Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche C Coupé, YM Oh, D Dediu, F Pellegrino - Science Advances, 2019
Loss of adaptive capacity in asthmatic patients revealed by biomarker fluctuation dynamics after rhinovirus challenge A Sinha et al. - eLife, 2019
The Tug1 lncRNA locus is essential for male fertility JP Lewandowski et al. - Genome Biology, 2020
Sex and hatching order modulate the association between MHC‐II diversity and fitness in early‐life stages of a wild seabird M Pineaux et al. - Molecular Ecology, 2020
How the Choice of Distance Measure Influences the Detection of Prior-Data Conflict K Lek, R Van De Schoot - Entropy, 2019
Differential variation analysis enables detection of tumor heterogeneity using single-cell RNA-sequencing data EF Davis-Marcisak, TD Sherman et al. - Cancer Research, 2019
Multi-Omics Investigation of Innate Navitoclax Resistance in Triple-Negative Breast Cancer Cells M Marczyk et al. - Cancers, 2020
Impact of Gut Microbiome on Hypertensive Patients with Low-Salt Intake: Shika Study Results S Nagase et al. - Frontiers in Medicine, 2020
Combined TCR Repertoire Profiles and Blood Cell Phenotypes Predict Melanoma Patient Response to Personalized Neoantigen Therapy plus Anti-PD-1 A Poran et al. - Cell Reports Medicine, 2020
Phenotyping of acute and persistent COVID-19 features in the outpatient setting: exploratory analysis of an international cross-sectional online survey S Sahanic, P Tymoszuk, D Ausserhofer et al. - medRxiv, 2021
A two-part evaluation approach for measuring the usability and user experience of an Augmented Reality-based assistance system to support the temporal coordination of spatially dispersed teams L Thomaschewski, B Weyers, A Kluge - Cognitive Systems Research, 2021
SEDE-GPS: socio-economic data enrichment based on GPS information T Sperlea, S Füser, J Boenigk, D Heider - BMC Bioinformatics, 2018
Evacuees and Migrants Exhibit Different Migration Systems after the Great East Japan Earthquake and Tsunami M Hauer, S Holloway, T Oda - 2019
Robust comparison of similarity measures in analogy based software effort estimation P Phannachitta - 11th International Conference on Software, 2017
RUNIMC - An R-based package for imaging mass cytometry data analysis and pipeline validation L Dolcetti, PR Barber, G Weitsman, S Thavaraj et al. - bioRxiv, 2021
Expression variation analysis for tumor heterogeneity in single-cell RNA-sequencing data EF Davis-Marcisak, P Orugunta et al. - BioRxiv, 2018
Concept acquisition and improved in-database similarity analysis for medical data I Wiese, N Sarna, L Wiese, A Tashkandi, U Sax - Distributed and Parallel Databases, 2019
Dynamics of Vaginal and Rectal Microbiota over Several Menstrual Cycles in Female Cynomolgus Macaques MT Nugeyre, N Tchitchek, C Adapen et al. - Frontiers in Cellular and Infection Microbiology, 2019
Inferring the quasipotential landscape of microbial ecosystems with topological data analysis WK Chang, L Kelly - BioRxiv, 2019
Shifts in the nasal microbiota of swine in response to different dosing regimens of oxytetracycline administration KT Mou, HK Allen, DP Alt, J Trachsel et al. - Veterinary microbiology, 2019
The Patchy Distribution of Restriction–Modification System Genes and the Conservation of Orphan Methyltransferases in Halobacteria MS Fullmer, M Ouellette, AS Louyakis et al. - Genes, 2019
Genetic differentiation and intrinsic genomic features explain variation in recombination hotspots among cocoa tree populations EJ Schwarzkopf, JC Motamayor, OE Cornejo - BioRxiv, 2019
Metastable regimes and tipping points of biochemical networks with potential applications in precision medicine SS Samal, J Krishnan, AH Esfahani et al. - Reasoning for Systems Biology and Medicine, 2019
Genome‐wide characterization and developmental expression profiling of long non‐coding RNAs in Sogatella furcifera ZX Chang, OE Ajayi, DY Guo, QF Wu - Insect science, 2019
Development of a simulation system for modeling the stock market to study its characteristics P Mariya - 2018
The Tug1 Locus is Essential for Male Fertility JP Lewandowski, G Dumbović, AR Watson, T Hwang et al. - BioRxiv, 2019
Microbiotyping the sinonasal microbiome A Bassiouni, S Paramasivan, A Shiffer et al. - BioRxiv, 2019
Critical search: A procedure for guided reading in large-scale textual corpora J Guldi - Journal of Cultural Analytics, 2018
A Bibliography of Publications about the R, S, and S-Plus Statistics Programming Languages NHF Beebe - 2019
Improved state change estimation in dynamic functional connectivity using hidden semi-Markov models H Shappell, BS Caffo, JJ Pekar, MA Lindquist - NeuroImage, 2019
A Smart Recommender Based on Hybrid Learning Methods for Personal Well-Being Services RM Nouh, HH Lee, WJ Lee, JD Lee - Sensors, 2019
Cognitive Structural Accuracy V Frenz - 2019
Kidney organoid reproducibility across multiple human iPSC lines and diminished off target cells after transplantation revealed by single cell transcriptomics A Subramanian, EH Sidhom, M Emani et al. - BioRxiv, 2019
Multi-classifier majority voting analyses in provenance studies on iron artefacts G Żabiński et al. - Journal of Archaeological Science, 2020
Identifying inhibitors of epithelial–mesenchymal plasticity using a network topology-based approach K Hari et al. - NPJ systems biology and applications, 2020
Genetic differentiation and intrinsic genomic features explain variation in recombination hotspots among cocoa tree populations EJ Schwarzkopf et al. - BMC Genomics, 2020
Enhancing Card Sorting Dendrograms through the Holistic Analysis of Distance Methods and Linkage Criteria. JA Macías - Journal of Usability Studies, 2021
Pattern-based identification and mapping of landscape types using multi-thematic data J Nowosad, TF Stepinski - International Journal of Geographical Information, 2021
Motif Analysis in k-mer Networks: An Approach towards Understanding SARS-CoV-2 Geographical Shifts S Biswas, S Saha, S Bandyopadhyay, M Bhattacharyya - bioRxiv, 2020
Motif: an open-source R tool for pattern-based spatial analysis J Nowosad - Landscape Ecology, 2021
New effective spectral matching measures for hyperspectral data analysis C Kumar, S Chatterjee, T Oommen, A Guha - International Journal of Remote Sensing, 2021
Innovative activity of Polish enterprises–a strategic aspect. The similarity of NACE divisions E Bielińska-Dusza, M Hamerska - Journal of Entrepreneurship, Management and innovation, 2021
I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.
Furthermore, if you find bugs or need additional (more flexible) functionality in any part of this package, please let me know:
https://github.com/drostlab/philentropy/issues
or find me on Twitter: HajkDrost