Documentation
The examples listed here can be found in loanpy/docs/Loanpy_Documentation.ipynb (see GitHub) to run them offline on your computer, and on Google Colab to run them online in your browser.
Sound Correspondence Miner
The sound correspondence miner module contains several functions to extract and manipulate linguistic data stored in tab-separated tables. The main function is get_correspondences, which extracts sound and prosodic correspondences from the table and returns them as six dictionaries, each with corresponding frequencies and COGID values. The module also includes uralign, a function that aligns Uralic input strings based on custom rules, and get_heur, which computes a heuristic mapping between phonemes in a target language’s phoneme inventory and all phonemes in the IPA sound system, based on the Euclidean distance of their feature vectors. Finally, get_prosodic_inventory extracts all types of prosodic structures of a target language from a given etymological table.
- loanpy.scminer.get_correspondences(table: List[List[str]], heur: Dict[str, List[str]] = '') → List[Dict]
Get sound and prosodic correspondences from a given etymological table.
- Parameters:
table (list of lists) – A list of lists representing an etymological table. It must contain columns named ALIGNMENT, PROSODY, and COGID.
heur (dictionary with IPA characters as keys and, as values, a list of phonemes of a language's phoneme inventory ranked according to feature-vector similarity) – Optional dictionary containing heuristic correspondences to be merged with the output. Defaults to an empty string.
- Returns:
A list of six dictionaries containing correspondences and their frequencies:
Sound correspondences.
Frequency of sound correspondences.
COGID values for sound correspondences.
Prosodic correspondences.
Frequency of prosodic correspondences.
COGID values for prosodic correspondences.
- Return type:
list of six dictionaries
>>> from loanpy.scminer import get_correspondences
>>> input_table = [
...     ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...     ['0', '1', 'LG1', 'a b', 'VC'],
...     ['1', '1', 'LG2', 'c d', 'CC']
... ]
>>> get_correspondences(input_table)
[{'a': ['c'], 'b': ['d']}, {'a c': 1, 'b d': 1}, {'a c': [1], 'b d': [1]}, {'VC': ['CC']}, {'VC CC': 1}, {'VC CC': [1]}]
- loanpy.scminer.get_heur(tgtlg: str) → Dict[str, List[str]]
Rank the phonemes of a target language’s phoneme inventory according to feature-vector similarity to all IPA sounds. Relies on ./cldf/.transcription-report.json, which contains the phoneme inventory. The file ipa_all.csv contains all IPA sounds and their feature vectors and is shipped together with loanpy.
- Parameters:
tgtlg (str) – The ID of the target language, as defined in etc/languages.tsv in a CLDF repository.
- Returns:
A dictionary with IPA phonemes as keys and a list of the closest target language phonemes as values.
- Return type:
dict
- Raises:
FileNotFoundError – If the data file or the transcription report file is not found.
>>> from loanpy.scminer import get_heur
>>> get_heur("eng")
{'˩': ['a', 'b'], '˨': ['a', 'b'], '˧': ['a', 'b'], '˦': ['a', 'b'], '˥': ['a', 'b'], ...
- loanpy.scminer.get_prosodic_inventory(table: List[List[str]]) → List[str]
Extracts all types of prosodic structures (e.g. “CVCV”) from rows with an odd ID (i.e. where the data of the target language is located) in the given table.
- Parameters:
table (list of lists) – A table where every row is a list.
- Returns:
A list of prosodic structures (e.g. “CVCV”) that occur in the target language (i.e. in the odd-numbered rows)
- Return type:
list
>>> from loanpy.scminer import get_prosodic_inventory
>>> data = [
...     ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...     [0, 1, 'H', '#aː t͡ʃ# -#', 'VC'],
...     [1, 1, 'EAH', 'a.ɣ.a t͡ʃ i', 'VCVCV'],
...     [2, 2, 'H', '#aː ɟ uː#', 'VCV'],
...     [3, 2, 'EAH', 'a l.d a.ɣ', 'VCCVC'],
...     [4, 3, 'H', '#ɒ j n', 'VCC'],
...     [5, 3, 'EAH', 'a j.a n', 'VCVC']
... ]
>>> get_prosodic_inventory(data)
['VCVCV', 'VCCVC', 'VCVC']
- loanpy.scminer.uralign(left: str, right: str) → str
Aligns the left and right input strings based on a custom alignment for Hungarian-preHungarian.
The function splits the input strings by space, aligns them one by one and squeezes the remainder of the longer string into one single block, which can be seen as a suffix.
It then returns the aligned strings, joined by a newline character.
- Parameters:
left (str) – The left input string with space-separated IPA-sounds.
right (str) – The right input string with space-separated IPA-sounds.
- Returns:
The aligned left and right strings, separated by a newline character.
- Return type:
str
>>> from loanpy.scminer import uralign
>>> print(uralign("a b c", "f g h i j k"))
#a b c# -#
f g h ijk
Sound Correspondence Applier
This module provides tools for predicting and analyzing changes in the horizontal or vertical transfer of words. It includes the Adrc class, which supports the adaptation and reconstruction of words based on sound and prosodic correspondences and inventories. The module also contains functions for repairing phonotactics.
Horizontal transfer refers to the borrowing of words between languages in contact, while vertical transfer refers to the inheritance of words from a parent language to its descendants.
- class loanpy.scapplier.Adrc(sc: Union[str, Path] = '', prosodic_inventory: Union[str, Path] = '')
Adapt or Reconstruct (ADRC) class.
This class provides functionality for automatically adapting or reconstructing words of a language, based on sound and prosodic correspondences and inventories. Inputs are generated by loanpy.scminer.
- Parameters:
sc (str or pathlike object, optional) – The path to the sound-correspondence json-file containing the list of six dictionaries, outputted by loanpy.scminer.get_correspondences.
prosodic_inventory (str or pathlike object, optional) – Path to the prosodic inventory json-file.
>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.sc
[{'d': ['d', 't'], 'a': ['a', 'o']}, {'d d': 5, 'd t': 4, 'a a': 7, 'a o': 1}, {}, {'CVCV': ['CVC']}]
>>> adrc.prosodic_inventory
['CV', 'CVV']
- adapt(ipastr: Union[str, List[str]], howmany: int = 1, prosody: str = '') → List[str]
Predict the adaptation of a loanword in a target recipient language.
- Parameters:
ipastr (str) – Space-separated tokenised input IPA string.
howmany (int) – Number of adapted words to return. Default is 1.
prosody (str of 'C' and 'V') – Prosodic structure of the adapted words (e.g. “CVCV”). Default is an empty string. Providing this triggers phonotactic repair.
- Returns:
A list containing possible loanword adaptations.
- Return type:
list of str
>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.adapt("d a d a")
['dada']
>>> adrc.adapt("d a d a", 5)
['dada', 'data', 'doda', 'dota', 'tada']
>>> adrc.adapt("d a d a", 5, "CVCV")  # sc2.json says CVCV to CVC
['dad', 'dat', 'dod', 'dot', 'tad']
>>> adrc.adapt("d a d", 5, "CVC")  # no info on CVC in sc2.json
['da', 'do', 'ta', 'to']  # closest in inventory is "CV"
- get_closest_phonotactics(struc: str) → str
Get the closest prosodic structure (e.g. “CVCV”) from the prosodic inventory of a given language based on edit distance with two operations.
- Parameters:
struc (str) – The phonotactic structure to compare against.
- Returns:
The closest prosodic structure (e.g. “CVCV”) in the prosodic inventory.
- Return type:
str
>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.get_closest_phonotactics("CVC")
'CV'
>>> adrc.get_closest_phonotactics("CVCV")
'CVV'
- get_diff(sclistlist: List[List[str]], ipa: List[str]) → List[int]
Computes the difference in the number of examples between the current and the next sound correspondences for each phoneme or cluster in a word.
- Parameters:
sclistlist (list) – A list of sound correspondence lists.
ipa (list) – A list of IPA symbols representing the word.
- Returns:
A list of differences between the number of examples for each sound correspondence in the input word.
- Return type:
list
>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc()
>>> adrc.set_sc([{}, {"k k": 2, "k c": 1, "i e": 2, "i o": 1}, {}, {}, {}, {}, {}])
>>> sclistlist = [["k", "c", "$"], ["e", "o", "$"], ["k", "c", "$"], ["e", "o", "$"]]
>>> adrc.get_diff(sclistlist, ["k", "i", "k", "i"])
[1, 1, 1, 1]
- read_sc(ipa: List[str], howmany: int = 1) → List[List[str]]
Replaces every phoneme of a word with a list of phonemes that it can correspond to. The next phoneme it picks is always the one that makes the least difference in terms of absolute frequency.
- Parameters:
ipa (list) – a tokenized/clusterised word
howmany (int, default=1) – The desired number of possible combinations. If the prediction is wrong, all of these guesses are false positives; if it is right, all but one are.
- Returns:
The information about which sounds each input sound can correspond to.
- Return type:
list of lists
>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc()
>>> adrc.set_sc([{"k": ["k", "h"], "i": ["e", "o"]},
...              {"k k": 5, "k c": 3, "i e": 2, "i o": 1},
...              {}, {}, {}, {}, {}])
>>> adrc.read_sc(["k", "i"], 2)
[['k'], ['e', 'o']]
The difference between "i e" and "i o" is 2 - 1 = 1, and between "k k" and "k c" it is 5 - 3 = 2, so picking "o" makes less of a difference than picking "c".
- reconstruct(ipastr: str, howmany: int = 1) → str
Reconstructs a phonological form from a given IPA string using a sound correspondence dictionary.
- Parameters:
ipastr (str) – A string of space-separated IPA symbols representing the phonetic form from which to predict a reconstruction.
howmany (int) – The maximum number of predicted reconstructions to return. Default is 1.
- Returns:
A regular expression of predicted reconstructions from the given IPA string, based on the sound correspondence dictionary, or the phoneme plus “not old” if a phoneme is missing from the correspondence dictionaries.
- Return type:
str
>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.reconstruct("d a d a")
'^(d)(a)(d)(a)$'
>>> adrc.reconstruct("d a d a", 1000)
'^(d|t)(a|o)(d|t)(a|o)$'
>>> adrc.reconstruct("l a l a")
'l not old'
- repair_phonotactics(ipalist: List[str], prosody: str) → List[str]
Repairs the phonotactics (prosody) of an IPA string.
- Parameters:
ipalist (list of str) – A list of IPA symbols representing the input word.
prosody (str) – A string representing the prosodic structure of the input word.
- Returns:
A list of repaired IPA strings.
- Return type:
list of str
>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.repair_phonotactics(["d", "a", "d", "a"], "CVCV")
['d', 'a', 'd']
- set_prosodic_inventory(prosodic_inventory: List[str]) → None
Method to set the phonotactic inventory manually. Called by loanpy.eval_sca.eval_one.
- Parameters:
prosodic_inventory (list of strings) – The phonotactic inventory.
- Returns:
Sets the attribute .prosodic_inventory
- Return type:
None
>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.set_prosodic_inventory("rofl")
>>> adrc.prosodic_inventory
'rofl'
- set_sc(sc: List[dict]) → None
Method to set sound correspondences manually. Called by loanpy.eval_sca.eval_one.
- Parameters:
sc (list of 6 dicts) – The sound correspondence dictionaries.
- Returns:
Sets the attribute .sc
- Return type:
None
>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.set_sc("lol")
>>> adrc.sc
'lol'
- loanpy.scapplier.add_edge(graph: Dict[Tuple[int, int], List[Tuple[int, int, int]]], u: Tuple[int, int], v: Tuple[int, int], weight: int) → None
Add an edge to a graph. Called by loanpy.scapplier.mtx2graph.
- Parameters:
graph (dict) – The graph to be populated
u (tuple of two integers, e.g. (0, 0)) – Position of the starting node
v (tuple of two integers, e.g. (0, 1)) – Position of the ending node
weight (int) – The weight of the edge connecting the two nodes
- Returns:
Updates the graph in-place
- Return type:
None
>>> from loanpy.scapplier import add_edge
>>> graph = {'A': {'B': 3}}
>>> add_edge(graph, 'A', 'C', 7)
>>> graph
{'A': {'B': 3, 'C': 7}}
- loanpy.scapplier.apply_edit(word: Iterable[str], editops: List[str]) → List[str]
Called by loanpy.scapplier.Adrc.repair_phonotactics. Applies a list of human-readable edit operations to a string.
- Parameters:
word (an iterable, e.g. a list of phonemes or a string) – The input word
editops (list or tuple of strings) – A list of human-readable edit operations
- Returns:
transformed input word
- Return type:
list of str
>>> from loanpy.scapplier import apply_edit
>>> apply_edit(
...     ['f', 'ɛ', 'r', 'i', 'h', 'ɛ', 'ɟ'],
...     ('insert d', 'insert u', 'insert n', 'insert ɒ', 'insert p',
...      'substitute f by ɒ', 'delete ɛ', 'keep r', 'delete i',
...      'delete h', 'delete ɛ', 'substitute ɟ by t')
... )
['d', 'u', 'n', 'ɒ', 'p', 'ɒ', 'r', 't']
- loanpy.scapplier.dijkstra(graph: Dict[Tuple[int, int], Dict[Tuple[int, int], int]], start: Tuple[int, int], end: Tuple[int, int]) → Optional[List[Tuple[int, int]]]
Find the shortest path between two nodes in a weighted graph using Dijkstra’s algorithm.
Dijkstra’s algorithm is an algorithm for finding the shortest path between two nodes in a weighted graph. It maintains a priority queue of nodes to be expanded and their tentative distances from the start node. The algorithm iteratively extracts the node with the minimum tentative distance from the priority queue and updates the tentative distances of its neighbors if a shorter path is found.
- Parameters:
graph (dict) – A dictionary representing the weighted graph, where each key is a node and each value is a dictionary representing its neighbors and edge weights.
start (A tuple of two integers representing the node's position on the x and y axis.) – The starting node.
end (A tuple of two integers representing the node's position on the x and y axis.) – The ending node.
- Returns:
The shortest path between the start and end nodes, represented as a list of nodes in the order they are visited, or None if no path exists.
- Return type:
list or None
- Raises:
KeyError – If the start or end node is not in the graph.
>>> from loanpy.scapplier import dijkstra
>>> graph1 = {
...     'A': {'B': 1, 'C': 4},
...     'B': {'C': 2, 'D': 6},
...     'C': {'D': 3},
...     'D': {}
... }
>>> dijkstra(graph1, 'A', 'D')
['A', 'B', 'C', 'D']
- loanpy.scapplier.edit_distance_with2ops(string1: str, string2: str, w_del: Union[int, float] = 100, w_ins: Union[int, float] = 49) → Union[int, float]
Called by loanpy.scapplier.Adrc.get_closest_phonotactics. Takes two strings and calculates their similarity by allowing only two operations: insertion and deletion. An algorithmic implementation of the “Threshold Principle” (Paradis and LaCharité 1997: 385).
- Parameters:
string1 (str) – The first of two strings to be compared to each other
string2 (str) – The second of two strings to be compared to each other
w_del (int or float, default=100) – weight (cost) for deleting a phoneme. Default should always stay 100, since only relative costs between inserting and deleting count.
w_ins (int or float, default=49.) – weight (cost) for inserting a phoneme. Default 49 is in accordance with the “Threshold Principle”: 2 insertions (2*49=98) are cheaper than a deletion (100).
- Returns:
The distance between two input strings
- Return type:
int or float
>>> from loanpy.scapplier import edit_distance_with2ops
>>> edit_distance_with2ops("rajka", "ajka", w_del=100, w_ins=49)
100
>>> edit_distance_with2ops("ajka", "rajka", w_del=100, w_ins=49)
49
>>> edit_distance_with2ops("Bécs", "Pécs", w_del=100, w_ins=49)
149
>>> edit_distance_with2ops("Hegyeshalom", "Mosonmagyaróvár", w_del=100, w_ins=49)
1388
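The two-operation distance shown above can be reproduced with a standard dynamic-programming sketch (an illustration only, under the assumption that matching characters cost nothing; two_op_distance is a hypothetical name, not loanpy's implementation):

```python
def two_op_distance(s1, s2, w_del=100, w_ins=49):
    # dp[i][j]: cheapest way to turn s1[:i] into s2[:j] using only
    # deletions from s1 (cost w_del) and insertions into s2 (cost w_ins)
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        dp[i][0] = dp[i - 1][0] + w_del   # delete everything
    for j in range(1, len(s2) + 1):
        dp[0][j] = dp[0][j - 1] + w_ins   # insert everything
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            dp[i][j] = min(dp[i - 1][j] + w_del, dp[i][j - 1] + w_ins)
            if s1[i - 1] == s2[j - 1]:    # matching characters cost nothing
                dp[i][j] = min(dp[i][j], dp[i - 1][j - 1])
    return dp[-1][-1]

print(two_op_distance("rajka", "ajka"))  # 100: one deletion
print(two_op_distance("ajka", "rajka"))  # 49: one insertion
print(two_op_distance("Bécs", "Pécs"))   # 149: delete "B", insert "P"
```

The asymmetric weights make the distance direction-sensitive, which is exactly the point of the Threshold Principle: repairs prefer insertions over deletions.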
- loanpy.scapplier.get_mtx(target: Iterable, source: Iterable) → List[List[int]]
Called by loanpy.scapplier.Adrc.repair_phonotactics. Similar to loanpy.scapplier.edit_distance_with2ops, but without weights (i.e. deletion and insertion both always cost one) and the whole matrix is returned. Draws a matrix of minimum edit distances between every substring of the two input strings.
- Parameters:
target (iterable, e.g. str or list) – The target word
source (iterable, e.g. str or list) – The source word
- Returns:
A matrix where every cell tells the cost of turning one substring into the other (only delete and insert with cost 1 for each)
- Return type:
list of lists
>>> from loanpy.scapplier import get_mtx
>>> get_mtx("Bécs", "Pécs")
[[0, 1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 3, 2, 3, 4], [3, 4, 3, 2, 3], [4, 5, 4, 3, 2]]
What happens under the hood: deletion costs 1 and insertion costs 1, so the distances are ("#" stands for the empty string):
    #  B  C  D  E
#   0  1  2  3  4    (distance B-#=1, BC-#=2, BCD-#=3, BCDE-#=4)
D   1  2  3  2  3    (distance D-#=1, D-B=2, D-BC=3, D-BCD=2, D-BCDE=3)
E   2  3  4  3  2    (distance DE-#=2, DE-B=3, DE-BC=4, DE-BCD=3, DE-BCDE=2)
The minimum edit distance from BCDE to DE is 2: delete B and delete C.
- loanpy.scapplier.list2regex(sclist: List[str]) → str
Called by loanpy.scapplier.Adrc.reconstruct. Turns a list of phonemes into a regular expression.
- Parameters:
sclist (list of str) – a list of phonemes
sclist (list of str) – a list of phonemes
- Returns:
The phonemes from the input list, separated by pipes and wrapped in parentheses. “-” is removed and a question mark is appended to the end instead.
- Return type:
str
>>> from loanpy.scapplier import list2regex
>>> list2regex(["b", "k", "-", "v"])
'(b|k|v)?'
- loanpy.scapplier.move_sc(sclistlist: List[List[str]], whichsound: int, out: List[List[str]]) → Tuple[List[List[str]], List[List[str]]]
Moves a sound correspondence from the input list to the output list and updates both lists.
- Parameters:
sclistlist (list of lists) – A list of lists containing sound correspondences.
whichsound (int) – The index of the sound to be moved.
out (list of lists) – The output list where the sound correspondence will be moved to.
- Returns:
An updated tuple containing the modified sclistlist and out.
- Return type:
tuple of (list of lists, list of lists)
>>> from loanpy.scapplier import move_sc
>>> move_sc([["x", "x"]], 0, [[]])
([['x']], [['x']])
>>> move_sc([["x", "x"], ["y", "y"], ["z"]], 1, [["a"], ["b"], ["c"]])
([['x', 'x'], ['y'], ['z']], [['a'], ['b', 'y'], ['c']])
- loanpy.scapplier.mtx2graph(matrix: List[List[int]], w_del: int = 100, w_ins: int = 49) → Dict[Tuple[int, int], Dict[Tuple[int, int], int]]
Converts a distance matrix to a weighted directed graph.
- Parameters:
matrix (a list of lists of integers) – The distance matrix, generated by loanpy.scapplier.get_mtx.
w_del (int, default=100) – Weight of deletions, i.e. moving to the right through the matrix. According to the Threshold Principle of the Theory of Constraints and Repair Strategies (TCRS, Paradis and LaCharité 1997: 385), two insertions are cheaper than one deletion, so deletions cost 100 by default.
w_ins (int, default=49) – Weight of insertions, i.e. moving downwards through the matrix. Set to 49 by default, so that two insertions (2*49=98) are just cheaper than one deletion (100).
- Returns:
A directed graph with weighted edges
- Return type:
dictionary with tuples as keys and dictionaries as values. The value-dictionaries contain tuples as keys and weights (integers) as values. All tuples contain two integers that represent the position of a node in the matrix/graph, e.g. (0, 0).
>>> from loanpy.scapplier import mtx2graph
>>> mtx2graph([[0, 1, 2], [1, 2, 3], [2, 3, 2]])
{(0, 0): {(0, 1): 100, (1, 0): 49}, (0, 1): {(0, 2): 100, (1, 1): 49}, (0, 2): {(1, 2): 49}, (1, 0): {(1, 1): 100, (2, 0): 49}, (1, 1): {(1, 2): 100, (2, 1): 49, (2, 2): 0}, (1, 2): {(2, 2): 49}, (2, 0): {(2, 1): 100}, (2, 1): {(2, 2): 100}, (2, 2): {}}
- loanpy.scapplier.substitute_operations(operations: List[str]) → List[str]
Replaces subsequent “delete, insert” / “insert, delete” operations with “substitute”. Called by loanpy.scapplier.tuples2editops.
- Parameters:
operations (list of strings, e.g. ['insert l', 'delete h', 'keep ó']) – A list of human-readable edit operations
- Returns:
Updated operations
- Return type:
List of strings, e.g. [‘substitute l by h’, ‘keep ó’]
>>> from loanpy.scapplier import substitute_operations
>>> substitute_operations(['insert A', 'delete B', 'insert C'])
['substitute B by A', 'insert C']
>>> substitute_operations(['delete A', 'insert B', 'delete C', 'insert D'])
['substitute A by B', 'substitute C by D']
- loanpy.scapplier.tuples2editops(op_list: List[Tuple[int, int]], s1: str, s2: str) → List[str]
Called by loanpy.scapplier.editops. The path through the graph by which string1 is converted to string2 is given in the form of tuples that contain the x and y coordinates of every step through the matrix-shaped graph. This function converts those numerical instructions into human-readable ones. The x values stand for movement from left to right, the y values for movement downwards. Movement downwards means deletion, movement to the right means insertion. Diagonal movement means the value is kept. Moving to the right and downwards, or downwards and to the right, in immediate succession means substitution.
- Parameters:
op_list (list of tuples of 2 int) – The numeric list of edit operations
s1 (str) – The first of two strings to be compared to each other
s2 (str) – The second of two strings to be compared to each other
- Returns:
list of human readable edit operations
- Return type:
list of strings
>>> from loanpy.scapplier import tuples2editops
>>> tuples2editops([(0, 0), (0, 1), (1, 1), (2, 2)], "ló", "hó")
['substitute l by h', 'insert ó']
>>> tuples2editops([(0, 0), (1, 1), (2, 2), (2, 3)], "lóh", "ló")
['keep l', 'keep ó', 'delete h']
Evaluate Sound Correspondence Applier
This module focuses on evaluating the quality of adapted and reconstructed words. It processes the input data, which consists of tokenised IPA source and target strings, as well as prosodic strings, and extracts and applies correspondences to predict the best possible adaptations or reconstructions. The module then calculates the accuracy of the predictions by counting the relative number of false positives (how many guesses) vs true positives. Overall, this module aims to facilitate a deeper understanding of loanword adaptation and historical sound change processes by quantifying the success rate of predictive models.
- loanpy.eval_sca.eval_all(intable: List[List[str]], heur: Dict[str, List[str]], adapt: bool, guess_list: List[int], pros: bool = False) → List[Tuple[int, int]]
Input a loanpy-compatible table containing etymological data and start a nested for-loop:
The first loop goes through the numbers of guesses (~ false positives).
The second loop performs leave-one-out cross-validation with loanpy.eval_sca.eval_one.
The output is a list of tuples containing the relative number of true positives vs. the relative number of false positives.
- Parameters:
intable (list of lists) – The input tsv-table. Space-separated tokenised IPA source and target strings must be in column “ALIGNMENT”, prosodic strings in column “PROSODY”.
heur (str or pathlike object, optional) – The path to the heuristic sound correspondences file, e.g. “heur.json”, which was created with loanpy.scminer.get_heur.
adapt (bool) – Set to True to make predictions with loanpy.scapplier.Adrc.adapt, set to False to make predictions with loanpy.scapplier.Adrc.reconstruct.
guess_list (list of int) – The list of numbers of guesses to evaluate.
pros (bool, default=False) – Whether phonotactic repairs should be applied
- Returns:
A list of tuples of number pairs representing relative false positives vs. true positives
- Return type:
list of tuples of floats
>>> from loanpy.eval_sca import eval_all
>>> intable = [
...     ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...     ['0', '1', 'H', 'k i k i', 'VC'],
...     ['1', '1', 'EAH', 'k i g i', 'VCVCV'],
...     ['2', '2', 'H', 'i k k i', 'VCV'],
...     ['3', '2', 'EAH', 'i g k i', 'VCCVC']
... ]
>>> eval_all(intable, "", False, [1, 2, 3])
[(0.33, 0.0), (0.67, 1.0), (1.0, 1.0)]
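The (false positive, true positive) pairs returned by eval_all trace out a curve; a single summary score can be obtained with a trapezoidal area under that curve (a hedged post-processing sketch, not part of loanpy's API; auc is a hypothetical name):

```python
def auc(points):
    # points: (relative false positives, relative true positives) pairs,
    # e.g. the output of eval_all; prepend the origin, then integrate
    # with the trapezoidal rule
    pts = sorted([(0.0, 0.0)] + list(points))
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(round(auc([(0.33, 0.0), (0.67, 1.0), (1.0, 1.0)]), 2))  # 0.5
```

A score near 1 would mean most true positives are found with few guesses; the 0.5 here reflects that the irregular toy data only becomes predictable at higher guess counts.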
- loanpy.eval_sca.eval_one(intable: List[List[str]], heur: Dict[str, List[str]], adapt: bool, howmany: int, pros: bool = False) → float
Called by loanpy.eval_sca.eval_all. Loops through the loanpy-compatible etymological input table and performs leave-one-out cross-validation. The result is how many words were correctly predicted, relative to the total number of predictions made.
- Parameters:
intable (list of lists) – The input tsv-table. Space-separated tokenised IPA source and target strings must be in column “ALIGNMENT”, prosodic strings in column “PROSODY”.
heur (str or pathlike object, optional) – The path to the heuristic sound correspondences file, e.g. “heur.json”, which was created with loanpy.scminer.get_heur.
adapt (bool) – Set to True to make predictions with loanpy.scapplier.Adrc.adapt, set to False to make predictions with loanpy.scapplier.Adrc.reconstruct.
howmany (int) – How many guesses should be made. Treated as false positives.
pros (bool, default=False) – Whether phonotactic repairs should be applied
- Returns:
The ratio of successful predictions (rounded to 2 decimal places).
- Return type:
float
>>> from loanpy.eval_sca import eval_one
>>> intable = [  # regular sound correspondences
...     ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...     ['0', '1', 'H', 'k i k i', 'VC'],
...     ['1', '1', 'EAH', 'g i g i', 'VCVCV'],
...     ['2', '2', 'H', 'i k k i', 'VCV'],
...     ['3', '2', 'EAH', 'i g g i', 'VCCVC']
... ]
>>> eval_one(intable, "", False, 1)
1.0
>>> intable = [  # not enough regular sound correspondences
...     ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...     ['0', '1', 'H', 'k i k i', 'VC'],
...     ['1', '1', 'EAH', 'g i g i', 'VCVCV'],
...     ['2', '2', 'H', 'b u b a', 'VCV'],
...     ['3', '2', 'EAH', 'p u p a', 'VCCVC']
... ]
>>> eval_one(intable, "", False, 1)
0.0
>>> intable = [  # irregular sound correspondences
...     ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...     ['0', '1', 'H', 'k i k i', 'VC'],
...     ['1', '1', 'EAH', 'k i g i', 'VCVCV'],
...     ['2', '2', 'H', 'i k k i', 'VCV'],
...     ['3', '2', 'EAH', 'i g k i', 'VCCVC']
... ]
>>> eval_one(intable, "", False, 1)
0.0
>>> eval_one(intable, "", False, 2)  # increase rate of false positives
1.0
Loan Finder
This module is designed to identify potential loanwords between a hypothesised donor and recipient language. It processes two input dataframes, one representing the donor language with predicted adapted forms and the other the recipient language with predicted reconstructions. The module first finds phonetic matches between the two languages and then calculates their semantic similarity. The output is a list of candidate loanwords, which can be further analysed manually.
The two functions in this module are responsible for finding phonetic matches between the given donor and recipient language data and calculating their semantic similarity. These functions process the input dataframes and compare the phonetic patterns, as well as calculate the semantic similarity based on a user-provided function. The module returns a list of candidate loanwords that show phonetic and semantic similarities. The output can then be used to propose lexical borrowings, adaptation patterns, and historical reconstructions for words of the proposed donor and recipient language.
- loanpy.loanfinder.phonetic_matches(df_rc: List[List[str]], df_ad: List[List[str]], output: Union[str, Path]) → None
Find phonetic matches between the given donor and recipient TSV files.
The function processes the donor and recipient data frames, compares the phonetic patterns, and writes the result as a tsv-file to the specified output-path.
- Parameters:
df_ad (list of lists. Column 2 (index 1) must be a foreign key, and Column 3 (index 2) a predicted loanword adaptation.) – Table of the donor language data with adapted forms.
df_rc (list of lists. Column 2 (index 1) must be a foreign key, and Column 3 (index 2) a predicted reconstruction, ideally a regular expression.) – Table of the recipient language data with reconstructed forms.
output (str or pathlike object) – The path to the output-file
- Returns:
Writes a tsv-file containing the matched data, with the following columns: ID – the primary key of the table, ID_rc – the foreign key of the reconstruction, ID_ad – the foreign key of the adaptation.
- Return type:
None
>>> from loanpy.loanfinder import phonetic_matches
>>> donor = [
...     ['a0', 'Donorese-0', 'igig'],
...     ['a1', 'Donorese-1', 'iggi']
... ]
>>> recipient = [
...     ['0', 'Recipientese-0', '^(i|u)(g)(g)(i|u)$'],
...     ['1', 'Recipientese-1', '^(i|u)(i|u)(g)(g)$']
... ]
>>> outpath = "examples/phonetic_matches.tsv"
>>> phonetic_matches(recipient, donor, outpath)
>>> with open(outpath, "r") as f:
...     print(f.read())
ID  ID_rc  ID_ad
0  Recipientese-0  Donorese-1
- loanpy.loanfinder.semantic_matches(df_phonmatch: List[List[str]], get_semsim: Callable[[Any, Any], Union[float, int]], output: Union[str, Path], thresh: Union[int, float] = 0) → None
Calculate the semantic similarity between pairs of rows in the input data frame using the provided semantic-similarity scoring function, add the information about the semantic similarity to each row. Write the file to the provided output-path.
- Parameters:
df_phonmatch (list of lists of strings) – Phonetic matches table, generated by loanpy.loanfinder.phonetic_matches. Each sublist represents a row of data. The first sublist should contain the header row, and each subsequent sublist should contain the data for one row. The meanings have to be in columns 4 and 5 (index 3 and 4).
get_semsim (function) – A function that calculates the semantic similarity between two strings.
output (str or pathlike object) – The path to the output-file
thresh (float, int) – The threshold above which semantic matches count
- Returns:
writes a tsv-file representing semantic matches in the input table with the added column of semantic similarity.
- Return type:
None
>>> from loanpy.loanfinder import semantic_matches
>>> def getsemsim(x, y):
...     return 3
>>> phmtsv = [
...     ["ID", "ID_rc", "ID_ad"],
...     ["0", "Recipientese-0", "Donorese-1", "cat", "dog"],
... ]
>>> outpath = "examples/phonetic_matches.tsv"
>>> semantic_matches(phmtsv, getsemsim, outpath)
>>> with open(outpath, "r") as f:
...     print(f.read())
ID  ID_rc  ID_ad  semsim
0  Recipientese-0  Donorese-1  0.75
Utility Functions
Module focusing on functions to support generating and preprocessing loanpy-compatible input data.
This module contains functions for optimal year cutoffs, manipulating IPA data, and processing cognate sets. It provides helper functions for reading and processing linguistic datasets and performing various operations such as filtering and validation.
- class loanpy.utils.IPA
Class built on loanpy’s modified version of panphon’s ipa_all.csv table to handle certain tasks that require IPA data.
>>> from loanpy.utils import IPA
>>> ipa = IPA()
>>> type(ipa.vowels)
<class 'list'>
>>> len(ipa.vowels)
1464
>>> ipa.vowels[0]
'ʋ̥'
- get_clusters(segments: Iterable[str]) → str
Takes a list of phonemes and segments it into consonant and vowel clusters, like so: “abcdeaofgh” -> “a b.c.d e.a.o f.g.h”.
- Parameters:
segments (iterable) – A word, ideally as a list of IPA symbols
- Returns:
Same word but with consonants and vowels clustered together
- Return type:
str
>>> from loanpy.utils import IPA
>>> ipa = IPA()
>>> ipa.get_clusters(["r", "a", "u", "f", "l"])
'r a.u f.l'
- get_cv(ipastr: str) → str
This method takes an IPA string as input and returns either “V” if the string is a vowel or “C” if it is a consonant.
- Parameters:
ipastr (str) – An IPA string representing a phonetic character.
- Returns:
A string “V” if the input IPA string is a vowel, or “C” if it is a consonant.
- Return type:
str
>>> from loanpy.utils import IPA
>>> ipa = IPA()
>>> ipa.get_cv("p")
'C'
>>> ipa.get_cv("u")
'V'
- get_prosody(ipastr: str) str
Generate a prosodic string from an IPA string.
This function takes an IPA string as input and generates a prosodic string by classifying each phoneme as a vowel (V) or consonant (C).
- Parameters:
ipastr (str) – The tokenised input IPA string. Phonemes must be separated by space or dot.
- Returns:
The generated prosodic string.
- Return type:
str
>>> from loanpy.utils import IPA
>>> ipa = IPA()
>>> ipa.get_prosody("l o l")
'CVC'
>>> ipa.get_prosody("r o f.l")
'CVCC'
- loanpy.utils.cvgaps(str1: str, str2: str) List[str]
Replace gaps in the first input string based on the second input string.
This function takes two aligned strings, replaces “-” in the first string with either “C” (consonant) or “V” (vowel) depending on the corresponding character in the second string, and returns the new strings as a list.
- Parameters:
str1 (str) – The first aligned input string.
str2 (str) – The second aligned input string.
- Returns:
A list containing the modified first string and the unchanged second string.
- Return type:
list of strings
>>> from loanpy.utils import cvgaps
>>> cvgaps("b l -", "b l a")
['b l V', 'b l a']
>>> cvgaps("b - a", "b l a")
['b C a', 'b l a']
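The replacement logic can be sketched in a few lines. The following is a hypothetical re-implementation that uses a toy vowel set for classification, whereas the real function presumably relies on loanpy's IPA class:

```python
def cvgaps_sketch(str1, str2, vowels=frozenset("aeiou")):
    # Walk both aligned strings in parallel; wherever str1 has a gap
    # ("-"), substitute "V" or "C" based on the aligned segment in str2.
    out = []
    for s1, s2 in zip(str1.split(), str2.split()):
        out.append(("V" if s2 in vowels else "C") if s1 == "-" else s1)
    return [" ".join(out), str2]

print(cvgaps_sketch("b l -", "b l a"))  # ['b l V', 'b l a']
print(cvgaps_sketch("b - a", "b l a"))  # ['b C a', 'b l a']
```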
- loanpy.utils.find_optimal_year_cutoff(tsv: List[List[str]], origins: Iterable) int
Determine the optimal year cutoff for a given dataset and origins.
This function reads the TSV content for the given dataset and origins, accumulates the count of words with the specified origin up to each year, and selects the cutoff year whose point lies closest, by Euclidean distance, to the upper-left corner of a coordinate system in which the relative increase in years is plotted on the x-axis and the relative increase in the cumulative number of words on the y-axis.
- Parameters:
tsv (list of list of strings) – A table where the first row is the header
origins (a set of strings) – A set of origins to be considered for counting words.
- Returns:
The optimal year cutoff for the dataset and origins.
- Return type:
int
>>> from loanpy.utils import find_optimal_year_cutoff
>>> tsv = [
...     ['form', 'sense', 'Year', 'Etymology', 'Loan'],
...     ['gulyás', 'goulash, Hungarian stew', '1800', 'unknown', ''],
...     ['Tisza', 'a major river in Hungary', '1230', 'uncertain', ''],
...     ['Pest', 'part of Budapest, the capital', '1241', 'Slavic', 'True'],
...     ['paprika', 'ground red pepper, spice', '1598', 'Slavic', 'True']
... ]
>>> find_optimal_year_cutoff(tsv, "Slavic")
1241
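The selection criterion described above can be sketched as follows. This is a hypothetical re-implementation of the distance-to-corner idea, not loanpy's exact code; it takes pre-extracted (year, origin) pairs rather than a full TSV table:

```python
import math

def optimal_cutoff_sketch(year_origin_pairs, origins):
    # Sort rows by year and count how many words of the given origins
    # have accumulated up to each year.
    pairs = sorted(year_origin_pairs)
    years = [y for y, _ in pairs]
    total = sum(1 for _, o in pairs if o in origins)
    y_min, y_span = years[0], years[-1] - years[0]
    best_year, best_dist, cum = None, float("inf"), 0
    for year, origin in pairs:
        cum += origin in origins
        x = (year - y_min) / y_span   # relative increase of years
        y = cum / total               # relative cumulative word count
        d = math.hypot(x, 1 - y)      # distance to upper-left corner (0, 1)
        if d < best_dist:
            best_year, best_dist = year, d
    return best_year

print(optimal_cutoff_sketch(
    [(1800, "unknown"), (1230, "uncertain"), (1241, "Slavic"), (1598, "Slavic")],
    {"Slavic"}))  # 1241
```

With the example data above, 1241 wins because almost the entire cumulative count of Slavic words is already reached at a small relative year increase.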
- loanpy.utils.is_same_length_alignments(data: List[List[str]]) bool
Check if alignments within a cognate set have the same length.
This function iterates over the input data and asserts that the alignments within each cognate set have the same length. Alignments are expected to be in column 4 (index 3).
- Parameters:
data (list of list of strings) – A list of lists containing language data. No header.
- Returns:
True if all alignments within each cognate set have the same length, False otherwise.
- Return type:
bool
>>> from loanpy.utils import is_same_length_alignments
>>> is_same_length_alignments([[0, 1, 2, "a - c", 4, 5], [0, 1, 2, "d e f", 4, 5]])
True
>>> is_same_length_alignments([[0, 1, 2, "a b c", 4, 5], [0, 1, 2, "d e", 4, 5]])
2023-04-25 23:08:05,042 - INFO - 0 ['a', '-', 'c'] ['d', 'e']
False
- loanpy.utils.is_valid_language_sequence(data: List[List[str]], source_lang: str, target_lang: str) bool
Validate if the data has a valid alternating sequence of source and target language.
The data is expected to have language IDs in the third column (index 2). The sequence should be: source_lang, target_lang, source_lang, target_lang, …
- Parameters:
data (list) – A list of lists containing language data. No header.
source_lang (str) – The expected source language ID.
target_lang (str) – The expected target language ID.
- Returns:
True if the sequence is valid, False otherwise.
- Return type:
bool
>>> from loanpy.utils import is_valid_language_sequence
>>> data = [
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x']
... ]
>>> is_valid_language_sequence(data, "de", "en")
True
>>> data = [
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
...     ['x', 'x', 'nl', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x']
... ]
>>> is_valid_language_sequence(data, "de", "en")
2023-04-25 23:04:07,532 - INFO - Problem in row 5
False
- loanpy.utils.modify_ipa_all(input_file: Union[str, Path], output_file: Union[str, Path]) None
The original file is ipa_all.csv from the folder data in panphon 0.20.0 and was copied with the permission of its author. The ipa_all.csv table of loanpy was created with this function. The following modifications are undertaken:
- All + signs are replaced by 1, all - signs by -1.
- Two phonemes are appended to the column ipa, namely “C” and “V”, meaning “any consonant” and “any vowel”.
- Any phoneme containing “j”, “w”, or “ʔ” is redefined as a consonant.
- Parameters:
input_file (str or path-like object) – The path to the file ipa_all.csv.
output_file (str or path-like object) – The name and path of the new csv-file that is to be written.
- Returns:
Write new file
- Return type:
None
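The three documented transformations can be illustrated on a tiny excerpt. This is a hypothetical sketch, not the library's code: the real ipa_all.csv has 25 columns (see read_ipa_all below), and the feature values chosen here for the wildcard rows “C” and “V” are placeholders.

```python
import csv
import io

# Tiny two-row stand-in for ipa_all.csv; column names follow the real header.
raw = "ipa,syl,cons\np,-,+\nj,+,-\n"
rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]
cons_idx = header.index("cons")
for row in body:
    # 1. replace every "+" with "1" and every "-" with "-1" in the feature cells
    row[1:] = ["1" if c == "+" else "-1" if c == "-" else c for c in row[1:]]
    # 3. any phoneme containing "j", "w", or "ʔ" is redefined as a consonant
    if any(ch in row[0] for ch in "jwʔ"):
        row[cons_idx] = "1"
# 2. append the wildcard phonemes "C" (any consonant) and "V" (any vowel);
#    the feature values given here are placeholders
body.append(["C", "-1", "1"])
body.append(["V", "1", "-1"])
print(body)  # [['p', '-1', '1'], ['j', '1', '1'], ['C', '-1', '1'], ['V', '1', '-1']]
```

Note how the glide “j”, marked - for cons in panphon, ends up with cons = 1 after step 3.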
- loanpy.utils.prefilter(data: List[List[str]], srclg: str, tgtlg: str) List[List[str]]
Filter dataset to keep only cognate sets where both source and target languages occur.
This function filters the input dataset to retain only the cognate sets where both source and target languages are present. The filtered dataset is then sorted based on cognate set ID and language ID.
- Parameters:
data (list of list of strings) – A list of lists containing language data. Columns Language_ID and Cognacy must be provided.
srclg (str) – The source language ID to be considered.
tgtlg (str) – The target language ID to be considered.
- Returns:
A filtered and sorted list of lists containing cognate sets with both source and target languages present.
- Return type:
list of list of strings
>>> from loanpy.utils import prefilter
>>> data = [
...     ['x', 'x', 'Language_ID', 'x', 'x', 'x', 'x', 'x', 'x', 'Cognacy', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '2', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '3', 'x'],
...     ['x', 'x', 'nl', 'x', 'x', 'x', 'x', 'x', 'x', '4', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '4', 'x'],
...     ['x', 'x', 'nl', 'x', 'x', 'x', 'x', 'x', 'x', '5', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '5', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
...     ['x', 'x', 'nl', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x']
... ]
>>> prefilter(data, "de", "en")
[['x', 'x', 'Language_ID', 'x', 'x', 'x', 'x', 'x', 'x', 'Cognacy', 'x'],
 ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
 ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
 ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
 ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
 ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
 ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x']]
- loanpy.utils.prod(iterable: Iterable[Union[int, float]]) Union[int, float]
Calculate the product of all elements in an iterable.
This function takes an iterable (e.g., list, tuple) as input and computes the product of all its elements. This function had to be hard-coded because
from math import prod
caused incompatibility issues with some Python versions on certain platforms.
- Parameters:
iterable (Iterable[int] or Iterable[float]) – The input iterable containing numbers.
- Returns:
The product of all elements in the input iterable.
- Return type:
int or float
>>> from loanpy.utils import prod
>>> prod([1, 2, 3])  # one times two times three
6
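The same product can be expressed with functools.reduce, which is presumably close to what the hard-coded version does internally. This is a sketch of an equivalent, not loanpy's actual implementation:

```python
from functools import reduce
from operator import mul

def prod_sketch(iterable):
    # Multiply all elements together; 1 is the neutral starting value,
    # so the empty product is 1.
    return reduce(mul, iterable, 1)

print(prod_sketch([1, 2, 3]))  # 6
print(prod_sketch([]))         # 1
```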
- loanpy.utils.read_ipa_all() List[List[str]]
This function reads the
ipa_all.csv
table located in the same directory as the loanpy modules and returns it as a list of lists.
- Returns:
A list of lists containing IPA data read from ipa_all.csv.
- Return type:
list of list of strings
>>> from loanpy.utils import read_ipa_all
>>> ipa_all = read_ipa_all()
>>> type(ipa_all)
<class 'list'>
>>> len(ipa_all)
6492
>>> ipa_all[:2]
[['ipa', 'syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas', 'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo', 'back', 'round', 'velaric', 'tense', 'long', 'hitone', 'hireg'], ['˩', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '-1', '-1']]
- loanpy.utils.scjson2tsv(jsonin: Union[str, Path], outtsv: Union[str, Path], outtsv_phonotactics: Union[str, Path]) None
Turn a computer-readable sound correspondence JSON file into a human-readable tab-separated values (TSV) file.
- read json
- put information into columns
- write file
- Parameters:
jsonin (str or path-like object) – The name of the json-file containing the sound correspondences to be converted
outtsv (str or path-like object) – The name of the output file containing the sound correspondences. Should end in “.tsv”.
outtsv_phonotactics (str or path-like object) – The name of the output file containing the phonotactic (= prosodic) correspondences. Should end in “.tsv”.
- Returns:
Writes two TSV files to the two specified output paths.
- Return type:
None
>>> from loanpy.utils import scjson2tsv
>>> scjson2tsv("sc.json", "sc.tsv", "sc_p.tsv")
>>> with open("sc.tsv", "r") as f:
...     print(f.read())
sc  src  tgt  freq  CogID
a o  a  o  1  512
a e  a  e  2  3, 4
>>> with open("sc_p.tsv", "r") as f:
...     print(f.read())
sc  src  tgt  freq  CogID
CV CV  CV  CV  1  7