Documentation

The examples listed here can be found in loanpy/docs/Loanpy_Documentation.ipynb (see GitHub), where you can run them offline on your computer, or on Google Colab, where you can run them online in your browser.

Sound Correspondence Miner

The sound correspondence miner module contains several functions to extract and manipulate linguistic data stored in tab-separated tables. The main function is get_correspondences, which extracts sound and prosodic correspondences from the table and returns them as six dictionaries: the correspondences themselves, their frequencies, and their COGID values, for sounds and for prosodic structures respectively. The module also includes uralign, a function that aligns Uralic input strings based on custom rules, and get_heur, which computes a heuristic mapping between the phonemes of a target language’s phoneme inventory and all phonemes of the IPA sound system, based on the Euclidean distance of their feature vectors. Finally, get_prosodic_inventory extracts all types of prosodic structures of a target language from a given etymological table.

loanpy.scminer.get_correspondences(table: List[List[str]], heur: Dict[str, List[str]] = '') → List[Dict]

Get sound and prosodic correspondences from a given table string.

Parameters:
  • table (list of lists) – A list of lists representing an etymological table. It must contain columns named ALIGNMENT, PROSODY, and COGID.

  • heur (dictionary with IPA characters as keys and, as values, lists of a language's phonemes ranked according to feature-vector similarity, as returned by loanpy.scminer.get_heur) – Optional dictionary containing heuristic correspondences to be merged into the output. Defaults to an empty string.

Returns:

A list of six dictionaries containing correspondences and their frequencies:

  1. Sound correspondences.

  2. Frequency of sound correspondences.

  3. COGID values for sound correspondences.

  4. Prosodic correspondences.

  5. Frequency of prosodic correspondences.

  6. COGID values for prosodic correspondences.

Return type:

list of six dictionaries

Run in Google Colab >>

>>> from loanpy.scminer import get_correspondences
>>> input_table = [
...     ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...     ['0', '1', 'LG1', 'a b', 'VC'],
...     ['1', '1', 'LG2', 'c d', 'CC']
... ]
>>> get_correspondences(input_table)
[{'a': ['c'], 'b': ['d']},
 {'a c': 1, 'b d': 1},
 {'a c': [1], 'b d': [1]},
 {'VC': ['CC']},
 {'VC CC': 1},
 {'VC CC': [1]}]
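
Continuing the example above: the six dictionaries are typically serialised to JSON so that loanpy.scapplier.Adrc (documented below) can read them back in. A minimal sketch, assuming the standard library's json module and a hypothetical output path "sc.json":

>>> import json
>>> with open("sc.json", "w") as f:
...     json.dump(get_correspondences(input_table), f)
>>> # "sc.json" can now be passed as the sc argument of loanpy.scapplier.Adrc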
loanpy.scminer.get_heur(tgtlg: str) → Dict[str, List[str]]

Rank the phonemes of a target language’s phoneme inventory according to their feature-vector similarity to all IPA sounds. Relies on ./cldf/.transcription-report.json, which contains the phoneme inventory. The file ipa_all.csv, which is shipped together with loanpy, contains all IPA sounds and their feature vectors.

Parameters:

tgtlg (str) – The ID of the target language, as defined in etc/languages.tsv in a CLDF repository.

Returns:

A dictionary with IPA phonemes as keys and a list of closest target language phonemes as values.

Return type:

dict

Raises:

FileNotFoundError – If the data file or the transcription report file is not found.

Run in Google Colab >>

>>> from loanpy.scminer import get_heur
>>> get_heur("eng")
{'˩': ['a', 'b'],
 '˨': ['a', 'b'],
 '˧': ['a', 'b'],
 '˦': ['a', 'b'],
 '˥': ['a', 'b'],
 ...
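
Conceptually, the ranking works like the simplified sketch below: each IPA sound is compared to each inventory phoneme via the Euclidean distance of their feature vectors, and the inventory is sorted by that distance. The toy vectors and inventory are assumptions for illustration, not loanpy's actual data:

>>> # toy feature vectors with values in {-1, 0, 1}, as in ipa_all.csv
>>> vectors = {"a": [1, 1, -1], "b": [-1, -1, 1], "p": [-1, -1, -1]}
>>> inventory = ["a", "b"]  # hypothetical target-language inventory
>>> def rank(sound):
...     return sorted(inventory, key=lambda p: sum(
...         (x - y) ** 2 for x, y in zip(vectors[sound], vectors[p])) ** 0.5)
>>> rank("p")  # "b" differs from "p" in one feature, "a" in all three
['b', 'a']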
loanpy.scminer.get_prosodic_inventory(table: List[List[str]]) → List[str]

Extracts all types of prosodic structures (e.g. “CVCV”) from the rows of the given table with an odd-numbered ID (i.e. the rows containing target-language data).

Parameters:

table (list of lists) – A table where every row is a list.

Returns:

A list of prosodic structures (e.g. “CVCV”) that occur in the target language (i.e. in the odd-numbered rows)

Return type:

list

Run in Google Colab >>

>>> from loanpy.scminer import get_prosodic_inventory
>>>
>>> data = [  ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...   [0, 1, 'H', '#aː t͡ʃ# -#', 'VC'],
...   [1, 1, 'EAH', 'a.ɣ.a t͡ʃ i', 'VCVCV'],
...   [2, 2, 'H', '#aː ɟ uː#', 'VCV'],
...   [3, 2, 'EAH', 'a l.d a.ɣ', 'VCCVC'],
...   [4, 3, 'H', '#ɒ j n', 'VCC'],
...   [5, 3, 'EAH', 'a j.a n', 'VCVC']
... ]
>>> get_prosodic_inventory(data)
['VCVCV', 'VCCVC', 'VCVC']
loanpy.scminer.uralign(left: str, right: str) → str

Aligns the left and right input strings based on a custom alignment for Hungarian-preHungarian.

The function splits the input strings by space, aligns them one by one and squeezes the remainder of the longer string into one single block, which can be seen as a suffix.

It then returns the aligned strings, joined by a newline character.

Parameters:
  • left (str) – The left input string with space-separated IPA-sounds.

  • right (str) – The right input string with space-separated IPA-sounds.

Returns:

The aligned left and right strings, separated by a newline character.

Return type:

str

Run in Google Colab >>

>>> from loanpy.scminer import uralign
>>> print(uralign("a b c", "f g h i j k").replace(" ", "        "))
#a      b       c#      -#
f       g       h       ijk

Sound Correspondence Applier

This module provides tools for predicting and analysing the changes that words undergo during horizontal or vertical transfer. It includes the Adrc class, which supports the adaptation and reconstruction of words based on sound and prosodic correspondences and inventories. The module also contains functions for repairing phonotactics.

Horizontal transfer refers to the borrowing of words between languages in contact, while vertical transfer refers to the inheritance of words from a parent language to its descendants.

class loanpy.scapplier.Adrc(sc: Union[str, Path] = '', prosodic_inventory: Union[str, Path] = '')

Adapt or Reconstruct (ADRC) class.

This class provides functionality for automatically adapting or reconstructing words of a language, based on sound and prosodic correspondences and inventories. Inputs are generated by loanpy.scminer.

Parameters:
  • sc (str or pathlike object, optional) – The path to the sound-correspondence json-file containing the list of six dictionaries, outputted by loanpy.scminer.get_correspondences

  • prosodic_inventory (str or pathlike object, optional) – Path to the prosodic inventory json-file.

Run in Google Colab >>

>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.sc
[{'d': ['d', 't'], 'a': ['a', 'o']},
 {'d d': 5, 'd t': 4, 'a a': 7, 'a o': 1},
 {},
 {'CVCV': ['CVC']}]
>>> adrc.prosodic_inventory
['CV', 'CVV']
adapt(ipastr: Union[str, List[str]], howmany: int = 1, prosody: str = '') → List[str]

Predict the adaptation of a loanword in a target recipient language.

Parameters:
  • ipastr (str) – Space-separated tokenised input IPA string.

  • howmany (int) – Number of adapted words to return. Default is 1.

  • prosody (str of 'C' and 'V') – Prosodic structure of the adapted words (e.g. “CVCV”). Default is an empty string. Providing this triggers phonotactic repair.

Returns:

A list containing possible loanword adaptations.

Return type:

list of str

Run in Google Colab >>

>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.adapt("d a d a")
['dada']
>>> adrc.adapt("d a d a", 5)
['dada', 'data', 'doda', 'dota', 'tada']
>>> adrc.adapt("d a d a", 5, "CVCV")  # sc2.json says CVCV to CVC
['dad', 'dat', 'dod', 'dot', 'tad']
>>> adrc.adapt("d a d", 5, "CVC")   # no info on CVC in sc2.json
['da', 'do', 'ta', 'to']
# closest in inventory is "CV"
get_closest_phonotactics(struc: str) → str

Get the closest prosodic structure (e.g. “CVCV”) from the prosodic inventory of a given language based on edit distance with two operations.

Parameters:

struc (str) – The phonotactic structure to compare against.

Returns:

The closest prosodic structure (e.g. “CVCV”) in the prosodic inventory.

Return type:

str

Run in Google Colab >>

>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.get_closest_phonotactics("CVC")
'CV'
>>> adrc.get_closest_phonotactics("CVCV")
'CVV'
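
Under the hood, this presumably amounts to minimising loanpy.scapplier.edit_distance_with2ops (documented below) over the prosodic inventory; a hedged one-line sketch:

>>> from loanpy.scapplier import edit_distance_with2ops
>>> inventory = ["CV", "CVV"]  # as in examples/inv.json
>>> min(inventory, key=lambda s: edit_distance_with2ops("CVC", s))
'CV'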
get_diff(sclistlist: List[List[str]], ipa: List[str]) → List[int]

Computes the difference in the number of examples between the current and the next sound correspondences for each phoneme or cluster in a word.

Parameters:
  • sclistlist (list) – A list of sound correspondence lists.

  • ipa (list) – A list of IPA symbols representing the word.

Returns:

A list of differences between the number of examples for each sound correspondence in the input word.

Return type:

list

Run in Google Colab >>

>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc()
>>> adrc.set_sc([{}, {"k k": 2, "k c": 1, "i e": 2, "i o": 1}, {}, {}, {}, {}, {}])
>>> sclistlist = [["k", "c", "$"], ["e", "o", "$"], ["k", "c", "$"], ["e", "o", "$"]]
>>> adrc.get_diff(sclistlist, ["k", "i", "k", "i"])
[1, 1, 1, 1]
read_sc(ipa: List[str], howmany: int = 1) → List[List[str]]

Replaces every phoneme of a word with a list of phonemes that it can correspond to. The next phoneme it picks is always the one that makes the least difference in terms of absolute frequency.

Parameters:
  • ipa (list) – a tokenized/clusterised word

  • howmany (int, default=1) – The desired number of possible combinations. If the prediction is wrong, this equals the number of false positives; if it is right, the number of false positives is howmany minus one.

Returns:

The information about which sounds each input sound can correspond to.

Return type:

list of lists

Run in Google Colab >>

>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc()
>>> adrc.set_sc([{"k": ["k", "h"], "i": ["e", "o"]},
...              {"k k": 5, "k c": 3, "i e": 2, "i o": 1},
...              {}, {}, {}, {}, {}])
>>> sclistlist = [["k", "c", "$"], ["e", "o", "$"], ["k", "c", "$"], ["e", "o", "$"]]
>>> adrc.read_sc(["k", "i"], 2)
[['k'], ['e', 'o']]
# difference between i e and i o = 2 - 1 = 1
# and between k k and k c = 5 - 3 = 2
# so picking the "o" makes less of a difference than picking the "c"
reconstruct(ipastr: str, howmany: int = 1) → str

Reconstructs a phonological form from a given IPA string using a sound correspondence dictionary.

Parameters:
  • ipastr (str) – A string of space-separated IPA symbols representing the phonetic form from which to predict a reconstruction.

  • howmany (int) – The maximum number of predicted reconstructions to return. Default is 1.

Returns:

A regular expression of predicted reconstructions from the given IPA string, based on the sound correspondence dictionary, or the phoneme plus “not old” if a phoneme is missing from the correspondence dictionaries.

Return type:

str

Run in Google Colab >>

>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.reconstruct("d a d a")
'^(d)(a)(d)(a)$'
>>> adrc.reconstruct("d a d a", 1000)
'^(d|t)(a|o)(d|t)(a|o)$'
>>> adrc.reconstruct("l a l a")
'l not old'
repair_phonotactics(ipalist: List[str], prosody: str) → List[str]

Repairs the phonotactics (prosody) of an IPA string.

Parameters:
  • ipalist (list of str) – A list of IPA symbols representing the input word.

  • prosody (str) – A string representing the prosodic structure of the input word.

Returns:

A list of repaired IPA strings.

Return type:

list of str

Run in Google Colab >>

>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.repair_phonotactics(["d", "a", "d", "a"], "CVCV")
['d', 'a', 'd']
set_prosodic_inventory(prosodic_inventory: List[str]) → None

Method to set the phonotactic inventory manually. Called by loanpy.eval_sca.eval_one.

Parameters:

prosodic_inventory (list of strings) – The phonotactic inventory.

Returns:

Sets the attribute .prosodic_inventory in place

Return type:

None

Run in Google Colab >>

>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.set_prosodic_inventory("rofl")
>>> adrc.prosodic_inventory
'rofl'
set_sc(sc: List[dict]) → None

Method to set sound correspondences manually. Called by loanpy.eval_sca.eval_one.

Parameters:

sc (list of 6 dicts) – The sound correspondence dictionary.

Returns:

Sets the attribute .sc in place

Return type:

None

Run in Google Colab >>

>>> from loanpy.scapplier import Adrc
>>> adrc = Adrc("examples/sc2.json", "examples/inv.json")
>>> adrc.set_sc("lol")
>>> adrc.sc
'lol'
loanpy.scapplier.add_edge(graph: Dict[Tuple[int, int], List[Tuple[int, int, int]]], u: Tuple[int, int], v: Tuple[int, int], weight: int) → None

Add an edge to a graph. Called by loanpy.scapplier.mtx2graph.

Parameters:
  • graph (dict) – The graph to be populated

  • u (Tuple of two integers, e.g. (0, 0)) – Position of the starting node

  • v (Tuple of two integers, e.g. (0, 1)) – Position of the ending node

  • weight (int) – The weight of the edge connecting the two nodes

Returns:

Updates the graph in-place

Return type:

None

Run in Google Colab >>

>>> from loanpy.scapplier import add_edge
>>> graph = {'A': {'B': 3}}
>>> add_edge(graph, 'A', 'C', 7)
>>> graph
{'A': {'B': 3, 'C': 7}}
loanpy.scapplier.apply_edit(word: Iterable[str], editops: List[str]) → List[str]

Called by loanpy.scapplier.Adrc.repair_phonotactics. Applies a list of human readable edit operations to a string.

Parameters:
  • word (an iterable (e.g. list of phonemes, or string)) – The input word

  • editops (list or tuple of strings) – list of (human readable) edit operations

Returns:

transformed input word

Return type:

list of str

Run in Google Colab >>

>>> from loanpy.scapplier import apply_edit
>>> apply_edit(
...       ['f', 'ɛ', 'r', 'i', 'h', 'ɛ', 'ɟ'],
...       ('insert d',
...        'insert u',
...        'insert n',
...        'insert ɒ',
...        'insert p',
...        'substitute f by ɒ',
...        'delete ɛ',
...        'keep r',
...        'delete i',
...        'delete h',
...        'delete ɛ',
...        'substitute ɟ by t')
... )
['d', 'u', 'n', 'ɒ', 'p', 'ɒ', 'r', 't']
loanpy.scapplier.dijkstra(graph: Dict[Tuple[int, int], Dict[Tuple[int, int], int]], start: Tuple[int, int], end: Tuple[int, int]) → Optional[List[Tuple[int, int]]]

Find the shortest path between two nodes in a weighted graph using Dijkstra’s algorithm.

Dijkstra’s algorithm is an algorithm for finding the shortest path between two nodes in a weighted graph. It maintains a priority queue of nodes to be expanded and their tentative distances from the start node. The algorithm iteratively extracts the node with the minimum tentative distance from the priority queue and updates the tentative distances of its neighbors if a shorter path is found.

Parameters:
  • graph (dict) – A dictionary representing the weighted graph, where each key is a node and each value is a dictionary representing its neighbors and edge weights.

  • start (A tuple of two integers representing the node's position on the x and y axis.) – The starting node.

  • end (A tuple of two integers representing the node's position on the x and y axis.) – The ending node.

Returns:

The shortest path between the start and end nodes, represented as a list of nodes in the order they are visited, or None if no path exists.

Return type:

list or None

Raises:

KeyError – If the start or end node is not in the graph.

Run in Google Colab >>

>>> from loanpy.scapplier import dijkstra
>>> graph1 = {
...         'A': {'B': 1, 'C': 4},
...         'B': {'C': 2, 'D': 6},
...         'C': {'D': 3},
...         'D': {}
...     }
>>> dijkstra(graph1, 'A', 'D')
['A', 'B', 'C', 'D']
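
A minimal priority-queue implementation of the idea described above could look like this (a sketch for illustration, not loanpy's actual source):

>>> import heapq
>>> def dijkstra_sketch(graph, start, end):
...     queue, seen = [(0, start, [start])], set()
...     while queue:
...         dist, node, path = heapq.heappop(queue)  # node with min. tentative distance
...         if node == end:
...             return path
...         if node in seen:
...             continue
...         seen.add(node)
...         for neighbour, weight in graph[node].items():
...             heapq.heappush(queue, (dist + weight, neighbour, path + [neighbour]))
>>> dijkstra_sketch({'A': {'B': 1, 'C': 4}, 'B': {'C': 2, 'D': 6},
...                  'C': {'D': 3}, 'D': {}}, 'A', 'D')
['A', 'B', 'C', 'D']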
loanpy.scapplier.edit_distance_with2ops(string1: str, string2: str, w_del: Union[int, float] = 100, w_ins: Union[int, float] = 49) → Union[int, float]

Called by loanpy.scapplier.Adrc.get_closest_phonotactics. Takes two strings and calculates the distance between them, allowing only two operations: insertion and deletion. An algorithmic implementation of the “Threshold Principle” (Paradis and LaCharité 1997: 385).

Parameters:
  • string1 (str) – The first of two strings to be compared to each other

  • string2 (str) – The second of two strings to be compared to each other

  • w_del (int or float, default=100) – weight (cost) for deleting a phoneme. Default should always stay 100, since only relative costs between inserting and deleting count.

  • w_ins (int or float, default=49) – weight (cost) for inserting a phoneme. The default of 49 is in accordance with the “Threshold Principle”: 2 insertions (2*49=98) are cheaper than a deletion (100).

Returns:

The distance between two input strings

Return type:

int or float

Run in Google Colab >>

>>> from loanpy.scapplier import edit_distance_with2ops
>>> edit_distance_with2ops("rajka", "ajka", w_del=100, w_ins=49)
100
>>> edit_distance_with2ops("ajka", "rajka", w_del=100, w_ins=49)
49
>>> edit_distance_with2ops("Bécs", "Pécs", w_del=100, w_ins=49)
149
>>> edit_distance_with2ops("Hegyeshalom", "Mosonmagyaróvár", w_del=100, w_ins=49)
1388
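
Since only insertions and deletions are allowed, the distance boils down to deleting every character of string1 that is not part of the longest common subsequence (LCS) and inserting every character of string2 that is not; a sketch of that equivalence (for illustration, not loanpy's actual code):

>>> def distance_sketch(s1, s2, w_del=100, w_ins=49):
...     prev = [0] * (len(s2) + 1)  # dynamic-programming LCS length
...     for c1 in s1:
...         curr = [0]
...         for j, c2 in enumerate(s2):
...             curr.append(prev[j] + 1 if c1 == c2 else max(curr[j], prev[j + 1]))
...         prev = curr
...     lcs = prev[-1]
...     return w_del * (len(s1) - lcs) + w_ins * (len(s2) - lcs)
>>> distance_sketch("Bécs", "Pécs")  # delete "B" (100), insert "P" (49)
149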
loanpy.scapplier.get_mtx(target: Iterable, source: Iterable) → List[List[int]]

Called by loanpy.scapplier.Adrc.repair_phonotactics. Similar to loanpy.scapplier.edit_distance_with2ops, but without weights (i.e. deletion and insertion both always cost one), and the matrix is returned. Draws a matrix of minimum edit distances between every prefix of the two input strings.

Parameters:
  • target (iterable, e.g. str or list) – The target word

  • source (iterable, e.g. str or list) – The source word

Returns:

A matrix where every cell tells the cost of turning one prefix into the other (only deletion and insertion, with cost 1 each)

Return type:

list of lists

Run in Google Colab >>

>>> from loanpy.scapplier import get_mtx
>>> get_mtx("Bécs", "Pécs")
[[0, 1, 2, 3, 4],
 [1, 2, 3, 4, 5],
 [2, 3, 2, 3, 4],
 [3, 4, 3, 2, 3],
 [4, 5, 4, 3, 2]]
# What happens under the hood (deletion costs 1, insertion costs 1):
#        -  B  é  c  s    # "-" stands for the empty string
#     -  0  1  2  3  4    # e.g. distance ""-"Béc" = 3: three deletions
#     P  1  2  3  4  5    # distance "P"-"B" = 2: delete B, insert P
#     é  2  3  2  3  4    # distance "Pé"-"Bé" = 2
#     c  3  4  3  2  3
#     s  4  5  4  3  2    # min. edit distance "Pécs"-"Bécs" = 2:
#                         # delete B, insert P
loanpy.scapplier.list2regex(sclist: List[str]) → str

Called by loanpy.scapplier.Adrc.reconstruct. Turns a list of phonemes into a regular expression.

Parameters:

sclist (list of str) – a list of phonemes

Returns:

The phonemes from the input list separated by a pipe. “-” is removed and a question mark is appended at the end.

Return type:

str

Run in Google Colab >>

>>> from loanpy.scapplier import list2regex
>>> list2regex(["b", "k", "-", "v"])
'(b|k|v)?'
loanpy.scapplier.move_sc(sclistlist: List[List[str]], whichsound: int, out: List[List[str]]) → Tuple[List[List[str]], List[List[str]]]

Moves a sound correspondence from the input list to the output list and updates both lists.

Parameters:
  • sclistlist (list of lists) – A list of lists containing sound correspondences.

  • whichsound (int) – The index of the sound to be moved.

  • out (list of lists) – The output list where the sound correspondence will be moved to.

Returns:

An updated tuple containing the modified sclistlist and out.

Return type:

tuple of (list of lists, list of lists)

Run in Google Colab >>

>>> from loanpy.scapplier import move_sc
>>> move_sc([["x", "x"]], 0, [[]])
([['x']], [['x']])
>>> move_sc([["x", "x"], ["y", "y"], ["z"]], 1, [["a"], ["b"], ["c"]])
([['x', 'x'], ['y'], ['z']], [['a'], ['b', 'y'], ['c']])
loanpy.scapplier.mtx2graph(matrix: List[List[int]], w_del: int = 100, w_ins: int = 49) → Dict[Tuple[int, int], Dict[Tuple[int, int], int]]

Converts a distance-matrix to a weighted directed graph

Parameters:

matrix (A list of lists of integers) – The distance matrix, generated by loanpy.scapplier.get_mtx.

W_del:

Weight of deletions, i.e. moving downwards through the matrix. Set to 100 by default, since only the relative cost of insertions vs. deletions matters.

W_ins:

Weight of insertions, i.e. moving to the right through the matrix. Set to 49 by default so that, in accordance with the Threshold Principle of the Theory of Constraints and Repair Strategies (TCRS, Paradis and LaCharité 1997: 385), two insertions (2*49=98) are just cheaper than one deletion (100).

Returns:

A directed graph with weighted edges

Return type:

dictionary with tuples as keys and dictionaries as values. The value-dictionaries contain tuples as keys and weights (integers) as values. All tuples contain two integers that represent the position of a node in the matrix/graph, e.g. (0, 0).

Run in Google Colab >>

>>> from loanpy.scapplier import mtx2graph
>>> mtx2graph([[0, 1, 2], [1, 2, 3], [2, 3, 2]])
{(0, 0): {(0, 1): 100, (1, 0): 49},
 (0, 1): {(0, 2): 100, (1, 1): 49},
 (0, 2): {(1, 2): 49},
 (1, 0): {(1, 1): 100, (2, 0): 49},
 (1, 1): {(1, 2): 100, (2, 1): 49, (2, 2): 0},
 (1, 2): {(2, 2): 49},
 (2, 0): {(2, 1): 100},
 (2, 1): {(2, 2): 100},
 (2, 2): {}}
loanpy.scapplier.substitute_operations(operations: List[str]) → List[str]

Replaces consecutive “delete, insert” / “insert, delete” operation pairs with “substitute”. Called by loanpy.scapplier.tuples2editops.

Parameters:

operations (List of strings, e.g. ['insert l', 'delete h', 'keep ó']) – A list of human readable edit operations

Returns:

Updated operations

Return type:

List of strings, e.g. [‘substitute l by h’, ‘keep ó’]

Run in Google Colab >>

>>> from loanpy.scapplier import substitute_operations
>>> substitute_operations(['insert A', 'delete B', 'insert C'])
['substitute B by A', 'insert C']
>>> substitute_operations(['delete A', 'insert B', 'delete C', 'insert D'])
['substitute A by B', 'substitute C by D']
loanpy.scapplier.tuples2editops(op_list: List[Tuple[int, int]], s1: str, s2: str) → List[str]

Called by loanpy.scapplier.editops. The path through the graph by which string1 is converted to string2 is given in the form of tuples that contain the x and y coordinates of every step through the matrix-shaped graph. This function converts those numerical instructions into human-readable ones. The x values stand for movement from left to right, the y values for movement downwards. Movement downwards means deletion, movement to the right means insertion, and diagonal movement means the character is kept. Moving to the right and then downwards, or downwards and then to the right, means substitution.

Parameters:
  • op_list (list of tuples of 2 int) – The numeric list of edit operations

  • s1 (str) – The first of two strings to be compared to each other

  • s2 (str) – The second of two strings to be compared to each other

Returns:

list of human readable edit operations

Return type:

list of strings

Run in Google Colab >>

>>> from loanpy.scapplier import tuples2editops
>>> tuples2editops([(0, 0), (0, 1), (1, 1), (2, 2)], "ló", "hó")
['substitute l by h', 'keep ó']
>>> tuples2editops([(0, 0), (1, 1), (2, 2), (2, 3)], "lóh", "ló")
['keep l', 'keep ó', 'delete h']
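
Taken together, these helpers trace the cheapest way to turn one string into another. A sketch of how they plausibly chain (the wiring is inferred from the examples above, not quoted from loanpy's source):

>>> from loanpy.scapplier import get_mtx, mtx2graph, dijkstra, tuples2editops
>>> mtx = get_mtx("hó", "ló")  # target "hó", source "ló"
>>> graph = mtx2graph(mtx)
>>> end = (len(mtx) - 1, len(mtx[0]) - 1)
>>> tuples2editops(dijkstra(graph, (0, 0), end), "ló", "hó")
['substitute l by h', 'keep ó']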

Evaluate Sound Correspondence Applier

This module focuses on evaluating the quality of adapted and reconstructed words. It processes the input data, which consists of tokenised IPA source and target strings, as well as prosodic strings, and extracts and applies correspondences to predict the best possible adaptations or reconstructions. The module then calculates the accuracy of the predictions by counting the relative number of false positives (how many guesses) vs true positives. Overall, this module aims to facilitate a deeper understanding of loanword adaptation and historical sound change processes by quantifying the success rate of predictive models.
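
The resulting pairs can be read as points on an ROC-style curve: more guesses buy more true positives at the cost of more false positives. A minimal plotting sketch (matplotlib is an assumption here, not a loanpy dependency):

>>> import matplotlib.pyplot as plt
>>> fp_tp = [(0.33, 0.0), (0.67, 1.0), (1.0, 1.0)]  # output of eval_all, see below
>>> _ = plt.plot([fp for fp, tp in fp_tp], [tp for fp, tp in fp_tp], marker="o")
>>> _ = plt.xlabel("relative number of false positives (guesses)")
>>> _ = plt.ylabel("relative number of true positives")
>>> plt.show()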

loanpy.eval_sca.eval_all(intable: List[List[str]], heur: Dict[str, List[str]], adapt: bool, guess_list: List[int], pros: bool = False) → List[Tuple[float, float]]
  1. Input a loanpy-compatible table containing etymological data.

  2. Run a nested for-loop:

  3. The outer loop goes through the numbers of guesses (~ false positives).

  4. The inner loop performs leave-one-out cross-validation with loanpy.eval_sca.eval_one.

  5. The output is a list of tuples containing the relative number of true positives vs. the relative number of false positives.

Parameters:
  • intable (list of lists) – The input tsv-table. Space-separated tokenised IPA source and target strings must be in column “ALIGNMENT”, prosodic strings in column “PROSODY”.

  • heur (dict, str, or path-like object, optional) – The heuristic sound correspondences created with loanpy.scminer.get_heur, or the path to a json-file containing them (e.g. “heur.json”).

  • adapt (bool) – Set to True to make predictions with loanpy.scapplier.Adrc.adapt, set to False to make predictions with loanpy.scapplier.Adrc.reconstruct.

  • guess_list (list of int) – The list of number of guesses to evaluate.

  • pros (bool, default=False) – Whether phonotactic repairs should be applied

Returns:

A list of tuples of float-pairs representing the relative number of false positives vs. true positives

Return type:

list of tuples of floats

Run in Google Colab >>

>>> from loanpy.eval_sca import eval_all
>>> intable = [  ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...   ['0', '1', 'H', 'k i k i', 'VC'],
...   ['1', '1', 'EAH', 'k i g i', 'VCVCV'],
...   ['2', '2', 'H', 'i k k i', 'VCV'],
...   ['3', '2', 'EAH', 'i g k i', 'VCCVC']
... ]
>>>
>>> eval_all(intable, "", False, [1, 2, 3])
[(0.33, 0.0), (0.67, 1.0), (1.0, 1.0)]
loanpy.eval_sca.eval_one(intable: List[List[str]], heur: Dict[str, List[str]], adapt: bool, howmany: int, pros: bool = False) → float

Called by loanpy.eval_sca.eval_all. Loops through the loanpy-compatible etymological input table and performs leave-one-out cross-validation. The result is how many words were correctly predicted, relative to the total number of predictions made.

Parameters:
  • intable (list of lists) – The input tsv-table. Space-separated tokenised IPA source and target strings must be in column “ALIGNMENT”, prosodic strings in column “PROSODY”.

  • heur (dict, str, or path-like object, optional) – The heuristic sound correspondences created with loanpy.scminer.get_heur, or the path to a json-file containing them (e.g. “heur.json”).

  • adapt (bool) – Set to True to make predictions with loanpy.scapplier.Adrc.adapt, set to False to make predictions with loanpy.scapplier.Adrc.reconstruct.

  • howmany (int) – How many guesses should be made. Treated as the number of false positives.

  • pros (bool, default=False) – Whether phonotactic repairs should be applied

Returns:

The ratio of successful predictions (rounded to 2 decimal places).

Return type:

float

Run in Google Colab >>

>>> from loanpy.eval_sca import eval_one
>>> intable = [  # regular sound correspondences
...     ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...     ['0', '1', 'H', 'k i k i', 'VC'],
...     ['1', '1', 'EAH', 'g i g i', 'VCVCV'],
...     ['2', '2', 'H', 'i k k i', 'VCV'],
...     ['3', '2', 'EAH', 'i g g i', 'VCCVC']
... ]
>>> eval_one(intable, "", False, 1)
1.0

>>> intable = [  # not enough regular sound correspondences
...   ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...   ['0', '1', 'H', 'k i k i', 'VC'],
...   ['1', '1', 'EAH', 'g i g i', 'VCVCV'],
...   ['2', '2', 'H', 'b u b a', 'VCV'],
...   ['3', '2', 'EAH', 'p u p a', 'VCCVC']
... ]
>>> eval_one(intable, "", False, 1)
0.0

>>> intable = [  # irregular sound correspondences
...   ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...   ['0', '1', 'H', 'k i k i', 'VC'],
...   ['1', '1', 'EAH', 'k i g i', 'VCVCV'],
...   ['2', '2', 'H', 'i k k i', 'VCV'],
...   ['3', '2', 'EAH', 'i g k i', 'VCCVC']
... ]
>>> eval_one(intable, "", False, 1)
0.0

>>> intable = [  # irregular sound correspondences
...   ['ID', 'COGID', 'DOCULECT', 'ALIGNMENT', 'PROSODY'],
...   ['0', '1', 'H', 'k i k i', 'VC'],
...   ['1', '1', 'EAH', 'k i g i', 'VCVCV'],
...   ['2', '2', 'H', 'i k k i', 'VCV'],
...   ['3', '2', 'EAH', 'i g k i', 'VCCVC']
... ]
>>> eval_one(intable, "", False, 2)  # increase rate of false positives
1.0

Loan Finder

This module is designed to identify potential loanwords between a hypothesised donor and recipient language. It processes two input dataframes, one representing the donor language with predicted adapted forms and the other the recipient language with predicted reconstructions. The module first finds phonetic matches between the two languages and then calculates their semantic similarity. The output is a list of candidate loanwords, which can be further analysed manually.

The two functions in this module are responsible for finding phonetic matches between the given donor and recipient language data and calculating their semantic similarity. These functions process the input dataframes and compare the phonetic patterns, as well as calculate the semantic similarity based on a user-provided function. The module returns a list of candidate loanwords that show phonetic and semantic similarities. The output can then be used to propose lexical borrowings, adaptation patterns, and historical reconstructions for words of the proposed donor and recipient language.
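
The semantic-similarity function is supplied by the user: any callable that takes two strings and returns a number will do, so anything from a crude token-overlap score to word-vector cosine similarity can be plugged in. A toy example of the expected interface (for illustration only; a real analysis would use a proper semantic model):

>>> def get_semsim(meaning_a, meaning_b):
...     """Crude score: shared words relative to all words in both glosses."""
...     a, b = set(meaning_a.split()), set(meaning_b.split())
...     return len(a & b) / len(a | b)
>>> get_semsim("stew of meat", "meat soup")
0.25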

loanpy.loanfinder.phonetic_matches(df_rc: List[List[str]], df_ad: List[List[str]], output: Union[str, Path]) → None

Find phonetic matches between the given donor and recipient data.

The function processes the donor and recipient tables, compares the phonetic patterns, and writes the result as a tsv-file to the specified output path.

Parameters:
  • df_ad (list of lists. Column 2 (index 1) must be a foreign key, and Column 3 (index 2) a predicted loanword adaptation.) – Table of the donor language data with adapted forms.

  • df_rc (list of lists. Column 2 (index 1) must be a foreign key, and Column 3 (index 2) a predicted reconstruction, ideally a regular expression.) – Table of the recipient language data with reconstructed forms.

  • output (str or pathlike object) – The path to the output-file

Returns:

writes a tsv-file containing the matched data, with the following columns: ID – the primary key of the table, ID_rc – the foreign key of the reconstruction, ID_ad – the foreign key of the adaptation.

Return type:

None

Run in Google Colab >>

>>> from loanpy.loanfinder import phonetic_matches
>>> donor = [
...     ['a0', 'Donorese-0', 'igig'],
...     ['a1', 'Donorese-1', 'iggi']
... ]
>>> recipient = [
...     ['0', 'Recipientese-0', '^(i|u)(g)(g)(i|u)$'],
...     ['1', 'Recipientese-1', '^(i|u)(i|u)(g)(g)$']
... ]
>>> outpath = "examples/phonetic_matches.tsv"
>>> phonetic_matches(recipient, donor, outpath)
>>> with open(outpath, "r") as f:
...     print(f.read())
ID      ID_rc   ID_ad
0       Recipientese-0  Donorese-1
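
The comparison itself is presumably plain regular-expression matching of each adapted form against each reconstruction regex, which is why only Donorese-1 matches Recipientese-0 above; the check can be reproduced with the standard library:

>>> import re
>>> bool(re.match("^(i|u)(g)(g)(i|u)$", "iggi"))  # Donorese-1: match
True
>>> bool(re.match("^(i|u)(g)(g)(i|u)$", "igig"))  # Donorese-0: no match
False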
loanpy.loanfinder.semantic_matches(df_phonmatch: List[List[str]], get_semsim: Callable[[Any, Any], Union[float, int]], output: Union[str, Path], thresh: Union[int, float] = 0) → None

Calculate the semantic similarity between pairs of rows in the input data frame using the provided semantic-similarity scoring function, add that information to each row, and write the result to the provided output path.

Parameters:
  • df_phonmatch (list of lists of strings) – phonetic matches tsv, generated by loanpy.loanfinder.phonetic_matches. Each sublist represents a row of data. The first sublist should contain the header row, and each subsequent sublist should contain the data for one row. The meanings have to be in columns 4 and 5 (index 3 and 4).

  • get_semsim (function) – A function that calculates the semantic similarity between two strings.

  • output (str or pathlike object) – The path to the output-file

  • thresh (float, int) – The threshold above which semantic matches count

Returns:

writes a tsv-file representing semantic matches in the input table with the added column of semantic similarity.

Return type:

None

Run in Google Colab >>

>>> from loanpy.loanfinder import semantic_matches
>>> def getsemsim(x, y):
...     return 3
>>> phmtsv = [
...     ["ID", "ID_rc", "ID_ad"],
...     ["0", "Recipientese-0", "Donorese-1", "cat", "dog"],
... ]
>>> outpath = "examples/phonetic_matches.tsv"
>>> semantic_matches(phmtsv, getsemsim, outpath)
>>> with open(outpath, "r") as f:
...     print(f.read())
ID      ID_rc   ID_ad   semsim
0       Recipientese-0  Donorese-1      3

Utility Functions

Module focusing on functions to support generating and preprocessing loanpy-compatible input data.

This module contains functions for optimal year cutoffs, manipulating IPA data, and processing cognate sets. It provides helper functions for reading and processing linguistic datasets and performing various operations such as filtering and validation.

class loanpy.utils.IPA

Class built on loanpy’s modified version of panphon’s ipa_all.csv table to handle certain tasks that require IPA-data.

Run in Google Colab >>

>>> from loanpy.utils import IPA
>>> ipa = IPA()
>>> type(ipa.vowels)
<class 'list'>
>>> len(ipa.vowels)
1464
>>> ipa.vowels[0]
'ʋ̥'
get_clusters(segments: Iterable[str]) → str

Takes a list of phonemes and segments them into consonant and vowel clusters, like so: “abcdeaofgh” -> “a b.c.d e.a.o f.g.h”

Parameters:

segments (iterable) – A word, ideally as a list of IPA symbols

Returns:

Same word but with consonants and vowels clustered together

Return type:

str

Run in Google Colab >>

>>> from loanpy.utils import IPA
>>> ipa = IPA()
>>> ipa.get_clusters(["r", "a", "u", "f", "l"])
'r a.u f.l'
get_cv(ipastr: str) → str

This method takes an IPA string as input and returns either “V” if the string is a vowel or “C” if it is a consonant.

Parameters:

ipastr (str) – An IPA string representing a phonetic character.

Returns:

A string “V” if the input IPA string is a vowel, or “C” if it is a consonant.

Return type:

str

Run in Google Colab >>

>>> from loanpy.utils import IPA
>>> ipa = IPA()
>>> ipa.get_cv("p")
'C'
>>> ipa.get_cv("u")
'V'
get_prosody(ipastr: str) → str

Generate a prosodic string from an IPA string.

This function takes an IPA string as input and generates a prosodic string by classifying each phoneme as a vowel (V) or consonant (C).

Parameters:

ipastr (str) – The tokenised input IPA string. Phonemes must be separated by space or dot.

Returns:

The generated prosodic string.

Return type:

str

Run in Google Colab >>

>>> from loanpy.utils import IPA
>>> ipa = IPA()
>>> ipa.get_prosody("l o l")
'CVC'
>>> ipa.get_prosody("r o f.l")
'CVCC'
loanpy.utils.cvgaps(str1: str, str2: str) → List[str]

Replace gaps in the first input string based on the second input string.

This function takes two aligned strings, replaces “-” in the first string with either “C” (consonant) or “V” (vowel) depending on the corresponding character in the second string, and returns the new strings as a list.

Parameters:
  • str1 (str) – The first aligned input string.

  • str2 (str) – The second aligned input string.

Returns:

A list containing the modified first string and the unchanged second string.

Return type:

list of strings

Run in Google Colab >>

>>> from loanpy.utils import cvgaps
>>> cvgaps("b l -", "b l a")
['b l V', 'b l a']
>>> cvgaps("b - a", "b l a")
['b C a', 'b l a']
loanpy.utils.find_optimal_year_cutoff(tsv: List[List[str]], origins: Iterable) → int

Determine the optimal year cutoff for a given dataset and origins.

This function reads TSV content from a given dataset, calculates the accumulated count of words with the specified origins up to each attested year, and finds the optimal year cutoff using the Euclidean distance to the upper left corner of a coordinate system in which the relative increase of years is on the x-axis and the relative increase of the cumulative number of words is on the y-axis.

Parameters:
  • tsv (list of list of strings) – A table where the first row is the header

  • origins (a set of strings) – A set of origins to be considered for counting words.

Returns:

The optimal year cutoff for the dataset and origins.

Return type:

int

Run in Google Colab >>

>>> from loanpy.utils import find_optimal_year_cutoff
>>> tsv = [
...     ['form', 'sense', 'Year', 'Etymology', 'Loan'],
...     ['gulyás', 'goulash, Hungarian stew', '1800', 'unknown', ''],
...     ['Tisza', 'a major river in Hungary', '1230', 'uncertain', ''],
...     ['Pest', 'part of Budapest, the capital', '1241', 'Slavic', 'True'],
...     ['paprika', 'ground red pepper, spice', '1598', 'Slavic', 'True']
... ]
>>> find_optimal_year_cutoff(tsv, "Slavic")
1241
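
The selection criterion can be illustrated as follows: years and cumulative word counts are both rescaled to [0, 1], and the point closest to the upper left corner (0, 1) wins, i.e. an early cutoff that already covers many words of the given origins. A toy sketch with made-up normalised points (for illustration, not loanpy's actual code):

>>> candidates = [(0.0, 0.33), (0.02, 0.67), (0.64, 1.0), (1.0, 0.33)]  # (year, count)
>>> min(candidates, key=lambda p: (p[0] ** 2 + (1 - p[1]) ** 2) ** 0.5)
(0.02, 0.67)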
loanpy.utils.is_same_length_alignments(data: List[List[str]]) → bool

Check if alignments within a cognate set have the same length.

This function iterates over the input data and checks whether the alignments within each cognate set have the same length. Alignments are expected to be in column 4 (index 3).

Parameters:

data (list of list of strings) – A list of lists containing language data. No header.

Returns:

True if all alignments within each cognate set have the same length, False otherwise.

Return type:

bool

Run in Google Colab >>

>>> from loanpy.utils import is_same_length_alignments
>>> is_same_length_alignments([[0, 1, 2, "a - c", 4, 5], [0, 1, 2, "d e f", 4, 5]])
True
>>> is_same_length_alignments([[0, 1, 2, "a b c", 4, 5], [0, 1, 2, "d e", 4, 5]])
2023-04-25 23:08:05,042 - INFO - 0
['a', '-', 'c']
['d', 'e']
False
loanpy.utils.is_valid_language_sequence(data: List[List[str]], source_lang: str, target_lang: str) → bool

Validate if the data has a valid alternating sequence of source and target language.

The data is expected to have language IDs in the third column (index 2). The sequence should be: source_lang, target_lang, source_lang, target_lang, …

Parameters:
  • data (list) – A list of lists containing language data. No header.

  • source_lang (str) – The expected source language ID.

  • target_lang (str) – The expected target language ID.

Returns:

True if the sequence is valid, False otherwise.

Return type:

bool

Run in Google Colab >>

>>> from loanpy.utils import is_valid_language_sequence
>>> data = [
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x']
... ]
>>> is_valid_language_sequence(data, "de", "en")
True
>>> from loanpy.utils import is_valid_language_sequence
>>> data = [
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
...     ['x', 'x', 'nl', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x']
... ]
>>> is_valid_language_sequence(data, "de", "en")
2023-04-25 23:04:07,532 - INFO - Problem in row 5
False
loanpy.utils.modify_ipa_all(input_file: Union[str, Path], output_file: Union[str, Path]) → None

The original file is ipa_all.csv from the data folder of panphon 0.20.0 and was copied with the permission of its author. The ipa_all.csv table of loanpy was created with this function. The following modifications are made:

  1. All + signs are replaced by 1, all - signs by -1

  2. Two phonemes are appended to the column ipa, namely “C”, and “V”, meaning “any consonant”, and “any vowel”.

  3. Any phoneme containing “j”, “w”, or “ʔ” is redefined as a consonant.

Parameters:
  • input_file (A string or a path-like object) – The path to the file ipa_all.csv.

  • output_file (A string or a path-like object) – The name and path of the new csv-file that is to be written.

Returns:

Write new file

Return type:

None
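
A rough sketch of what these three modifications amount to, using the standard library's csv module (illustrative only; the all-zero feature values assigned to “C” and “V” here are placeholders, and loanpy's actual implementation may differ):

>>> import csv
>>> def modify_sketch(input_file, output_file):
...     with open(input_file, newline="") as f:
...         rows = [[{"+": "1", "-": "-1"}.get(cell, cell) for cell in row]
...                 for row in csv.reader(f)]  # 1. replace + and - signs
...     width = len(rows[0]) - 1
...     rows += [["C"] + ["0"] * width, ["V"] + ["0"] * width]  # 2. add "any C"/"any V"
...     cons = rows[0].index("cons")
...     for row in rows[1:]:
...         if any(glide in row[0] for glide in "jwʔ"):  # 3. redefine as consonant
...             row[cons] = "1"
...     with open(output_file, "w", newline="") as f:
...         csv.writer(f).writerows(rows)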

loanpy.utils.prefilter(data: List[List[str]], srclg: str, tgtlg: str) → List[List[str]]

Filter dataset to keep only cognate sets where both source and target languages occur.

This function filters the input dataset to retain only the cognate sets where both source and target languages are present. The filtered dataset is then sorted based on cognate set ID and language ID.

Parameters:
  • data (list of list of strings) – A list of lists containing language data. Columns Language_ID and Cognacy must be provided.

  • srclg (str) – The source language ID to be considered.

  • tgtlg (str) – The target language ID to be considered.

Returns:

A filtered and sorted list of lists containing cognate sets with both source and target languages present.

Return type:

list of list of strings

Run in Google Colab >>

>>> from loanpy.utils import prefilter
>>> data = [
...     ['x', 'x', 'Language_ID', 'x', 'x', 'x', 'x', 'x', 'x', 'Cognacy', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '2', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '3', 'x'],
...     ['x', 'x', 'nl', 'x', 'x', 'x', 'x', 'x', 'x', '4', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '4', 'x'],
...     ['x', 'x', 'nl', 'x', 'x', 'x', 'x', 'x', 'x', '5', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '5', 'x'],
...     ['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
...     ['x', 'x', 'nl', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
...     ['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x']
... ]
>>> prefilter(data, "de", "en")
[['x', 'x', 'Language_ID', 'x', 'x', 'x', 'x', 'x', 'x', 'Cognacy', 'x'],
['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '0', 'x'],
['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '1', 'x'],
['x', 'x', 'de', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x'],
['x', 'x', 'en', 'x', 'x', 'x', 'x', 'x', 'x', '6', 'x']]
loanpy.utils.prod(iterable: Iterable[Union[int, float]]) → Union[int, float]

Calculate the product of all elements in an iterable.

This function takes an iterable (e.g., a list or tuple) as input and computes the product of all its elements. It had to be hard-coded because “from math import prod” caused incompatibility issues with some Python versions on certain platforms.

Parameters:

iterable (Iterable[int] or Iterable[float]) – The input iterable containing numbers.

Returns:

The product of all elements in the input iterable.

Return type:

int or float

Run in Google Colab >>

>>> from loanpy.utils import prod
>>> prod([1, 2, 3])  # one times two times three
6
loanpy.utils.read_ipa_all() → List[List[str]]

This function reads the ipa_all.csv table located in the same directory as the loanpy-modules and returns it as a list of lists.

Returns:

A list of lists containing IPA data read from ipa_all.csv.

Return type:

list of list of strings

Run in Google Colab >>

>>> from loanpy.utils import read_ipa_all
>>> ipa_all = read_ipa_all()
>>> type(ipa_all)
<class 'list'>
>>> len(ipa_all)
6492
>>> ipa_all[:2]
[['ipa', 'syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas',
'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo',
'back', 'round', 'velaric', 'tense', 'long', 'hitone', 'hireg'],
['˩', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0', '0', '-1', '-1']]
loanpy.utils.scjson2tsv(jsonin: Union[str, Path], outtsv: Union[str, Path], outtsv_phonotactics: Union[str, Path]) → None

Turn a computer-readable sound correspondence json-file into a human-readable tab-separated values file (tsv).

  1. read json

  2. put information into columns

  3. write file

Parameters:
  • jsonin (str or path-like object) – The name of the json-file containing the sound correspondences to be converted

  • outtsv (str or path-like object) – The name of the output file containing the sound correspondences. Should end in “.tsv”.

  • outtsv_phonotactics (str or path-like object) – The name of the output file containing the phonotactic (=prosodic) correspondences. Should end in “.tsv”.

Returns:

Write two tsv-files to the specified two output paths

Return type:

None

Run in Google Colab >>

>>> from loanpy.utils import scjson2tsv
>>> scjson2tsv("sc.json", "sc.tsv", "sc_p.tsv")
>>> with open("sc.tsv", "r") as f:
...     print(f.read())
sc      src     tgt     freq    CogID
a o     a       o       1       512
a e     a       e       2       3, 4
>>> with open("sc_p.tsv", "r") as f:
...     print(f.read())
sc      src     tgt     freq    CogID
CV CV   CV      CV      1       7