method for calculating term frequencies
fixes #28
jtauber committed Feb 7, 2024
1 parent d675603 commit 6a5a78d
Showing 3 changed files with 33 additions and 1 deletion.
18 changes: 17 additions & 1 deletion README.md
@@ -116,7 +116,6 @@ Entire lists of tokens can be added for a particular address in one go using `ad

```


You can **prune** an HTDM to just `n` levels with the method `prune(n)`.

You can iterate over the document-term counts at the leaves of the HTDM with the method `leaf_entries()` (this returns a generator yielding `(document_address, term, count)` tuples). This is effectively a traditional TDM (the document IDs will still reflect the hierarchy but the aggregate counts aren't present).
@@ -133,6 +132,23 @@ You can deep copy an HTDM with `copy()`. You can also pass a prefix to `copy()`

You can save out an HTDM with `save()` which takes a `filename` and optional `field_sep` (defaulting to tab) and `prefix` (if you just want to save out a subtree).

### Calculations

You can get a term frequency with `tf(term)` or `tf(term, address)`.

```python
>>> c = termdoc.HTDM()
>>> c.increment_count("1", "foo")
>>> c.increment_count("1", "bar", 3)
>>> c.increment_count("2", "foo", 3)
>>> c.increment_count("2", "bar")
>>> c.tf("foo")
0.5
>>> c.tf("foo", "2")
0.75

```
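The arithmetic behind those two results can be sketched without termdoc at all. This is only an illustration under stated assumptions: `leaf_counts` and `tf_sketch` are hypothetical stand-ins, not part of the library, and the sketch assumes that an empty address means "all documents aggregated".

```python
from collections import Counter

# Hypothetical stand-in for the per-document counts built in the example above.
leaf_counts = {
    "1": Counter({"foo": 1, "bar": 3}),
    "2": Counter({"foo": 3, "bar": 1}),
}


def tf_sketch(term, address=""):
    """Return count(term) / total count at `address` ("" aggregates every document)."""
    if address == "":
        # Add the per-document Counters together to get the root-level counts.
        counts = sum(leaf_counts.values(), Counter())
    else:
        counts = leaf_counts[address]
    return counts[term] / sum(counts.values())


print(tf_sketch("foo"))       # 0.5  (4 of the 8 tokens overall)
print(tf_sketch("foo", "2"))  # 0.75 (3 of the 4 tokens in document "2")
```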

### Duplicates Policy

You can optionally pass in a `duplicates` setting to the constructor indicating the policy you want to follow if a term-document count is updated more than once.
3 changes: 3 additions & 0 deletions termdoc/htdm.py
@@ -97,3 +97,6 @@ def copy(self, prefix=None):
        for document, term, count in self.leaf_entries(prefix):
            new.increment_count(document, term, count)
        return new

    def tf(self, term, address=""):
        return self.get_counts(address)[term] / self.get_counts(address).total()
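One case this one-liner appears to leave open is an address with no counts, where the division would raise `ZeroDivisionError`. The following is a hedged sketch of a guarded variant, not the library's behaviour: it is written as a standalone function over a plain `collections.Counter` (on the assumption that `get_counts()` returns one, which the use of `.total()` suggests), and the name `safe_tf` and the choice to return `0.0` are both assumptions. It also uses `sum(counts.values())` because `Counter.total()` is only available from Python 3.10.

```python
from collections import Counter


def safe_tf(counts, term):
    """Term frequency over a Counter, returning 0.0 when there are no counts.

    Hypothetical helper for illustration; not part of termdoc.
    """
    total = sum(counts.values())  # equivalent to counts.total() on Python 3.10+
    if total == 0:
        return 0.0
    return counts[term] / total


print(safe_tf(Counter({"foo": 3, "bar": 1}), "foo"))  # 0.75
print(safe_tf(Counter(), "foo"))                      # 0.0
```

Whether an empty address should yield `0.0`, `None`, or an exception is a design choice the commit does not state.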
13 changes: 13 additions & 0 deletions tests.py
@@ -347,5 +347,18 @@ def test_add(self):
        self.assertEqual(c.get_counts("1")["bar"], 2)


class Test4(unittest.TestCase):
    def test_term_frequency(self):
        import termdoc

        c = termdoc.HTDM()
        c.increment_count("1", "foo")
        c.increment_count("1", "bar", 3)
        c.increment_count("2", "foo", 3)
        c.increment_count("2", "bar")
        self.assertEqual(c.tf("foo"), 0.5)
        self.assertEqual(c.tf("foo", "2"), 0.75)


if __name__ == "__main__":
    unittest.main()
