
Commit 7c0698a

feat(algorithms, dp, memoization): word break dp with memoization

1 parent 09786c0 commit 7c0698a

File tree

3 files changed: +297 −12 lines changed

algorithms/dynamic_programming/word_break/README.md

Lines changed: 169 additions & 0 deletions
@@ -233,4 +233,173 @@ the `dp` array.
### Dynamic Programming - Memoization

We can improve the efficiency of the backtracking method by using memoization, which stores the results of subproblems
to avoid recalculating them.

We use a depth-first search (DFS) function that recursively breaks the string into words. However, before performing a
recursive call, we check whether the results for the current substring have already been computed and stored in a
memoization map (typically a dictionary or hash table).

If the results for the current substring are found in the memoization map, we can return them directly without further
computation. If not, we proceed with the recursive call, computing the results and storing them in the memoization map
before returning them.

By memoizing the results, we ensure that each substring is processed only once, which reduces the number of
computations in the average case.
#### Algorithm

1. Convert the `wordDict` array into an unordered set `wordSet` for efficient lookups.
2. Initialize an empty unordered map `memoization` to store the results of subproblems.
3. Call the `dfs` function with the input string `s`, `wordSet`, and `memoization`.
   - Check if the results for the current `remainingStr` (the remaining part of the string to be processed) are
     already in `memoization`. If so, return them.
   - Base case: if `remainingStr` is empty, all characters have been processed. An empty string represents a valid
     sentence, so return an array containing the empty string.
   - Initialize an empty array `results`.
   - Iterate `i` from 1 to the length of `remainingStr`:
     - Extract the substring `currentWord` from 0 to `i` to check whether it is a valid word.
     - If `currentWord` is found in `wordSet`:
       - Recursively call `dfs` with `remainingStr.substr(i)`, `wordSet`, and `memoization`.
       - Append `currentWord` and the recursive results to `results` (with a space if needed) to form valid sentences.
   - Store the `results` for `remainingStr` in `memoization`.
   - Return `results`.
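The steps above can be sketched in Python; the helper name `word_break_memo` is illustrative, while the committed `word_break_dp_memoization` in this change follows the same structure:

```python
from typing import Dict, List, Set


def word_break_memo(s: str, word_dict: List[str]) -> List[str]:
    word_set: Set[str] = set(word_dict)
    memo: Dict[str, List[str]] = {}

    def dfs(remaining: str) -> List[str]:
        # Step 3a: reuse the memoized result for this substring if present.
        if remaining in memo:
            return memo[remaining]
        # Step 3b (base case): an empty remainder represents one complete sentence.
        if not remaining:
            return [""]
        results: List[str] = []
        # Step 3d: try every prefix of the remaining string as the next word.
        for i in range(1, len(remaining) + 1):
            current_word = remaining[:i]
            if current_word in word_set:
                for rest in dfs(remaining[i:]):
                    results.append(current_word + (" " + rest if rest else ""))
        # Step 3e: memoize before returning.
        memo[remaining] = results
        return results

    return dfs(s)
```

For example, `word_break_memo("catsanddog", ["cat", "cats", "and", "sand", "dog"])` produces both valid sentences, while an unbreakable input yields an empty list.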
#### Complexity

Let n be the length of the input string.

##### Time complexity: O(n⋅2^n)

While memoization avoids redundant computations, it does not change the overall number of subproblems that need to be
solved. In the worst case, there are still 2^n possible partitions of the string to explore, leading to an exponential
time complexity. For each partition, O(n) work is performed, so the overall complexity is O(n⋅2^n).

##### Space complexity: O(n⋅2^n)

The recursion stack can grow up to a depth of n, where each recursive call consumes additional space for storing the
current state.

The memoization map stores the results for each suffix of the string, and in the worst case those results contain up
to 2^n sentences of size n, resulting in an exponential space complexity.
### Trie Optimization

While the previous approaches focus on optimizing the search and computation process, we can also consider leveraging
efficient data structures to enhance the word lookup process. This leads us to the trie-based approach, which uses a
trie data structure to store the word dictionary, allowing efficient word lookup and prefix matching.

The trie, also known as a prefix tree, is a tree-based data structure where each node represents a character in a word,
and the path from the root to a leaf node represents a complete word. This structure is particularly useful for problems
involving word segmentation because it allows for efficient prefix matching.

Here, we first build a trie from the dictionary words. Each word is represented as a path in the trie, where each node
corresponds to a character in the word.

By using the trie, we can quickly determine whether a substring can form a valid word without having to perform linear
searches or set lookups. This reduces the search space and improves the efficiency of the algorithm.

In this approach, instead of recursively exploring the remaining substring and using memoization, we iterate from the
end of the input string to the beginning (in reverse order). For each starting index (`startIdx`), we attempt to find
valid sentences that can be formed from that index by iterating through the string and checking if the current
substring forms a valid word using the trie data structure.

When a valid word is encountered in the trie, we append it to the list of valid sentences for the current starting
index. If the current valid word is not the last word in the sentence, we combine it with the valid sentences formed
from the next index (`endIdx + 1`), which are retrieved from the `dp` dictionary.

The valid sentences for each starting index are stored in the `dp` dictionary, ensuring that previously computed
results are reused. By using tabulation and storing the valid sentences for each starting index, we avoid redundant
computations and achieve significant time and space efficiency improvements compared to the standard backtracking
method with memoization.

The trie-based approach offers advantages in terms of efficient word lookup and prefix matching, making it particularly
suitable for problems involving word segmentation or string manipulation. However, it comes with the additional
overhead of constructing and maintaining the trie data structure, which can be more memory-intensive for large
dictionaries.
#### Algorithm

##### Initialize TrieNode Structure

- Each `TrieNode` has two properties:
  - `isEnd`: A boolean value indicating if the node marks the end of a word.
  - `children`: An array of size 26 (for lowercase English letters) to store pointers to child nodes.
- The constructor initializes `isEnd` to false and all elements in `children` to null.

##### Trie Class

- The `Trie` class has a root pointer of type `TrieNode`.
- The constructor initializes the root with a new `TrieNode` object.
- The `insert` function:
  - Takes a string `word` as input.
  - Starts from the root node.
  - For each character `c` in the word:
    - Calculate the index corresponding to the character.
    - If the child node at the calculated index doesn't exist, create a new `TrieNode` and assign it to that index.
    - Move to the child node.
  - After processing all characters, mark the current node's `isEnd` as true.
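A minimal Python sketch of this structure (using snake_case `is_end` for the `isEnd` flag described above):

```python
from typing import List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.is_end: bool = False
        # One slot per lowercase English letter, initially empty.
        self.children: List[Optional["TrieNode"]] = [None] * 26


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for c in word:
            # Map 'a'..'z' to index 0..25.
            idx = ord(c) - ord("a")
            if node.children[idx] is None:
                node.children[idx] = TrieNode()
            node = node.children[idx]
        # Mark the final node as the end of a complete word.
        node.is_end = True
```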
##### `wordBreak` Function

- Create a `Trie` object.
- Insert all words from `wordDict` into the trie using the `insert` function.
- Initialize a map `dp` to store the results of subproblems.
- Iterate from the end of the string `s` to the beginning (in reverse order).
- For each starting index `startIdx`:
  - Initialize a vector `validSentences` to store valid sentences starting from `startIdx`.
  - Initialize a `current_node` pointer to the root of the trie.
  - Iterate from `startIdx` to the end of the string.
  - For each character `c` in the string:
    - Calculate the index corresponding to `c`.
    - Check if the child node at the calculated index exists in the trie.
    - If the child node doesn't exist, break out of the inner loop. This means that the current substring cannot form
      a valid word, so there is no need to continue checking the remaining characters.
    - Move to the child node.
    - Check if the current node's `isEnd` is true, indicating a valid word.
    - If a valid word is found:
      - Extract the current word from the string using `substr`.
      - If it's the last word in the sentence (`endIdx` is the last index):
        - Add the current word to `validSentences`.
      - If it's not the last word:
        - Retrieve the valid sentences formed by the remaining substring from `dp[endIdx + 1]`.
        - Combine the current word with each sentence and add it to `validSentences`.
  - Store the `validSentences` for the current `startIdx` in `dp`.
- Return the valid sentences stored in `dp[0]`, which represents the valid sentences formed from the entire string.
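Combining the trie build with the reverse tabulation described above, a self-contained Python sketch might look like the following (an illustrative reconstruction, not the committed source):

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.is_end = False
        self.children: List[Optional["TrieNode"]] = [None] * 26


def word_break_trie(s: str, word_dict: List[str]) -> List[str]:
    # Build the trie: every dictionary word becomes a path of nodes.
    root = TrieNode()
    for word in word_dict:
        node = root
        for c in word:
            idx = ord(c) - ord("a")
            if node.children[idx] is None:
                node.children[idx] = TrieNode()
            node = node.children[idx]
        node.is_end = True

    # dp[start_idx] holds every valid sentence for the suffix s[start_idx:].
    dp: Dict[int, List[str]] = {}
    for start_idx in range(len(s) - 1, -1, -1):
        valid_sentences: List[str] = []
        node = root
        for end_idx in range(start_idx, len(s)):
            idx = ord(s[end_idx]) - ord("a")
            if node.children[idx] is None:
                break  # no dictionary word continues with this prefix
            node = node.children[idx]
            if node.is_end:
                word = s[start_idx : end_idx + 1]
                if end_idx == len(s) - 1:
                    # Last word in the sentence.
                    valid_sentences.append(word)
                else:
                    # Combine with the sentences already tabulated for the rest.
                    for rest in dp[end_idx + 1]:
                        valid_sentences.append(word + " " + rest)
        dp[start_idx] = valid_sentences

    return dp.get(0, [])
```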
#### Complexity Analysis

Let n be the length of the input string.

##### Time complexity: O(n⋅2^n)

Even though the trie-based approach uses an efficient data structure for word lookup, it still needs to explore all
possible ways to break the string into words. In the worst case, there are 2^n unique possible partitions, leading to
an exponential time complexity. O(n) work is performed for each partition, so the overall complexity is O(n⋅2^n).

##### Space complexity: O(n⋅2^n)

The trie data structure itself can have a maximum of 2^n nodes in the worst case, where each character in the string
represents a separate word. Additionally, the tabulation map used in this approach can also store up to 2^n strings of
size n, resulting in an overall exponential space complexity.
----

### Further Thoughts on Complexity Analysis

The complexity of this problem cannot be reduced below O(n⋅2^n); the worst-case scenario will still be O(n⋅2^n).
However, using dynamic programming (DP) makes it somewhat more efficient than backtracking overall, as the test case
below shows.

Consider the input "aaaaaa", with wordDict = ["a", "aa", "aaa", "aaaa", "aaaaa", "aaaaaa"].

Every possible partition is a valid sentence, and there are 2^(n−1) such partitions. The algorithms cannot perform
better than this since they must generate all valid sentences. The cost of iterating over cached results will be
exponential, as every possible partition will be cached, resulting in the same runtime as regular backtracking.
Likewise, the space complexity will also be O(n⋅2^n) for the same reason: every partition is stored in memory.
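To make the 2^(n−1) count concrete, here is a small hypothetical helper (not part of the commit) that memoizes only the number of valid sentences per suffix rather than the sentences themselves:

```python
from typing import Dict, List


def count_sentences(s: str, words: List[str]) -> int:
    # Count valid sentences, memoizing the count for each remaining suffix.
    word_set = set(words)
    memo: Dict[str, int] = {}

    def dfs(remaining: str) -> int:
        if remaining in memo:
            return memo[remaining]
        if not remaining:
            return 1  # the empty remainder completes one sentence
        total = sum(
            dfs(remaining[i:])
            for i in range(1, len(remaining) + 1)
            if remaining[:i] in word_set
        )
        memo[remaining] = total
        return total

    return dfs(s)


# Every composition of the 6 characters is valid: 2^(6-1) = 32 partitions.
print(count_sentences("aaaaaa", ["a", "aa", "aaa", "aaaa", "aaaaa", "aaaaaa"]))  # → 32
```

Note that counting alone is cheap (each of the n suffixes is solved once); it is the requirement to *enumerate* every sentence that forces the exponential output size.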
Another way to explain why the worst-case complexity is O(n⋅2^n) for all the algorithms is that, given a string of
length n, there are n+1 intervals at which it can be partitioned into two parts. Each interval has two choices: to
split or not to split. In the worst case, we have to check all possibilities, which results in a time complexity of
O(n⋅2^(n+1)), which simplifies to O(n⋅2^n). This analysis is very similar to palindrome partitioning.

Overall, this question is interesting because of the nature of this complexity. In an interview setting, if an
interviewer asks this question, the most expected solutions would be backtracking and the trie, as they are natural
choices for the required conditions and outputs.

algorithms/dynamic_programming/word_break/__init__.py

Lines changed: 66 additions & 6 deletions
@@ -66,7 +66,7 @@ def word_break_trie(s: str, word_dict: List[str]) -> List[str]:
     return results.get(0, [])
 
 
-def word_break_dp(s: str, word_dict: List[str]) -> List[str]:
+def word_break_dp_tabulation(s: str, word_dict: List[str]) -> List[str]:
     """
     This adds spaces to s to break it up into a sequence of valid words from word_dict.
 
@@ -112,7 +112,7 @@ def word_break_dp(s: str, word_dict: List[str]) -> List[str]:
     return dp[len(s)]
 
 
-def word_break_dp_2(s: str, word_dict: List[str]) -> List[str]:
+def word_break_dp_tabulation_2(s: str, word_dict: List[str]) -> List[str]:
     """
     This adds spaces to s to break it up into a sequence of valid words from word_dict.
 
@@ -160,6 +160,62 @@ def word_break_dp_2(s: str, word_dict: List[str]) -> List[str]:
     return dp.get(0, [])
 
 
+def word_break_dp_memoization(s: str, word_dict: List[str]) -> List[str]:
+    """
+    This adds spaces to s to break it up into a sequence of valid words from word_dict.
+
+    This uses dynamic programming with memoization to store the words in the dictionary and a map to store the results
+    of subproblems.
+
+    Complexity:
+        Time: O(n*2^n): where n is the length of the string
+        Space: O(n*2^n): where n is the length of the string
+
+    Args:
+        s: The input string
+        word_dict: The dictionary of words
+    Returns:
+        List of valid sentences
+    """
+    word_set: Set[str] = set(word_dict)
+    memoization: Dict[str, List[str]] = dict()
+
+    def dfs(remaining_str: str, words_set: Set[str], memo: Dict) -> List[str]:
+        """
+        Depth-first search to find all possible word combinations
+        Args:
+            remaining_str(str): the remaining string to search through
+            words_set(set): set of dictionary words to use to construct sentences
+            memo(dict): dictionary to improve computation of already processed words
+        Returns:
+            list: possible word combinations
+        """
+        # check if the result for this substring is already memoized
+        if remaining_str in memo:
+            return memo[remaining_str]
+
+        # base case: when the string is empty, return a list containing an empty string
+        if not remaining_str:
+            return [""]
+
+        results = []
+        for i in range(1, len(remaining_str) + 1):
+            current_word = remaining_str[:i]
+            # if the current substring is a valid word in the word set
+            if current_word in words_set:
+                for next_word in dfs(remaining_str[i:], words_set, memo):
+                    # append current word and next word
+                    results.append(
+                        f"{current_word}{' ' + next_word if next_word else ''}"
+                    )
+
+        # memoize the results for the current substring
+        memo[remaining_str] = results
+        return results
+
+    return dfs(s, word_set, memoization)
+
+
 def word_break_backtrack(s: str, word_dict: List[str]) -> List[str]:
     """
     This adds spaces to s to break it up into a sequence of valid words from word_dict.
@@ -176,7 +232,13 @@ def word_break_backtrack(s: str, word_dict: List[str]) -> List[str]:
     word_set = set(word_dict)
     results = []
 
-    def backtrack(sentence: str, words_set: Set[str], current_sentence: List[str], result: List[str], start_index: int):
+    def backtrack(
+        sentence: str,
+        words_set: Set[str],
+        current_sentence: List[str],
+        result: List[str],
+        start_index: int,
+    ):
         # If we've reached the end of the string, add the current sentence to results
         if start_index == len(sentence):
             result.append(" ".join(current_sentence))
@@ -189,9 +251,7 @@ def word_break_backtrack(s: str, word_dict: List[str]) -> List[str]:
             if word in words_set:
                 current_sentence.append(word)
                 # Recursively call backtrack with the new end index
-                backtrack(
-                    sentence, words_set, current_sentence, result, end_index
-                )
+                backtrack(sentence, words_set, current_sentence, result, end_index)
                 # Remove the last word to backtrack
                 current_sentence.pop()

algorithms/dynamic_programming/word_break/test_word_break.py

Lines changed: 62 additions & 6 deletions
@@ -1,7 +1,13 @@
 import unittest
 from typing import List
 from parameterized import parameterized
-from algorithms.dynamic_programming.word_break import word_break_trie, word_break_dp, word_break_dp_2, word_break_backtrack
+from algorithms.dynamic_programming.word_break import (
+    word_break_trie,
+    word_break_dp_tabulation,
+    word_break_dp_tabulation_2,
+    word_break_backtrack,
+    word_break_dp_memoization,
+)
 
 
 class WordBreakTestCases(unittest.TestCase):
@@ -83,8 +89,10 @@ def test_word_break_trie(self, s: str, word_dict: List[str], expected: List[str]
             ("catsandog", ["cats", "dog", "sand", "and", "cat"], []),
         ]
     )
-    def test_word_break_dp(self, s: str, word_dict: List[str], expected: List[str]):
-        actual = word_break_dp(s, word_dict)
+    def test_word_break_dp_tabulation(
+        self, s: str, word_dict: List[str], expected: List[str]
+    ):
+        actual = word_break_dp_tabulation(s, word_dict)
         actual.sort()
         expected.sort()
         self.assertListEqual(expected, actual)
@@ -125,8 +133,10 @@ def test_word_break_dp(self, s: str, word_dict: List[str], expected: List[str]):
             ("catsandog", ["cats", "dog", "sand", "and", "cat"], []),
         ]
     )
-    def test_word_break_dp_2(self, s: str, word_dict: List[str], expected: List[str]):
-        actual = word_break_dp_2(s, word_dict)
+    def test_word_break_dp_tabulation_2(
+        self, s: str, word_dict: List[str], expected: List[str]
+    ):
+        actual = word_break_dp_tabulation_2(s, word_dict)
         actual.sort()
         expected.sort()
         self.assertListEqual(expected, actual)
@@ -167,12 +177,58 @@ def test_word_break_dp_2(self, s: str, word_dict: List[str], expected: List[str]
             ("catsandog", ["cats", "dog", "sand", "and", "cat"], []),
         ]
     )
-    def test_word_break_backtrack(self, s: str, word_dict: List[str], expected: List[str]):
+    def test_word_break_backtrack(
+        self, s: str, word_dict: List[str], expected: List[str]
+    ):
         actual = word_break_backtrack(s, word_dict)
         actual.sort()
         expected.sort()
         self.assertListEqual(expected, actual)
 
+    @parameterized.expand(
+        [
+            (
+                "magiclly",
+                ["ag", "al", "icl", "mag", "magic", "ly", "lly"],
+                ["mag icl ly", "magic lly"],
+            ),
+            (
+                "raincoats",
+                ["rain", "oats", "coat", "s", "rains", "oat", "coats", "c"],
+                ["rain c oats", "rain c oat s", "rain coats", "rain coat s"],
+            ),
+            (
+                "highway",
+                ["crash", "cream", "high", "highway", "low", "way"],
+                ["highway", "high way"],
+            ),
+            ("robocat", ["rob", "cat", "robo", "bo", "b"], ["robo cat"]),
+            (
+                "cocomomo",
+                ["co", "mo", "coco", "momo"],
+                ["co co momo", "co co mo mo", "coco momo", "coco mo mo"],
+            ),
+            (
+                "catsanddog",
+                ["cat", "cats", "and", "sand", "dog"],
+                ["cats and dog", "cat sand dog"],
+            ),
+            (
+                "pineapplepenapple",
+                ["apple", "pen", "applepen", "pine", "pineapple"],
+                ["pine apple pen apple", "pineapple pen apple", "pine applepen apple"],
+            ),
+            ("catsandog", ["cats", "dog", "sand", "and", "cat"], []),
+        ]
+    )
+    def test_word_break_dp_memoization(
+        self, s: str, word_dict: List[str], expected: List[str]
+    ):
+        actual = word_break_dp_memoization(s, word_dict)
+        actual.sort()
+        expected.sort()
+        self.assertListEqual(expected, actual)
+
 
 if __name__ == "__main__":
     unittest.main()
