Grabbing a String From Dynamic Content
A few days ago I was asked to debug someone’s code. The problem lay in a small function of around five lines. Its purpose was to grab a URL from the src attribute of an img tag within an RSS feed. This was a PHP project, and the function used a mixture of substr and str_replace. First, the tag name and the start of the attribute were replaced with an empty string, e.g. str_replace('<img src="', '', $input). Then substr was used to extract the URL.
The application was not working, and I eventually found that there was some extra whitespace in front of the extracted URL string. We fixed it by applying the trim function. I dislike this solution, because it patches one symptom and leaves the function just as fragile as before.
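To make the problem concrete, here is a rough reconstruction of that kind of function. It is not the original code, which I am not reproducing here. The function name, the sample inputs and the example.com URL are all made up for illustration, and the leading-whitespace input is only one plausible way the bug could have appeared.

```php
<?php
// Hypothetical reconstruction of the fragile approach (not the original code).
function extractImageSrcNaive(string $input): string
{
    // Step 1: remove the literal prefix. This only works when the input
    // contains exactly '<img src="', with one space and src as first attribute.
    $stripped = str_replace('<img src="', '', $input);

    // Step 2: keep everything up to the next double quote.
    $end = strpos($stripped, '"');
    return substr($stripped, 0, $end === false ? strlen($stripped) : $end);
}

$clean  = '<img src="http://example.com/logo.png" />';
$padded = '  <img src="http://example.com/logo.png" />';  // leading whitespace
$spaced = '<img  src="http://example.com/logo.png" />';   // two spaces after img

var_dump(extractImageSrcNaive($clean));   // string(27) "http://example.com/logo.png"
var_dump(extractImageSrcNaive($padded));  // string(29) "  http://example.com/logo.png"
var_dump(extractImageSrcNaive($spaced));  // string(10) "<img  src="
```

A trim call repairs the second case, but the third case shows that one extra space inside the tag still breaks the function completely.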
In this particular situation, the task is to extract an image URL from an RSS feed. The better solution is to use an XML parser, such as SimpleXMLElement, to locate the element and extract the attribute value. Simple string searching functions are fragile here, because a slight change in the input can easily cause a bug. An XML parser extracts the content accurately even if there is irregular spacing or a minor change in how the tags are arranged.
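Here is a minimal sketch of the parser approach. The feed below is invented for the example, and it assumes the img tag appears in the feed as a real, well-formed XML element. Many feeds instead put HTML inside CDATA or escape it, in which case the description text would have to be parsed separately (for instance with DOMDocument). The function name extractImageSrc is also my own.

```php
<?php
// Made-up sample feed; note the irregular spacing and attribute order in <img>.
$feed = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <description>
        <p>Some text</p>
        <img   alt="Logo"
               src="http://example.com/images/logo.png" />
      </description>
    </item>
  </channel>
</rss>
XML;

// Returns the src of the first <img> element, or null if none is found
// or the input is not well-formed XML.
function extractImageSrc(string $xml): ?string
{
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    try {
        $root = new SimpleXMLElement($xml);
    } catch (Exception $e) {
        return null;
    } finally {
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    // XPath: any <img> element, anywhere in the document, that has a src attribute.
    $nodes = $root->xpath('//img[@src]');
    if (empty($nodes)) {
        return null;
    }
    return (string) $nodes[0]['src'];
}

var_dump(extractImageSrc($feed));  // string(34) "http://example.com/images/logo.png"
```

The extra spaces, the line break inside the tag and the alt attribute placed before src make no difference, because the parser works on the element structure and not on the raw characters.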
For unstructured text, regular expressions are a good alternative. Sadly, many programmers do not know them. Regular expressions may be hard to learn, but they are an extremely powerful tool for string searching! I am not going to talk about why they are so powerful; there are many articles on this topic already.
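As a rough illustration, the same extraction could be written with preg_match. The pattern, the function name and the sample text are my own assumptions, and the pattern is deliberately simple. It will not cope with every possible piece of HTML (an unquoted src value, for example), so treat it as a sketch and not a complete solution.

```php
<?php
// Returns the src value of the first <img> tag in the text, or null if none matches.
function extractImageSrcRegex(string $text): ?string
{
    // <img\b      the tag name, matched case-insensitively (i flag)
    // [^>]*?      any other attributes, without leaving the tag
    // \ssrc\s*=   the src attribute, allowing spaces around "="
    // (["\'])     the opening quote, remembered as group 1
    // \s*(.*?)\s* the value without surrounding whitespace, as group 2
    // \1          the matching closing quote
    // The s flag lets "." and the value span line breaks.
    $pattern = '/<img\b[^>]*?\ssrc\s*=\s*(["\'])\s*(.*?)\s*\1/is';

    if (preg_match($pattern, $text, $matches) === 1) {
        return $matches[2];
    }
    return null;
}

// Made-up input with upper-case names, odd spacing and padding inside the quotes.
$text = 'Intro text <IMG   alt="Logo"' . "\n"
      . '  SRC = " http://example.com/images/logo.png " > more text';

var_dump(extractImageSrcRegex($text));          // string(34) "http://example.com/images/logo.png"
var_dump(extractImageSrcRegex('no image here')); // NULL
```

Unlike the str_replace version, this does not depend on an exact literal prefix, so irregular spacing, attribute order and letter case no longer matter.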
NOTE: I am not implying that regular expressions and XML parsers are the best solutions to all string searching problems. It depends on the requirements. PHP’s native string searching functions are less flexible, but they are generally much faster than regular expressions. When making a decision, I consider the performance required and how structured the input is.