Thursday, 3 July 2014

Strip Punctuation from String in Python along with Time Efficiency Analysis

You have probably googled how to remove punctuation characters in Python and come across several StackOverflow posts describing different ways to do it. But do you know which one is best, or which one is the most time-efficient? When you scale your applications to large databases, you need to think from these perspectives in order to save computation time.

This blog post discusses 3 different methods of stripping punctuation from a string, along with a comparison of their computation time.

The 3 different methods are:
  1. Built-in "String Translate" function: This method is the most time-efficient. However, it also strips punctuation that carries meaning.
    e.g.: It will convert 'didn't' to 'didnt', which is no longer a proper word.
  2. Splitting the string into words and then using 'String Punctuation': It splits the whole string on whitespace and then strips the punctuation from both ends of each word.
  3. Using Regular Expression Substitution: It replaces every match of a punctuation pattern with an empty string.
Example:

review_string = "I'd give Ardor a TWO THUMBS UP!\nLove the food.. esp the Indian and total value for money!\nThe drinks are well made and the food is to die for. \nI've been there about 5 times - mix of at night and for lunch.\nDuring the night, its like the new watering hole for the young crowd, Love the energy it has. \nThe ambiance is great.\nFor those of you who have not been there yet, what are you waiting for?!\n\nI'll totally recommend this place to anyone!'give'"

1. Using String.Translate

Code (Python 2 signature; requires import string): review_string.translate(None, string.punctuation)
Time: 1000000 loops, best of 3: 1.53 µs per loop
Result:
'Id give Ardor a TWO THUMBS UP\nLove the food esp the Indian and total value for money\nThe drinks are well made and the food is to die for \nIve been there about 5 times mix of at night and for lunch\nDuring the night its like the new watering hole for the young crowd Love the energy it has \nThe ambiance is great\nFor those of you who have not been there yet what are you waiting for\n\nIll totally recommend this place to anyonegive'
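The call above relies on the Python 2 form of str.translate. In Python 3 the second (deletechars) argument no longer exists, and the usual equivalent builds a deletion table with str.maketrans. The sketch below assumes Python 3 and uses a shortened sample string rather than the full review; the keep_apostrophe variant is a hypothetical tweak, not something from the original benchmark.

```python
import string

# Shortened sample for illustration; not the full review from the post.
sample = "I'd give Ardor a TWO THUMBS UP!\nLove the food.. esp the Indian!"

# Python 3: build a table that maps every punctuation character to None (delete).
delete_all = str.maketrans('', '', string.punctuation)
print(repr(sample.translate(delete_all)))
# 'Id give Ardor a TWO THUMBS UP\nLove the food esp the Indian'

# Hypothetical variant: leave apostrophes alone so contractions like "I'd" survive.
keep_apostrophe = str.maketrans('', '', string.punctuation.replace("'", ''))
print(repr(sample.translate(keep_apostrophe)))
# "I'd give Ardor a TWO THUMBS UP\nLove the food esp the Indian"
```

Note that the variant keeps every apostrophe, including stray quote marks, because translate works one character at a time with no notion of word boundaries.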




2. Using String Punctuation and Word Splitting

Code: ' '.join(word.strip(string.punctuation) for word in review_string.split())
Time: 10000 loops, best of 3: 43.8 µs per loop
Result:
"I'd give Ardor a TWO THUMBS UP Love the food esp the Indian and total value for money The drinks are well made and the food is to die for I've been there about 5 times  mix of at night and for lunch During the night its like the new watering hole for the young crowd Love the energy it has The ambiance is great For those of you who have not been there yet what are you waiting for I'll totally recommend this place to anyone!'give"
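To make the behaviour of this method easier to see, here is a small sketch wrapped in a function. The name strip_word_edges is made up for illustration, and dropping empty tokens is an addition of this sketch (the one-liner above leaves a double space where a lone "-" used to be).

```python
import string

def strip_word_edges(text):
    """Split on whitespace and strip punctuation from both ends of each word."""
    stripped = (word.strip(string.punctuation) for word in text.split())
    # Drop tokens that were pure punctuation (e.g. a lone "-") to avoid double spaces.
    return ' '.join(word for word in stripped if word)

print(strip_word_edges("I've been there about 5 times - mix of at night!"))
# I've been there about 5 times mix of at night
print(strip_word_edges("recommend this place to anyone!'give'"))
# recommend this place to anyone!'give
```

Because str.strip only looks at the two ends of each word, punctuation in the middle of a token (the !' in the second example) is left in place, and the original line breaks are lost when the words are joined with single spaces.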




3. Using Regular Expression Substitution

Code (requires import re):
p = re.compile(r'(\n)|(\r)|(\t)|(\')|(\u00A9)|([!"#$%&()*+,-./:;<=>?@\[\\\]^_`{|}~])', re.IGNORECASE)
re.sub(p,'',review_string)

Time: 10000 loops, best of 3: 101 µs per loop
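Result (as originally posted; note that it is identical to the input string, although the pattern above would also strip the newlines, apostrophes and listed punctuation):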
"I'd give Ardor a TWO THUMBS UP!\nLove the food.. esp the Indian and total value for money!\nThe drinks are well made and the food is to die for. \nI've been there about 5 times - mix of at night and for lunch.\nDuring the night, its like the new watering hole for the young crowd, Love the energy it has. \nThe ambiance is great.\nFor those of you who have not been there yet, what are you waiting for?!\n\nI'll totally recommend this place to anyone!'give'"
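As a hedged illustration of how a regular expression can be tuned, the sketch below builds the character class from string.punctuation and leaves out the apostrophe. The names punct_no_apostrophe and strip_with_regex are invented for this example, and the pattern is a simplified stand-in for the one in the post, not a copy of it.

```python
import re
import string

# Character class of all ASCII punctuation except the apostrophe; re.escape
# guards characters such as ] and \ that are special inside a class.
punct_no_apostrophe = string.punctuation.replace("'", '')
pattern = re.compile('[' + re.escape(punct_no_apostrophe) + ']')

def strip_with_regex(text):
    # sub() returns a new string; the input itself is never modified.
    return pattern.sub('', text)

sample = "I'll totally recommend this place to anyone!'give'"
print(strip_with_regex(sample))
# I'll totally recommend this place to anyone'give'
```

Since strings are immutable, the return value has to be captured or printed; inspecting the original variable afterwards will always show the unmodified text.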


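Conclusion:
String.Translate is very fast because it is implemented in C, but it deletes every listed character wherever it occurs, so it cannot be tailored to context (for example, keeping the apostrophe in 'didn't' while removing stray quotes). So you can either go for word stripping or, if you need very specific stripping rules, go for regular expression substitution, at the cost of more computation time.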
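The timings in this post look like IPython %timeit output. A rough way to repeat the comparison with only the standard library is sketched below, assuming Python 3, a short made-up test string, and a simplified regex. The absolute numbers will differ from the ones above depending on machine, Python version and input length.

```python
import re
import string
import timeit

# Short synthetic input; not the review string used for the timings in the post.
text = "I'd give Ardor a TWO THUMBS UP!\nLove the food.. esp the Indian!" * 5

table = str.maketrans('', '', string.punctuation)
pattern = re.compile('[' + re.escape(string.punctuation) + ']')

candidates = {
    'translate': lambda: text.translate(table),
    'split + strip': lambda: ' '.join(
        w.strip(string.punctuation) for w in text.split()),
    'regex sub': lambda: pattern.sub('', text),
}

calls = 1000
timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, converted to microseconds per call.
    best = min(timeit.repeat(func, number=calls, repeat=3))
    timings[name] = best / calls * 1e6
    print('%-14s %8.2f us per call' % (name, timings[name]))
```

Taking the minimum of several repeats mirrors the "best of 3" convention in the figures above and reduces noise from other processes.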