Chapter 2: String Data

String Constants

Strings are collections of characters stored together to represent arbitrary text inside a Python program. You can create a string constant by surrounding text with single quotes ('), double quotes ("), or three of either type of quote (''' or """). In the first two cases, the opening and closing quotes must appear on the same line in your program; when you use triple quotes, your text can span as many lines as you like. The choice of which quote symbol to use is up to you: single and double quotes have the same meaning in Python.

Here are a few examples of how to create a string constant and assign its value to a variable:

name = 'Phil'
value = "$7.00"
helptext = """You can create long strings of text
spanning several lines by using triple quotes at
the beginning and end of the text"""

When the variable helptext is printed, it displays as three lines, with the line breaks at the same points as in the triple-quoted text.
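As a quick check of the claims above, the following sketch compares the two quoting styles and counts the lines stored in a triple-quoted string. The names a, b, same and lines are illustrative only, and print is written with parentheses on the assumption that the code may be run on a newer interpreter where print is a function.

```python
# Single and double quotes produce identical strings.
a = 'Phil'
b = "Phil"
same = (a == b)

# A triple-quoted string keeps the line breaks typed inside it.
helptext = """You can create long strings of text
spanning several lines by using triple quotes at
the beginning and end of the text"""
lines = helptext.split('\n')

print(same)
print(len(lines))
```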

You can also create strings by reading input from a file or by concatenating smaller strings.

Special Characters and Raw Strings:

There are special character sequences, beginning with a backslash (\), which are interpreted in a special way. A single backslash at the end of a line acts as a continuation character, which offers an alternative to triple-quoted strings when you are constructing a string constant. The continuation backslash removes the line break from the string, so each line break you want to keep must be written explicitly as \n. Thus, the following two expressions are equivalent, but most programmers prefer the convenience of not having to use backslashes which is offered by triple quotes.

threelines = 'First\n\
Second\n\
Third'
threelines = '''First
Second
Third'''

The backslashed quote symbols are useful if you need to create a string containing both single and double quotes. (If you have only one kind of quote in your string, you can simply use the other kind to surround your string, since the two types of quotes are equivalent in Python.) You can produce a backslash in a string by typing two backslashes; note that only one of the backslashes will actually appear in the string when it’s printed.

There are certain situations (most notably when constructing regular expressions) when typing two backslashes to get a single backslash becomes tedious. Python provides what are called raw strings, in which the character sequences shown in Table 2.1 have no special meaning. To construct a raw string, precede the opening quote character with either a lowercase or uppercase “R” (r or R). Note, however, that a backslash cannot be the very last character of a raw string. Apart from that restriction, a raw string behaves like any other string, so these two expressions are equivalent:

>>> print 'Here is a backslash: \\ '
Here is a backslash: \
>>> print r'Here is a backslash: \ '
Here is a backslash: \
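The difference between normal and raw strings can be sketched as follows. The sample sentence and the pattern are made up for illustration, and print is written with parentheses on the assumption that the code may be run on a newer interpreter.

```python
import re

# In a normal string, \n is one character (a newline);
# in a raw string it stays as two characters: a backslash and an n.
normal = '\n'
raw = r'\n'
print(len(normal))
print(len(raw))

# Doubling the backslash in a normal string gives the same result
# as writing a single backslash in a raw string.
print('\\d+' == r'\d+')

# Raw strings keep regular expression patterns readable.
pattern = r'\d+\.\d+'
match = re.search(pattern, 'The price is 7.00 dollars')
print(match.group(0))
```

Writing the same pattern as a normal string would require '\\d+\\.\\d+', which is the tedium that raw strings avoid.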

Unicode Strings

Starting with version 2.0, Python provides support for Unicode strings, whose characters are stored in 16 bits instead of the 8 bits used by a normal string. To specify that a string should be stored using this format, precede the opening quote character with either a lowercase or uppercase “U”. In addition, an arbitrary Unicode character can be specified with the notation “\uhhhh”, where hhhh represents a four-digit hexadecimal number. Notice that if a Unicode string is combined with a regular string, the resulting string will also be a Unicode string.
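A minimal sketch of the \uhhhh notation follows. The word used is just an example. One assumption to flag: on newer interpreters (Python 3.3 and later) every string is a Unicode string and the u prefix is accepted only for compatibility, so the type comparison below holds there as well, though for a different reason.

```python
# \u00e9 is the four-digit hexadecimal code for "e" with an acute accent.
word = u'caf\u00e9'
print(len(word))

# Combining a Unicode string with a regular string literal
# yields a string of the same type as the Unicode operand.
combined = word + ' au lait'
print(type(combined) is type(word))
```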

String Operations:

Concatenation:

The addition operator (+) takes on the role of concatenation when used with strings, that is, the result of “adding” two strings together is a new string which consists of the original strings put together:

>>> first = 'black'
>>> second = 'jack'
>>> first + second
'blackjack'

No space or other character is inserted between concatenated strings. If you do want a space, you can simply add it:

>>> first + " " + second
'black jack'

When the strings you want to concatenate are literal string constants, that is, pieces of text surrounded by any of the quotes that Python accepts, you don’t even need the plus sign; Python will automatically concatenate string constants that are separated by whitespace:

>>> nums = "one " "two " "three "
>>> nums
'one two three '
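One place where automatic concatenation of literals is often used is in breaking a long constant across several lines inside parentheses. The sketch below shows this next to explicit concatenation; the variable names combined and message are illustrative, and print is written with parentheses on the assumption that the code may be run on a newer interpreter.

```python
first = 'black'
second = 'jack'

# Explicit concatenation works with variables and literals alike.
combined = first + ' ' + second

# Adjacent literals are joined automatically, which is handy for
# breaking a long constant across lines inside parentheses.
message = ('one '
           'two '
           'three')

print(combined)
print(message)
```

Note that the automatic form applies only to literal constants; two variables placed side by side are a syntax error and must be joined with the plus sign.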

You can freely mix variables and string constants together; anywhere that Python expects a string, you can supply either a variable or a constant:

>>> msg = """Send me email
... My address is """
>>> msg + "[email protected]"
'Send me email\012My address is [email protected]'

Notice that the newline which was entered in the triple-quoted string appears as \012, the octal representation of the (non-printable) newline character. When you use the print statement, the newline is displayed as you would expect:

>>> print msg + '[email protected]'
Send me email
My address is [email protected]

Remember that Python considers the type of an object when it tries to apply an operator, so that if you try to concatenate a string and a number, you’ll have problems:

>>> x = 12./7.
>>> print 'The answer is ' + x
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation

The number (x) must first be converted to a string before it can be concatenated. Python provides two ways to do this: the core function repr, or the backquote operator (``). The following example shows both of these techniques used to solve the problem of concatenating strings and numbers:

>>> print 'The answer is ' + repr(x)
The answer is 1.71428571429
>>> print 'The answer is ' + `x`
The answer is 1.71428571429

Notice that Python uses its default of 12 significant figures; if you want more control over the way numbers are converted to strings, use the percent operator (%) for string formatting. When you want to concatenate lots of strings together, the join method for strings or the join function in the string module is more convenient.
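The conversions discussed above can be sketched together as follows. The built-in str function and the %.3f format are not covered in the text above and are included here as assumptions about what a reader is likely to want; the backquote form is left out because newer interpreters no longer accept it. Output is shown only for the formatted results, since the number of digits produced by repr varies between Python versions.

```python
x = 12. / 7.

# repr and str both turn the number into a string that can be concatenated.
with_repr = 'The answer is ' + repr(x)
with_str = 'The answer is ' + str(x)

# The % operator gives control over the number of digits shown.
formatted = 'The answer is %.3f' % x

# join builds one string from a list of pieces in a single step.
pieces = ['The', 'answer', 'is', '%.3f' % x]
joined = ' '.join(pieces)

print(formatted)
print(joined)
```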

Repetition:

The asterisk (*), when used between a string and an integer, creates a new string with the old string repeated by the value of the integer. The order of the arguments is not important. So the following two expressions both produce a string of ten dashes:

>>> '-' * 10
'----------'
>>> 10 * '-'
'----------'

Trying to use the repetition operator between two strings results in a TypeError exception.
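A common use of repetition is drawing a rule that matches the width of some other text. The sketch below does that and also confirms the TypeError mentioned above; the names title, underline and failed are illustrative.

```python
title = 'String Data'

# Repeat a dash once for every character in the title.
underline = '-' * len(title)
print(title)
print(underline)

# Repetition needs an integer on one side; two strings raise TypeError.
try:
    '-' * '10'
except TypeError:
    failed = True
else:
    failed = False
print(failed)
```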

Indexing and Slicing:

Strings in Python support indexing and slicing. To extract a single character from a string, follow the string with the index of the desired character surrounded by square brackets ([ ]), remembering that the first character of a string has index zero.

>>> what = 'This parrot is dead'
>>> what[3]
's'
>>> what[0]
'T'

If the subscript you provide between the brackets is less than zero, Python counts from the end of the string, with a subscript of -1 representing the last character in the string.

>>> what[-1]
'd'

To extract a contiguous piece of a string (known as a slice), use a subscript consisting of the starting position followed by a colon (:), finally followed by one more than the ending position of the slice you want to extract. Notice that the slicing stops immediately before the second value:

>>> what[0:4]
'This'
>>> what[5:11]
'parrot'

One way to think about the indexes in a slice is that you give the starting position as the value before the colon, and the starting position plus the number of characters in the slice after the colon. For the special case when a slice starts at the beginning of a string, or continues until the end, you can omit the first or second index, respectively. So to extract all but the first character of a string, you can use a subscript of 1:.

>>> what[1:]
'his parrot is dead'

To extract the first 3 characters of a string you can use :3.

>>> what[:3]
'Thi'

If you use a value for a slice index which is larger than the length of the string, Python does not raise an exception, but treats the index as if it were the length of the string. As always, variables and integer constants can be freely mixed:

>>> start = 3
>>> finish = 8
>>> what[start:finish]
's par'
>>> what[5:finish]
'par'

If the second index refers to a position at or before that of the first index, the result is an empty string. If either index is not an integer, a TypeError exception is raised unless, of course, that index was omitted.
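The slicing rules described in this section can be gathered into one short sketch. The particular index values are chosen for illustration, and print is written with parentheses on the assumption that the code may be run on a newer interpreter.

```python
what = 'This parrot is dead'

# The second index is the start plus the number of characters wanted.
start, length = 5, 6
piece = what[start:start + length]
print(piece)

# An index past the end is treated as the length of the string.
tail = what[15:100]
print(tail)

# A second index that comes before the first gives an empty string.
empty = what[8:3]
print(repr(empty))

# A non-integer index raises TypeError.
try:
    what[1.5:4]
except TypeError:
    bad_index = True
else:
    bad_index = False
print(bad_index)
```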

Functions and Methods for Character Strings:

The core language provides only one function which is useful for working with strings: the len function, which returns the number of characters which a character string contains. In versions of Python earlier than 2.0, tools for working with strings were provided by the string module. Starting with version 2.0, strings in Python became “true” objects, and a variety of methods were introduced to operate on strings. If you find that the string methods described in this section are not available with your version of Python, look for equivalent capabilities in the string module. (Note that on some systems, a newer version of Python may be available through the name python2.)

Since strings are the first true objects we’ve encountered, a brief description of methods is in order. As mentioned earlier, when dealing with objects, functions are known as methods. Besides the terminology, methods are invoked slightly differently from functions. When you call a function like len, you pass the arguments in a comma-separated list surrounded by parentheses after the function name. When you invoke a method, you provide the name of the object the method is to act upon, followed by a period, finally followed by the method name and the parenthesized list of additional arguments. Remember to provide empty parentheses if the method does not take any arguments, so that Python can distinguish a method call with no arguments from a reference to a variable stored within the object.

Strings in Python are immutable objects; this means that you can’t change the value of a string in place. If you do want to change the value of a string, you need to invoke a method on the variable containing the string you wish to change, and to reassign the value of that operation to the variable in question, as some of the examples below will show.
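The difference between calling a function and invoking a method, and the effect of immutability, can be sketched as follows. The capitalize method is used only as a convenient example, and the variable names are illustrative.

```python
language = 'python'

# A function receives the string as an argument.
length = len(language)

# A method is invoked on the string itself and returns a new string.
capitalized = language.capitalize()
unchanged = (language == 'python')

# Assigning to a position inside a string is not allowed.
try:
    language[0] = 'P'
except TypeError:
    in_place_failed = True
else:
    in_place_failed = False

# To "change" the string, rebind the variable to the returned value.
language = language.capitalize()

print(length)
print(unchanged)
print(in_place_failed)
print(language)
```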

Python provides many string methods. Among the most useful are split and join. The split method operates on a string and returns a list, each of whose elements is a word in the original string, where a word is defined by default as a group of non-whitespace characters, separated from its neighbors by one or more whitespace characters. If you provide one optional argument to the split method, it is used as the separator in place of runs of whitespace. Note the subtle difference between invoking split with no arguments, and with an argument consisting of a single blank space:

>>> phrase = 'This parrot  is dead'
>>> phrase.split()
['This', 'parrot', 'is', 'dead']
>>> phrase.split(' ')
['This', 'parrot', '', 'is', 'dead']

When more than one space is encountered in the string, the default behavior treats the run as if it were a single space, but when we explicitly set the separator to a single space, multiple spaces in the string result in empty strings in the resulting list. You can also obtain the default behavior for split by specifying None for the sep argument.

The maxsplit argument limits the number of splits performed, so the resulting list has at most maxsplit + 1 elements. This can be very useful when you only need to split part of a string, since the remaining pieces will be put into a single element of the list which is returned. For example, suppose you had a file containing definitions of words, with the word being the first string and the definition consisting of the remainder of the line. By setting maxsplit to 1, the word would become the first element of the returned list, and the definition would become the second element of the list, as the following example shows:

>>> line = 'Ni a sound that a knight makes'
>>> line.split(maxsplit=1)
['Ni', 'a sound that a knight makes']

In some versions of Python, the split method will not accept a named argument for maxsplit. In that case, you would need to explicitly specify the separator, using None to obtain the default behavior.

>>> line.split(None, 1)
['Ni', 'a sound that a knight makes']
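The definitions-file scenario can be sketched without an actual file by using a list of lines. The second entry, the dictionary named definitions, and the loop are all hypothetical additions for illustration; the positional form split(None, 1) is used because it does not depend on support for named arguments.

```python
# Hypothetical lines of a definitions file: a word, then its definition.
lines = [
    'Ni a sound that a knight makes',
    'spam   a canned meat product',
]

definitions = {}
for line in lines:
    # None keeps the default whitespace splitting, and 1 limits the
    # result to the word plus the rest of the line.
    word, meaning = line.split(None, 1)
    definitions[word] = meaning

print(definitions['Ni'])
print(definitions['spam'])
```

Because the default whitespace splitting is in effect, the extra spaces after the word in the second line do not end up in the definition.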

When using the join method for strings, remember that the method operates on the string which will be used between each element of the joined list, not on the list itself. This may result in some unusual looking statements:

>>> words = ['spam', 'spam', 'bacon', 'spam']
>>> ' '.join(words)
'spam spam bacon spam'
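Any string can serve as the separator, as this short sketch suggests. The variable names are illustrative, and print is written with parentheses on the assumption that the code may be run on a newer interpreter.

```python
words = ['spam', 'spam', 'bacon', 'spam']

# The separator is the string the method is invoked on.
sentence = ' '.join(words)
comma_separated = ','.join(words)
run_together = ''.join(words)

# Splitting on the same separator recovers the original list.
round_trip = sentence.split(' ')

print(comma_separated)
print(run_together)
print(round_trip == words)
```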

Of course, you could assign the value of ' ' to a variable to improve the appearance of such a statement.

The index and find methods can be useful when trying to extract substrings, although techniques using the re module will generally be more powerful. As an example of the use of these methods, suppose we have a string with a parenthesized substring, and we wish to extract just that substring. Using the slicing techniques explained earlier, and locating the substring using, for example, index and rindex, here’s one way to solve the problem:

>>> model = 'Turbo Accelerated Widget (MMX-42b) Press'
>>> try:
...     model[model.index('(') + 1 : model.rindex(')')]
... except ValueError:
...     print 'No parentheses found'
...
'MMX-42b'
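The same extraction can also be sketched with find and rfind, which signal a missing substring by returning -1 rather than by raising an exception. The helper name between_parens is hypothetical, and wrapping the logic in a function is a choice made here for illustration rather than something the text prescribes.

```python
def between_parens(text):
    """Return the text between the first '(' and the last ')', or None."""
    start = text.find('(')
    end = text.rfind(')')
    # find and rfind return -1 instead of raising an exception.
    if start == -1 or end == -1 or end < start:
        return None
    return text[start + 1:end]

found = between_parens('Turbo Accelerated Widget (MMX-42b) Press')
missing = between_parens('No model number here')

print(found)
print(missing)
```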

When you use these methods, make sure to check for the case where the substring is not found: either catch the ValueError raised by the index methods, or test for the returned value of -1 from the find methods.

Remember that the string methods will not change the value of the string they are acting on, but you can achieve the same effect by overwriting the string with the returned value of the method. For example, to replace a string with an equivalent version consisting of all upper-case characters, statements like the following could be used:

>>> language = 'python'
>>> language = language.upper()
>>> language
'PYTHON'

Finally, Python offers a variety of so-called predicate methods, which take no arguments, and return 1 (True in versions of Python that have a Boolean type) if all the characters in a string are of a particular type, and 0 (False) otherwise. These methods, whose use should be obvious from their names, include isalnum, isalpha, isdigit, islower, isspace, istitle, and isupper.
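A few of the predicate methods are sketched below, followed by one possible use: checking input before converting it to a number. The sample strings and the names entry and value are illustrative only. The results are described in comments rather than shown as output, since older interpreters display 1 and 0 where newer ones display True and False.

```python
# Each predicate tests every character of the string.
all_letters = 'parrot'.isalpha()
all_digits = '42'.isdigit()
letters_and_digits = 'MMX42b'.isalnum()
with_hyphen = 'MMX-42b'.isalnum()      # false: the hyphen is neither
only_spaces = '   '.isspace()
title_case = 'Dead Parrot'.istitle()
empty_case = ''.isalpha()              # false: an empty string fails

# Validate input before converting it to a number.
entry = '1200'
if entry.isdigit():
    value = int(entry)
else:
    value = None
print(value)
```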

Related modules: string, re, StringIO.

Related exceptions: TypeError, IndexError.