Can Get C++ to Skip Reading a Line

How to extract specific portions of a text file using Python

Updated: 06/30/2020 by Computer Hope

Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.

Make sure you're using Python 3

In this guide, we'll be using Python version 3. Most systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.

For Microsoft Windows, Python 3 can be downloaded from the Python official website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked, as shown in the image below.

Installing Python 3.7.2 for Windows.

On Linux, you can install Python 3 with your package manager. For instance, on Debian or Ubuntu, you can install it with the following command:

sudo apt-get update && sudo apt-get install python3

For macOS, the Python 3 installer can be downloaded from python.org, as linked above. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:

brew install python3

Running Python

On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.

Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().

Running Python with a file name will interpret that Python program. For example:

python3 program.py

...runs the program contained in the file program.py.

Okay, how can we use Python to extract text from a text file?

Reading information from a text file

First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

Note

In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.

A Python program can read a text file using the built-in open() function. For example, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.

myfile = open("lorem.txt", "rt") # open lorem.txt for reading text
contents = myfile.read()         # read the entire file to string
myfile.close()                   # close the file
print(contents)                  # print string contents

Here, myfile is the name we give to our file object.

The "rt" parameter in the open() function means "we're opening this file to read text data."

The hash mark ("#") means that everything after it on that line is a comment, and it's ignored by the Python interpreter.

If you save this program in a file called read.py, you can run it with the following command.

python3 read.py

The command above outputs the contents of lorem.txt:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

Using "with open"

It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.

When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.

Using with open...as, we can rewrite our program to look like this:

with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file to a string
print(contents)                          # Print the string

Note

Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.

Example

Save the program as read.py and execute it:

python3 read.py

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
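To see the automatic closing for yourself, you can inspect the file object's closed attribute. The following is a minimal sketch, not part of the original program: it writes its own throwaway file (the name sample_path is an assumption made only for this illustration) so it runs without lorem.txt.

```python
import os
import tempfile

# Create a small throwaway file so the sketch does not depend on lorem.txt.
with tempfile.NamedTemporaryFile("wt", suffix=".txt", delete=False) as tmp:
    tmp.write("Lorem ipsum dolor sit amet.\n")
    sample_path = tmp.name  # hypothetical name used only in this sketch

with open(sample_path, "rt") as myfile:
    contents = myfile.read()
    open_inside_block = not myfile.closed  # still open inside the block

closed_after_block = myfile.closed         # closed once the block ends

print(open_inside_block, closed_after_block)
os.remove(sample_path)                     # clean up the temporary file
```

The file object still exists after the block, but any attempt to read from it would fail because it has been closed.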

Reading text files line-by-line

In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll encounter an error.

In almost every case, it's a better idea to read a text file one line at a time.

In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.
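The "next result" behavior can be observed directly with the built-in next() function. This is a small sketch under one assumption: io.StringIO stands in for an open text file, so the example runs without any file on disk. A real file object opened with open() behaves the same way when iterated.

```python
import io

# io.StringIO stands in for an open text file; it also yields one line at a time.
myfile = io.StringIO("first line\nsecond line\n")

first = next(myfile)    # each call returns the next line, newline included
second = next(myfile)

try:
    next(myfile)        # nothing is left to read
    exhausted = False
except StopIteration:   # an exhausted iterator raises StopIteration
    exhausted = True

print(repr(first), repr(second), exhausted)
```

A for loop does the same thing behind the scenes: it asks for the next item until StopIteration is raised, then ends quietly.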

Example

For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:

with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
    for myline in myfile:                # For each line, read to a string,
        print(myline)                    # and print the string.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a line break of its own at the end of whatever you've asked it to print.

Let's store our lines of text in a variable (specifically, a list variable) so we can look at it more closely.

Storing text information in a variable

In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.

Example

mylines = []                             # Declare an empty list named mylines.
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text data.
    for myline in myfile:                # For each line, stored as myline,
        mylines.append(myline)           # add its contents to mylines.
print(mylines)                           # Print the list.

The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:

Output:

['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']

Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.

Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero; in other words, the nth element of a list has the numeric index n-1.

Note

If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.

Example

We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:

print(mylines[0])

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Example

Or the third line, by specifying index number 2:

print(mylines[2])

Output:

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

But if we try to access an index for which there is no value, we get an error. Our list has four elements, so the highest valid index is 3:

Example

print(mylines[4])

Output:

Traceback (most recent call last):
  File <filename>, line <linenum>, in <module>
    print(mylines[4])
IndexError: list index out of range
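If a program cannot be sure an index exists, it can catch the IndexError rather than crash. The following is one possible sketch; get_line is a hypothetical helper name, and the list contents are sample data rather than the lines of lorem.txt.

```python
mylines = ["line one", "line two", "line three", "line four"]  # sample data

def get_line(lines, index):
    """Return lines[index], or None if the index is out of range."""
    try:
        return lines[index]
    except IndexError:
        return None

count = len(mylines)           # number of elements; valid indexes are 0 to count - 1
last = get_line(mylines, 3)    # the fourth (last) element
missing = get_line(mylines, 4) # None, because there is no fifth element

print(count, last, missing)
```

Checking the index against len(mylines) before using it is an equally reasonable approach; which one reads better depends on the surrounding code.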

Example

A list object is iterable, so to print every element of the list, we can iterate over it with for...in:

mylines = []                             # Declare an empty list
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text.
    for line in myfile:                  # For each line of text,
        mylines.append(line)             # add that line to the list.
for element in mylines:                  # For each element in the list,
    print(element)                       # print it.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.

We can change this default behavior by specifying an end parameter in our print() call:

print(element, end='')

By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.

Example

Our revised program looks like this:

mylines = []                             # Declare an empty list
with open('lorem.txt', 'rt') as myfile:  # Open file lorem.txt
    for line in myfile:                  # For each line of text,
        mylines.append(line)             # add that line to the list.
for element in mylines:                  # For each element in the list,
    print(element, end='')               # print it without extra newlines.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.

How to strip newlines

To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, commonly whitespace, from either the beginning or end of the string.

Tip

This procedure is sometimes also called "trimming."

Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.

If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string of characters to strip. For example, "123abc".rstrip("bc") returns 123a.
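One detail worth seeing in action: rstrip() treats chars as a set of individual characters to remove, not as a suffix, and with no argument it removes all trailing whitespace. The short sketch below uses made-up sample strings to show the difference.

```python
suffix_like = "123abc".rstrip("bc")       # '123a'
set_of_chars = "abccbcb".rstrip("bc")     # 'a': every trailing b or c is removed
all_whitespace = "text \t\n".rstrip()     # 'text': no argument strips all whitespace
newline_only = "text \n".rstrip("\n")     # 'text ': the trailing space is kept

print(repr(suffix_like), repr(set_of_chars),
      repr(all_whitespace), repr(newline_only))
```

Passing '\n' explicitly, as this guide does, is the safer choice when trailing spaces or tabs in a line are meaningful.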

Tip

When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted: enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.

The statement string.rstrip('\n') will strip a newline character from the right side of string. The following version of our program strips the newlines when each line is read from the text file:

mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.
for element in mylines:                     # For each element in the list,
    print(element)                          # print it.

The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.

Now, let's search the lines in the list for a specific substring.

Searching text for a substring

Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.

The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.

Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.

In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string, it returns -1 to indicate nothing was found.

Example

print(mylines[0].find("e"))

Output:

3

The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)

The find() method takes two optional, additional parameters: a start index and a stop index, indicating where in the string the search should begin and end. For example, string.find("abc", 10, 20) searches for the substring "abc", but only from the 11th to the 20th character (indexes 10 through 19; the stop index itself is not searched). If stop is not specified, find() starts at index start, and stops at the end of the string.

Example

For instance, the following statement searches for "e" in mylines[0], beginning at the fifth character.

print(mylines[0].find("e", 4))

Output:

24

In other words, starting at the 5th character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").

Example

To get-go searching at index 10, and stop at alphabetize 30:

print(mylines[2].find("e", 10, 30))

Output:

28

(The first "e" in "Maecenas").

If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:

print(mylines[0].find("e", 25, 30))

Output:

-1

There were no "e" occurrences between indices 25 and 30.
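The start and stop behavior can be checked in one place. This sketch hard-codes the first line of lorem.txt as a string (so no file is needed) and shows that the stop index is excluded from the search.

```python
line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

first_e = line.find("e")               # the "e" in "Lorem"
next_e = line.find("e", 4)             # first "e" at or after index 4
none_found = line.find("e", 25, 30)    # searches indexes 25 through 29 only
stop_excluded = line.find("e", 4, 24)  # index 24 itself is not searched
stop_included = line.find("e", 4, 25)  # now index 24 is inside the range

print(first_e, next_e, none_found, stop_excluded, stop_included)
# prints: 3 24 -1 -1 24
```

The start and stop arguments follow the same convention as Python slices: start is included, stop is not.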

Finding all occurrences of a substring

But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.

In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string. Specifically, the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find returns -1, or the start index exceeds the length of the string, we stop.

# Build array of lines from file, strip newlines

mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.

# Locate and print all occurrences of letter "e"

substr = "e"                      # substring to search for.
for line in mylines:              # string to be searched
    index = 0                     # current index: character being compared
    prev = 0                      # previous index: last character compared
    while index < len(line):      # While index has not exceeded string length,
        index = line.find(substr, index)  # set index to first occurrence of "e"
        if index == -1:           # If nothing was found,
            break                 # exit the while loop.
        print(" " * (index - prev) + substr, end='')  # print spaces from previous
                                                      # match, then the substring.
        prev = index + len(substr)    # remember this position for next loop.
        index += len(substr)      # increment the index by the length of substr.
                                  # (Repeat until index > line length)
    print('\n' + line)            # Print the original string under the e's

Output:

   e                    e       e  e               e
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                         e  e
Nunc fringilla arcu congue metus aliquam mollis.
        e                   e e          e    e      e
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
      e
Quisque at dignissim lacus.
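If you need the positions themselves rather than a printed diagram, the same loop can be wrapped in a function that returns a list of indexes. This is a sketch of one way to do it; find_all is a hypothetical helper name, not a built-in string method, and it reports non-overlapping matches only.

```python
def find_all(text, substr):
    """Return the start index of every non-overlapping occurrence of substr."""
    if not substr:
        raise ValueError("substr must not be empty")  # avoids an endless loop
    positions = []
    index = text.find(substr)
    while index != -1:
        positions.append(index)
        # Resume the search just past the match we have recorded.
        index = text.find(substr, index + len(substr))
    return positions

line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
print(find_all(line, "e"))    # [3, 24, 32, 35, 51]
print(find_all(line, "xyz"))  # []
```

Returning a list keeps the search separate from the printing, so the result can be counted, sliced, or reused elsewhere.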

Incorporating regular expressions

For complex searches, use regular expressions.

The Python regular expressions module is called re. To use it in your program, import the module before you use it:

import re

The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.

For example, let's say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\w*r\b". What does this mean?

Character sequence    Meaning
\b                    A word boundary. It matches the empty string (it consumes no characters), but only at a position where a word character is next to a non-word character, or at the start or end of the string. "Word characters" are the digits 0 through 9, the lowercase and uppercase letters, or an underscore ("_").
d                     Lowercase letter d.
\w*                   \w represents any word character, and * is a quantifier meaning "zero or more of the previous character." So \w* will match zero or more word characters.
r                     Lowercase letter r.
\b                    Word boundary.

So this regular expression will match any string that can be described as "a word boundary, then a lowercase 'd', then zero or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.

To use this regular expression in Python search operations, we first compile it into a pattern object. For instance, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.

pattern = re.compile(r"\bd\w*r\b")

Note

The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret the escape sequences such as \b in other ways. Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
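The difference is easy to verify. In an ordinary string literal, \b is the backspace character; in a raw string it stays as a backslash followed by the letter b, which is what the re module needs to see. A brief sketch, using a sample sentence chosen for illustration:

```python
import re

plain = "\b"    # one character: backspace
raw = r"\b"     # two characters: a backslash and the letter b

text = "Good morning, doctor."
with_raw = re.search(r"\bdoctor\b", text)     # \b reaches re as a word boundary
without_raw = re.search("\bdoctor\b", text)   # re looks for literal backspaces

print(len(plain), len(raw))            # 1 2
print(with_raw is not None)            # True
print(without_raw is not None)         # False
```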

Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".

import re
text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")    # compile regex "\bd\w*r\b" to a pattern object
if pat.search(text) is not None:  # Search for the pattern. If found,
    print("Found it.")

Output:

Found it.

To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:

import re
text = "Hi, DoctoR."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(text) is not None:
    print("Found it.")

Output:

Found it.
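The match object returned by search() also tells you what was matched and where. The sketch below uses a made-up sample sentence, and adds the pattern object's findall() method, which returns every match as a list of strings.

```python
import re

pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)
text = "Good morning, Doctor. The dour destroyer left."

match = pat.search(text)          # first match only, or None
if match is not None:
    first_word = match.group()    # the text that matched
    first_span = match.span()     # (start, end) indexes of the match
    print(first_word, first_span) # Doctor (14, 20)

all_words = pat.findall(text)     # every non-overlapping match
print(all_words)                  # ['Doctor', 'dour', 'destroyer']
```

Note that the "d" at the end of "Good" is not matched, because it is not preceded by a word boundary.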

Putting it all together

So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.

Print all lines containing substring

The program below reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.

Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the print() statement, we construct an output string by joining several strings with the + operator.

errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)

Input (stored in logfile.txt):

This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!

Output:

Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
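The same result can be reached with two built-in conveniences: enumerate(), which supplies the line number, and the in operator, which tests for a substring without needing the index. The sketch below is an alternative, not a replacement for the program above; io.StringIO stands in for logfile.txt so it runs on its own, and the last sample line is capitalized to show the case-insensitive match.

```python
import io

# io.StringIO stands in for logfile.txt so the sketch is self-contained.
log = io.StringIO(
    "This is line 1\n"
    "This is line 2\n"
    "Line 3 has an error!\n"
    "This is line 4\n"
    "Line 5 also has an ERROR!\n"
)

substr = "error"                                 # lowercase substring to search for
errors = []
for linenum, line in enumerate(log, start=1):    # count lines starting at 1
    if substr in line.lower():                   # case-insensitive substring test
        errors.append("Line " + str(linenum) + ": " + line.rstrip("\n"))

for err in errors:
    print(err)
```

Use find() when you need to know where the substring is, and in when you only need to know whether it is there.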

Extract all lines containing substring, using regex

The program below is similar to the above program, but using the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similar to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.

import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                            # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])

Output:

Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 Error: usb 1-1: can't set config, exiting.

Extract all lines containing a phone number

The program below prints any line of a text file, info.txt, which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". This regex matches the following phone number notations:

  • 123-456-7890
  • (123) 456-7890
  • 123 456 7890
  • 123.456.7890
  • +91 (123) 456-7890

import re
errors = []
linenum = 0
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('info.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If pattern search finds a match,
            errors.append((linenum, line.rstrip('\n')))
for err in errors:
    print("Line ", str(err[0]), ": " + err[1])

Output:

Line  3 : My phone number is 731.215.8881.
Line  7 : You can reach Mr. Walters at (212) 558-3131.
Line  12 : His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line  14 : She can also be contacted at (888) 312.8403, extension 12.
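It can help to try a pattern like this against a handful of known strings before running it on a real file. The sketch below tests the same regular expression against the notations listed above plus two made-up negative cases; the samples list is an assumption for illustration only.

```python
import re

pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

samples = [
    "123-456-7890",
    "(123) 456-7890",
    "123 456 7890",
    "123.456.7890",
    "+91 (123) 456-7890",
    "no digits here",
    "call 12345",
]

# Map each sample to True if the pattern is found anywhere in it.
results = {s: pattern.search(s) is not None for s in samples}
for sample, found in results.items():
    print(found, sample)

# search() reports the first place the pattern fits, which here is
# the final seven digits and the separator in front of them.
matched_text = pattern.search("123-456-7890").group()
print(repr(matched_text))   # '-456-7890'
```

As the last line shows, the pattern succeeds on the final seven digits, so it will also accept any line that contains three digits followed by four more. For stricter validation the expression would need to account for the area code as well.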

Search a dictionary for words

The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.

import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:
        if pattern.search(line) is not None:
            print(line, end='')

Output:

Hope
heliotrope
hope
hornpipe
horoscope
hype
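Not every system has /usr/share/dict/words, so here is a small sketch that runs the same pattern over a short, made-up word list held in memory. io.StringIO stands in for the dictionary file, and the five sample words are assumptions chosen to show both matches and non-matches.

```python
import io
import re

# A tiny stand-in for the dictionary file: one word per line.
words = io.StringIO("Hope\nheliotrope\nhopeless\nshape\nhype\n")

pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)

# Keep each word whose line matches; strip the newline for a clean list.
matches = [line.rstrip("\n") for line in words if pattern.search(line)]
print(matches)   # ['Hope', 'heliotrope', 'hype']
```

The $ anchor matches just before the newline at the end of each line, which is why the pattern works on unstripped lines. "hopeless" is rejected because it does not end in pe, and "shape" because its h is not at the start of the word.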

Source: https://www.computerhope.com/issues/ch001721.htm
