User:Erbingha/Text Editing


The "editing" function as described here is simply the input and manipulation of bytes or characters in a file. On Macintosh and PC systems, this function is often combined with the "formatting" function of laying out text on a page, into a single program called a "word processor", e.g., Microsoft Word, MacWrite, or WordPerfect. In UNIX, editing and formatting functions are strictly separate.

WARNING: if you want to use a Mac/PC word processing program to edit a file and then transfer it to a UNIX system for further processing, be sure to save it with the "TEXT-ONLY W/ LINE BREAKS" format on the Mac/PC. Otherwise, the special formatting control characters inserted by the word processing program will make the file difficult to use with UNIX utilities or programs.

Macintosh and PC systems are often more friendly when making minor changes to formatted documents - you can see the formatting changes as you update the file contents. However, the UNIX method of separating editing from formatting allows much more powerful editors, especially for writing computer programs or manipulating data files.

Editor choices on UNIX

There are three basic interactive text editors that are found on pangea and commonly on UNIX computers throughout the University: vi, pico, and emacs. In addition to these normal "screen" editors that let you make changes to text interactively, there is also a "streaming" editor named sed, which operates on a file in a non-interactive mode according to a set of instructions that you specify on the command line or in a script. It can be used for "canned" editing tasks that must be applied in the same way to many files.
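As a minimal sketch of such a "canned" edit, the loop below applies one sed substitution to every file in a scratch directory. The directory, the file names, and the substitution itself are invented for illustration, and mktemp is assumed to be available on your system.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample files, for illustration only.
workdir=$(mktemp -d)
printf 'color: red\n'  > "$workdir/a.txt"
printf 'color: blue\n' > "$workdir/b.txt"

# Apply the same fixed substitution to every file, writing each result
# to a new file so the originals stay untouched.
for f in "$workdir"/*.txt; do
    sed 's/color/colour/' "$f" > "$f.new"
done

# Show the edited copies, then clean up.
cat "$workdir"/*.new
rm -rf "$workdir"
```

Writing to a separate output file rather than over the input keeps the original available in case the instructions turn out to be wrong.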

vi

vi (pronounced "vee-eye") is the standard editor for UNIX systems. It is universally available on UNIX systems and will be described in detail in these notes.

pico

Many users of pangea and other UNIX systems around campus are familiar with the pine e-mail reader program. When you use this program to send e-mail, you have some simple editing commands available to compose your message, such as the ability to delete lines or insert new text at arbitrary locations. This very simple editor that is built into pine is also available as a stand-alone program for general use with the name pico. It is well-suited for the person who only needs to make occasional simple changes to UNIX files, such as modifying a .login initialization file. pico is so named because it is based on a very tiny subset of the editing commands in the large general purpose editor emacs.

The pico editor has six basic features: adding/deleting/changing text, paragraph justification, searching, block cut/paste, a spelling checker, and a file browser. One of the advantages of pico is that the list of commands you can use is always displayed on the screen, so you never have to remember them, and context-sensitive help is always available.

The style of pico is that you are always in insert mode: typing normal characters just inserts them into the file at the location of the cursor, and the delete (backspace) key simply erases the character immediately before the cursor. The arrow keys on the keyboard move the cursor around on the screen. More sophisticated functions than simple insertion or deletion of text characters are accomplished with commands that require the use of the CONTROL key. You hold down the CONTROL key like a shift key and then press another key. This is shown in the on-screen command list with a syntax like ^G, where the ^ character stands for the control key and is followed by the letter G which must also be pressed to run that command.

You start pico from the shell with a simple command that gives the name of the file you want to edit, for example: pico .login
If you want to create a new file, just give the name you want to use for that file as the argument to pico. Once you are in the pico editor, your screen will be divided into three regions. The top line is a status line that shows the name of the file you are editing. The bottom two lines list the commands that are available for you, including the help command. Everything in between is your window into the file. Move the cursor around with the arrow keys and insert or delete text. Use the CONTROL key commands listed on the screen to move the screen window to another section of file or perform the general editing functions such as cut and paste. Use the help command to get more detailed information on the other commands.

emacs

emacs is a widely used editor available on many types of systems. The "GNU" version is installed on pangea. This editor is very large and fully programmable with a built-in object oriented language. As a result, it has become more of a shell than an editor. Using "macros", it is possible to do many non-editing functions from within emacs, including compiling and debugging programs and browsing the web! I do not describe how to use emacs in these notes because I consider it far too complicated and bloated for simple editing. Entire books are available to teach emacs. If you are already familiar with emacs, you can use it on pangea and most other UNIX systems on campus. Simply type the shell command emacs to get started. From an X-window session on pangea, you can use the alternate command name xemacs to open a separate X window for each editing session.

Characteristics, advantages, and disadvantages of vi

vi is a "screen editor". It treats your computer screen as a "window" into the file. You move the window around to view different parts of the file. You move the cursor to the location on the screen where you want to make a change; or optionally, you specify some kind of global change. vi updates your screen to reflect changes that you make in the file. Actually, it works on a copy of the file in memory, and only updates the file on disk when you tell it to, such as when you end the editing session.

In every editor, the input device (keyboard) has two functions:

  1. Insert text into the file.
  2. Give commands to instruct the editor what to do.

Every editor has to have some means of keeping these functions separated. Most editors do this by either using special function keys (or a mouse) to give or start commands, or by requiring that the cursor be located on a special command line or area of the screen to give commands. In these types of editors, the keyboard is normally inserting or replacing text; keystrokes are not normally interpreted as commands to the editor.

vi takes a different approach to separating commands from text: modes. vi has two modes: command and insert. When in command mode, which is where you start out, everything you type, every key you press, is interpreted as a command to vi. You have to give special commands to switch to insert mode. When in insert mode, typing causes text to be inserted into the file (actually into the copy in memory, called the "buffer"). You have to press a special key (ESC) to go back to command mode.

Main advantages of vi

  • vi is universally available on UNIX systems. It has been around so long in a stable form that it is essentially "bug free". Many clones have been written for other kinds of computers, as well.
  • vi has many powerful commands that utilize just the alphanumeric keys -- does not require special function keys. Experts like this because they can tersely "touch type" the commands.
  • vi is a small program that does not require a lot of system memory or CPU time. It works very fast, even on large files.
  • While vi is not programmable, it has a simple way to let other UNIX programs, such as the sort utility, work on selected portions of your file. This adds the functionality of all those programs to the editor.
  • vi is completely terminal device independent. It will work with any kind of terminal. A system file describes the capabilities and control sequences of each kind of terminal for vi. All the program needs to know is what type of terminal you have. When you log in, if pangea cannot figure out what kind of terminal you have, it will prompt you to specify a terminal type. The most common type is the "vt100", which most modern terminals and PC communications software emulate.

The chief disadvantage of vi is that it is "touchy". That is, every single key you touch on the keyboard seems to do something, often something mysterious. There is a rich set of single character commands to learn. Like most things in UNIX, vi is not "column oriented". It does not have commands for affecting specific columns of each line (say columns 5-15). Rather, it is field or pattern oriented. Commands can be limited to a specified pattern of characters, however.

Basic operations in vi

These notes follow the order of presentation in "An Introduction to Display Editing with Vi", revised by Kroeger.

Notations in these notes: Most vi commands are one or two keystrokes, without pressing the RETURN key. When the RETURN key is actually needed, I will say so explicitly. These keystrokes do not show on the screen. For example, the G keystroke will move you to the bottom of the file, but you will never see the "G" appear anywhere on the screen!

Many commands use a combination of the CONTROL key and one of the normal keys. This combination is represented in these notes by "CTRL-" followed by the key letter name, shown in uppercase, for example, "CTRL-U". This means to hold down the CONTROL key like a shift key while you press the U key. Other books and references may show CONTROL key combinations using the "^" symbol, for example, "^U" to stand for the combination of the CONTROL and U keys. Although these notations show the uppercase version of the key to be pressed, you are not actually trying to enter an uppercase letter. Do not press the SHIFT key at the same time.

Other commands begin with a colon character. These commands usually set some parameter or make a global change. They must be ended by pressing the RETURN key. Search patterns (beginning with a / or ?) must also be ended by pressing the RETURN key.

Invoking and leaving the editor

The command: vi filename starts up the editor with a copy of the file filename, if it exists, loaded into memory and displayed on the screen. If filename does not exist, you get a blank screen that you can use to create it.

Format of the screen in vi.

  • Bottom line is "status line" - displays messages from vi; also used for entering global commands or search commands.
  • Remaining lines are your window into the file. If a particular line of the file is wider than the screen width, it wraps around onto additional lines.
  • If your window goes past the end of the file, the "unused" lines on the screen have a tilde character ("~") in column 1 and nothing else - this distinguishes them from blank lines.
  • You may see a line with an at-sign ("@") in the first column. This happens if a very long line must be wrapped onto more lines than are left on the screen. Rather than show only part of that line, possibly leading you to believe that it is shorter than it is, vi will simply show a group of lines containing the @ in column 1. When you move the screen window so that the entire line can now fit, then it will be displayed instead of those @ lines. vi may also put an @ character at the beginning of a line of text that has been changed, but has not yet been redrawn. This was useful in the days of slow terminal connections. You will probably never see this usage.

vi actually works on a copy of your file in memory, called the buffer. Your original file is not changed unless you explicitly instruct vi to do so, usually when you exit the program. Even then, you have the option of discarding your changes. In general, vi gives you the chance to undo changes you have made.

Most UNIX systems, including pangea, are also configured to automatically preserve the "buffer" copy of the file if your connection to the computer is accidentally broken or the computer crashes. In this case, you will generally get an e-mail the next day (after the normal nightly system management programs are run) informing you how to recover your saved vi session. Or you can simply run the command vi -r immediately after logging back in from a disconnect or crash to determine if your vi buffer was saved.

If you are concerned about losing changes that you have made to the buffer copy, or if you simply want to update the disk version to make a "snapshot" to which you can recover should further editing cause problems, you can instruct vi to update the original disk copy of the file with the editing changes you have made so far by typing the command:

    :w

The alternate form

    :w newfilename

saves the changes to a new disk file named newfilename and leaves the original disk copy untouched.
Here are your choices for leaving vi:

  • To update the original file with any changes you have made, and then exit the vi program, type one of these equivalent commands: ZZ or :wq or :x
  • To keep the changes in a new copy of the file, without changing the original, and then exit the vi program, type :wq new_name
  • To discard all your changes (no files updated on disk), and then exit the vi program, type :q!

Moving around in the file

In these notes, "down" or "forward" in the file means to move to a position FURTHER from the beginning of the file. "Up" or "backward" means to move to a position CLOSER to the beginning of the file. The file, which is simply a byte stream to UNIX, is separated by vi into lines according to the presence of "new-line" characters (a non-printing character). The end of a line is the last character in the line before the new-line, not the last column on the screen.

First, let me describe how you move the cursor around on the screen, without changing your location in the file. As part of its device independence, vi does not need special cursor arrow keys to move the cursor around. Instead, the "h", "j", "k", and "l" (letter "ell", not numeral "one") keys (lowercase only) control the cursor movement. If your terminal has arrow keys, and the terminal type has been correctly specified, they should also work. "h" moves the cursor left, "j" moves it down, "k" moves it up, and "l" moves it right, as shown in this schematic:

        k
    h       l
        j

The j and k keys try to maintain the cursor in the same column on the new line. However, if you move to a line with fewer columns, then the cursor moves left to the end of the line. If the cursor is already on the end of the line, a j or k command keeps the cursor on the end of the new line, even if that is a new column position.

Each time you press h, j, k, or l, it moves the cursor one position. You can also move in bigger chunks.

  • "H" (uppercase h) moves the cursor to the beginning of the top line on the screen, "M" (uppercase m) to the beginning of the middle line on the screen, and "L" (uppercase l) to the beginning of the last line on the screen.
  • The "+" or the RETURN key, by themselves, move the cursor to the first non-blank position on the next line. The "-" key has the opposite effect, moving to the previous line.

Other commands let you move the cursor within a line.

  • The "0" (zero) key moves the cursor to the first column of the line (whether a blank or a character). The "^" (caret) key moves it to the first non-blank character of the line.
  • The space bar moves one space to the right (same as the "l" key).
  • The "$" key moves to the end of the line.
  • "w" moves to the right by one word (blank or punctuation delimited).
  • "b" moves to the left by one word.
  • "e" moves to the (right) end of the current word.
  • "W", "B", and "E" (uppercase) have the same effect as their lowercase counterparts, except that they include any adjoining punctuation characters as part of the word. The w, b, and e keys (and their uppercase variants) continue onto the next line when you reach the end of one line.

You can move the window (screen) to show another set of lines in the file.

  • CTRL-F moves your window "forward" one entire screenful, except for two lines of overlap.
  • CTRL-B moves your window "backward" one entire screenful, except for two lines of overlap.
  • CTRL-D moves your window "down" about one-half screenful in the file.
  • CTRL-U moves you "up" about one-half screenful.

You can use a command to go to a specific line in the file. For this purpose, it may be helpful to see line numbers on the screen. To turn on line numbering on screen, use

    :set number

To turn off line numbering, use

    :set nonumber

These commands do not put line numbers into the file, only on the screen! To just find out the number of the line where the cursor is and total size of the file, use CTRL-G.

To go to line number "n" in the file, use the command "nG", substituting the actual line number you want for "n". vi will try to move the screen so the desired line is in the middle. A plain "G" with no preceding number goes to the end of the file. An alternate way to move to a desired line "n" in the file is the command

    :n

Searching the file for specific text

You can instruct vi to search for a line that contains a particular "string" (sequence of characters) and then move your window to show the area around that line. Use the command

    /abcd

where abcd is any sequence of characters, to search "down" through the file for the next line that contains "abcd".

Search expressions can actually be "regular expressions", which allow you to match patterns in addition to exact text matches. Be careful of regular expression metacharacters! Escape them by preceding with a backslash if you want to treat them as normal characters. When in doubt whether a character such as a period or asterisk has a special meaning, preceding it with a backslash is usually safe (use two backslashes in a row to get the backslash character itself). Or, you can turn off most regular expression searching, which then treats nearly all search characters literally (search for that specific character, rather than the pattern it represents; "^" at the start and "$" at the end of a pattern keep their special meaning), using this command:

    :set nomagic

Use the command

    ?abcd

to search "up" through the file for the next line containing "abcd".

vi remembers your last search command. To continue the search for the next occurrence, just type a single "n". To continue the search in the opposite direction from that in which it started, type "N". vi will wrap around the file when searching. When it reaches the bottom, it starts over again at the top, and vice-versa.
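The effect of escaping a metacharacter can be tried outside the editor with grep, which uses the same basic regular expression syntax. This is only a sketch; the two sample lines are invented for illustration.

```shell
#!/bin/sh
# Two invented sample lines: one with a literal period, one without.
sample='a.b
axb'

# Unescaped, the period matches any single character,
# so both lines are counted.
printf '%s\n' "$sample" | grep -c 'a.b'     # prints 2

# Escaped with a backslash, the period matches only itself,
# so only the first line is counted.
printf '%s\n' "$sample" | grep -c 'a\.b'    # prints 1
```

The single quotes keep the shell from interpreting the backslash before grep sees it.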

Deleting text

Use the "x" command to remove single characters. Position the cursor over the character to be removed and press the x key. Preceding x with a numeric value removes that many characters; for example, 5x removes five characters starting at the cursor position. Use "r" to replace a single character. Position the cursor where you want to make the replacement, then type r followed by the new character. To remove the entire line that the cursor is on, type "dd". Precede this with a numeric count to remove multiple lines; for example, 12dd removes the next twelve lines starting with the one the cursor is on.

Adding or inserting text

Adding or inserting arbitrary amounts of text requires that you change from "command" mode to "insert" mode. In command mode, single character commands usually take effect immediately when typed. In insert mode, you type the text to be inserted and terminate it (returning to command mode) with the ESC key. People used to word processing programs on Macintosh or PC will notice a significant difference in text input with vi. On most Mac/PC word processing programs, you do not need to press the RETURN key to signal the end of a line; the program will automatically "wrap" text onto the next line, breaking the lines at a word boundary. If you print such text, the line boundaries are preserved. Using the vi editor, unless you have set a special option (see below), the editor does not break new text into separate lines for you as you type. It "wraps" text around on the screen, but this is just its way of showing a very long line. If you print such text, you will only get the first part of the line (whatever fits on one line for that printer) - the rest will be "off the page". This behavior is a consequence of the fact that Mac/PC word processing programs are mixing the editing and formatting functions, but vi is strictly an editor. In vi, you should remember to press the RETURN key where you want to end each line of new text that you enter. You can force vi to break lines for you at the right margin when you are typing new text, so you don't have to press the RETURN key on every line. Remember, however, that there are many situations where you would not want this automatic line breaking, for example, when editing data files or computer source code files that may need to include long lines that must not be broken. 
If you really want vi to break lines automatically for you, use this command to set automatic line break mode:

    :set wrapmargin=n

The n in this command must be replaced by an integer number of your choosing, which is interpreted by vi to be how close it is allowed to get to the right margin of the screen before it must insert a RETURN character and start a new line. It will only break lines on a blank space between words. Therefore, it may actually break the line sooner than n columns before the right margin. As virtually all terminals and terminal emulation programs use 80 columns as the width of the screen, you can think of this wrapmargin option as forcing the maximum line length for new text to be (80-n) columns. The "wrapmargin" option can be abbreviated "wm", for example:

    :set wm=5

To remove this automatic line breaking mode, you can turn off this option with the command:

    :set wm=0

When typing new text in insert mode, you can use the backspace key to erase the character you just typed, CTRL-U to erase all new text on the current line, and CTRL-W to erase the last entire word just typed. The "erased" text does not disappear from the screen until you type over it or press the ESC key; this saves time redrawing the screen. The cursor does move to the left, back over the erased text, however, to show that it has been erased from the file (actually, from the buffer).

Here are the actual commands that switch vi into insert mode so you can add new text.

  • "i" and "a" are the basic insert mode commands. You position the cursor where you desire to insert text and type i to insert before the cursor, or a to insert after the cursor. You then type your text, which can have carriage returns in it (to make multiple lines). When done typing the new text, press the ESC key.
  • "I" (uppercase) is the equivalent of "i", except that it inserts just before the first non-blank character of the line the cursor is on, regardless of which column the cursor is in.
  • "A" (uppercase) is the equivalent of "a", except that it always adds text at the end of the current line. As usual, terminate new text with ESC.
  • "s" replaces the character under the cursor with any amount of new text. This differs from "i" and "a", which insert or add text before or after the cursor position. As usual, terminate new text with ESC.
  • "o" opens up a new line below the line on which the cursor is positioned and then starts inserting your new text at the beginning of that new line.
  • "O" (uppercase) opens the new line above the line on which the cursor is positioned. Terminate new text with ESC.

Undo and repeat

There is a general "undo" facility in vi to reverse any changes made to the buffer by the most recent command. vi saves information about the most recent change in a separate buffer so it can do this. Typing "u" causes the most recent change to be undone; typing a second "u" in succession "undoes" the "undo", getting your change back. Typing "U" undoes all changes on the current line.

The command "." (a single period) will repeat the last change. For example, if you type "I", then three spaces, then ESC (that is, press the "I" key and then press the space bar three times, and then press the "ESC" key), this will insert three blank spaces at the beginning of the current line. You can make the same change to the next line by moving the cursor to any position on the next line and pressing "." (the period key).

Regular expressions

A regular expression is a pattern or template used in a string matching or searching operation. Regular expressions are used by many programs that need to search for text in a file or perform substitutions or other operations on text. Such programs include the vi editor, the grep file searching program, and the data matching and manipulation utilities expr, awk, and sed. For either searching or replacing, regular expressions allow you to work with patterns of characters, not just fixed strings of characters. This allows greater power and flexibility in your commands.

In addition to regular characters, regular expressions contain special characters called "metacharacters". These characters mean something other than what they appear to be. They are like variables in a program. Remember that many metacharacters in regular expressions also have a special (different) meaning to the shell. So if you are typing a command at the shell prompt, such as grep, that requires a regular expression as an argument, be sure to enclose the regular expression in a pair of single quotes.

In regular expressions, all characters that are not "special" match themselves only. If you want a metacharacter to stand for itself, rather than its special meaning, precede it with the escape character "\" (backslash). Obviously, to match a backslash character itself, you need two in a row (the first "escapes" the special meaning of the second backslash as an escape character).

The basic metacharacters permit matches of arbitrary characters.

  • . (period) matches any single character.
  • [list] matches any single character of the set given in list (any one of the characters between the brackets). If the set of possible matching characters is in ascending ASCII collating sequence, you can abbreviate the list as a-b, where a and b are the end points of the sequence you want to allow, for example, [a-z] for all lowercase letters. To include "-", "]" or "^" in the list, precede it with the backslash escape character ("\").
  • [^list] matches any single character which is not in list. The syntax for the list is the same as shown above.
  • "*" An asterisk that follows a single character, a period, or a bracketed list, means to match zero or more occurrences of that expression. Thus, ab* would match a followed by zero or more occurrences of b. This is different from the behavior of the asterisk when interpreted by the shell as a file name wildcard. .* would match zero or more occurrences of any character - matches anything, including nothing!

A second group of metacharacters allow you to "anchor" the match to a location in the line.

  • ^ at the start of a regular expression means that it will only match if it occurs at the beginning of a line.
  • $ at the end of a regular expression means that it will only match if it occurs at the end of a line.

^ and $ only have special meanings if used at the beginning or end, respectively, of a regular expression (or for the case of "^", also if at the beginning of a list of characters in square brackets) -- otherwise they are ordinary characters.

The general rule is that a regular expression matches the longest among the possible leftmost matches in a line. For example, if I use the regular expression t.*e with a substitution command in an editor such as vi, and the next line has "the tree is bare", the expression will match not just the first word "the", but the entire phrase "the tree is bare", which starts with the "t" character, has any number of other characters following, and ends with the "e" character.
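The metacharacters described above can be tried from the shell with grep and sed. This is a minimal sketch: the sample words are invented for illustration, and the replacement text X is an arbitrary marker chosen to make the extent of the match visible.

```shell
#!/bin/sh
# ab* : an "a" followed by zero or more "b" characters.
printf 'a\nabbb\nxyz\n' | grep 'ab*'          # prints a and abbb

# [^list] : any single character NOT in the list.
printf 'cat\ncot\ncut\n' | grep 'c[^a]t'      # prints cot and cut

# ^ anchors the match to the beginning of the line.
printf 'top\nstop\n' | grep '^top'            # prints top

# Longest leftmost match: t.*e swallows the whole phrase,
# not just the first word "the".
printf 'the tree is bare\n' | sed 's/t.*e/X/' # prints X
```

Each pattern is enclosed in single quotes so that the shell passes it to grep or sed unchanged.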

File searching with grep

grep is an acronym for "global regular expression printer". It finds and prints (to standard output) all lines in the standard input (or specified files) that match or contain a specified regular expression. This allows you to filter the contents of a file, copying only certain lines to a second file (or to a pipeline). Basic syntax:

    grep regular_expression filenames

The regular expression can be a fixed string (like Geology) or it can use metacharacters (such as ".", "*", "^", or "$"). If it uses metacharacters, be sure to enclose the entire regular expression within single quotes (apostrophes) to prevent the shell itself from trying to interpret some of these metacharacters for filename expansion, thus preventing grep from even seeing them. Examples:

  • grep '^[A-Z]' note prints (lists) all lines in file note that begin with a capital letter.
  • grep '^$' note prints all empty (null) lines.
  • grep '^[ ^I][ ^I]*' note prints all lines that begin with at least one, and possibly multiple, blank or tab character (here ^I stands for a literal tab character). Basically, shows all lines that begin with "white space".
  • grep '[gG]eology' note prints all lines with the word geology, capitalized or not.

If you give more than one filename, grep searches each in turn for matching lines. As the matching lines are listed to standard output, they are prefixed with the name of the file in which they were found. For example, suppose I have three files named red, green, and blue. Files green and blue contain the word "Geology". If I use grep to search all three files at once with the command

    grep Geology red green blue

it will list the matching lines to standard output prefixed by the filename, for example:

    green:Geology ...
    blue:The Geology Corner ...

If you just want to know which files contain a line that matches the regular expression, but don't need to actually see the line(s), then use the -l option to grep. This tells grep to simply list the names of the files (from the input argument list) that contain the regular expression. For example, I could just list the names of the files from my set above that contain at least one line with the word "Geology" using

    grep -l Geology red green blue

which would produce the output:

    green
    blue

If you want to filter (select) certain lines from a set of files, and don't want grep to prefix each matching line with the name of file where it was found, simply use cat to concatenate all the files together first into one data stream and pipe that to grep, for example:

    cat red green blue | grep Geology

grep has many other options. One of the most useful is "-v". This option inverts the sense of the match; it says to print (send to standard output) only those lines from the input that do not match the expression. For example, empty lines in a file can be matched by the regular expression "^$" (that is, at the beginning, look for the end). So this grep command will select only the non-blank lines from an input file:

    grep -v '^$' inputfile

Warning: Be careful when using grep to search multiple files with output redirection. Don't ever try a command like this:

    grep sometext * > textout

This may immediately fill up the disk if there are any matching lines in any of the files. Why? The problem here is that the shell first creates an empty output file textout, and then interprets the * wildcard character to match all files in the directory, including the new output file. The list of files, sorted alphabetically, is passed to grep. If grep finds any matching lines in the files before textout, it will add them to the end of the textout file. Then, when it opens textout to search it (remember, the shell ended up including textout itself in the list of files that match the * wildcard character) it will find those matching lines and append them to the end of textout, and then read those newly appended lines and append them again, ad infinitum, until the disk fills up completely.
A solution here is to direct the output to a file in another directory, so the output file is not matched by the asterisk wildcard character.

On many UNIX systems, there are actually several "grep" commands with slightly different names that are optimized for different situations. Typically, you will find these two programs in addition to standard grep. Check the on-line manual entries to see the special syntax for these variants.

  • fgrep: A faster version that only uses fixed search strings, not regular expressions.
  • egrep: An extended, generally faster version that handles logical "and" or "or" of regular expressions, but cannot handle some very large files.

More about regular expressions

Search and substitution patterns in the vi editor use the same regular expression syntax, as do other text processing utilities in UNIX. Some commands have extensions or additional metacharacters beyond this basic set. Pangea, and other UNIX systems based primarily on Berkeley UNIX, do not permit repeated pattern match specifications of the form "{n,m}" or "{m}", as described in McGilton.

What about logical combinations of regular expressions? There is no standard set of logical operators used by all utilities. Logical "and" is relatively straightforward. You can use an expression that links the two desired strings with ".*" (match any characters in between). This only matches in the order given; it is not a true logical "and". Or, pipe two grep commands together where each matches one expression. Examples:

  • grep 'use.*this' filename finds lines in filename that contain "use" followed by "this", separated by zero or more characters.
  • grep 'use' filename | grep 'this' finds all lines in filename that contain the string "use", and then further restricts that to the subset that also contains "this". Here, it does not matter whether "this" comes before or after "use" on the line.

egrep allows the "|" character as a logical "or" operator. Use it to separate two regular expressions; it matches the line if either expression matches the line. For clarity about what is to be or'd, put the alternative regular expressions in sets of parentheses. To prevent the shell from thinking that those parentheses, vertical line, and other metacharacters should be interpreted by it (as process control and pipe symbols), enclose the entire expression in single quotes (').

Intermediate text editing with vi

Continue on to the note on intermediate text editing in vi to find out how to operate on larger chunks of text at a time, move text around within a file or to another file, make global substitutions within the file, run another UNIX command and insert the output into your file, send lines from your file to another UNIX command for processing (such as sorting) and then return the output back to your file, and other topics.
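The grep examples from the sections above can be sketched as one short script. The contents of the files red, green, and blue are invented for illustration, mktemp is assumed to be available, and grep -E is used as the standard spelling of the extended syntax that egrep provides.

```shell
#!/bin/sh
# Hypothetical scratch directory with three invented sample files.
workdir=$(mktemp -d)
printf 'no match here\n'                         > "$workdir/red"
printf 'Geology 101\n\nuse this\n'               > "$workdir/green"
printf 'The Geology Corner\nthis is of no use\n' > "$workdir/blue"

(
    cd "$workdir" || exit 1

    # Matching lines, each prefixed with the file it came from.
    grep Geology red green blue        # green:Geology 101, blue:The Geology Corner

    # Only the names of the files that contain a match.
    grep -l Geology red green blue     # green, then blue

    # Inverted match: only the non-empty lines of green.
    grep -v '^$' green                 # Geology 101, use this

    # "and" in a fixed order: "use" followed later by "this".
    grep 'use.*this' green blue        # green:use this

    # "and" in either order: filter once, then filter the result.
    grep 'use' green blue | grep 'this'

    # "or": a line matches if either alternative matches.
    grep -E '(Geology)|(use)' green blue
)

rm -rf "$workdir"
```

The subshell keeps the change of directory local, so the final cleanup runs from the original working directory.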