OS Programming Lecture #4
1. BASH Programming (on a Unix system): reading one million words from text files
A more complex shell script
We extract the vocabulary and word-usage frequencies from the Brown Corpus, the first machine-readable corpus.
The script automatically walks through every file in the brown folder, extracting the words and their usage frequencies.
The program removes symbols such as ', `, [, ], and $, and builds the word counts in a hashmap data structure (also known as a "dictionary").
Once all data files have been read, the script prints the word-usage frequencies in the final for loop.
Use man sed to look up the meaning of the sed commands and try to understand them.
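Before writing the full script, it can help to try the tag-stripping substitution on a single line. A minimal sketch (the sample tokens below are made up, but follow the Brown Corpus "word/TAG" format):

```shell
# Brown Corpus tokens have the form "word/TAG"; this sed keeps the word and
# drops the tag. "_" is used as the s-command delimiter so "/" needs no
# escaping; \( \) capture the word and \1 substitutes it back.
echo 'The/at Fulton/np-tl County/nn-tl' | sed 's_\([^ ]*\)/[^ ]*_\1_g'
# prints: The Fulton County
```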
Create WordFrequencies.sh and enter the following code:
declare -A hashmap
for file in brown/*[0-9]; do
    echo "Reading $file"
    # Strip the POS tag from each "word/TAG" token, keeping only the word
    sed 's_\([^ ]*\)/[^ ]*_\1_g' "$file" > t1.txt
    # Remove the symbols ' ` [ ] $ one at a time; note that ` [ and $
    # must be escaped (or protected by single quotes) to be taken literally
    sed "s/'//g" t1.txt > t2.txt
    sed 's/`//g' t2.txt > t3.txt
    sed 's/\[//g' t3.txt > t4.txt
    sed 's/\]//g' t4.txt > t5.txt
    sed 's/\$//g' t5.txt > t6.txt
    while read -r line; do
        if [ ${#line} -gt 0 ]; then
            #echo "$line"
            for word in $line; do
                if [ ${#word} -gt 0 ]; then
                    #echo "$word"
                    if [ ${hashmap[$word]+_} ]; then
                        hashmap[$word]=$((hashmap[$word]+1))
                    else
                        hashmap[$word]=1
                    fi
                fi
            done
        fi
    done < t6.txt
done
for i in "${!hashmap[@]}"; do
    echo "$i ${hashmap[$i]}"
done
Run it! (Be patient: the script takes a long time to finish!)
Comment out some lines of the code above and observe how the output changes, to deepen your understanding.
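The counting idiom at the heart of the script can also be tried in isolation. A minimal sketch (requires Bash 4+ for associative arrays; the sample words are made up):

```shell
declare -A counts                                  # associative array: word -> count
for word in the cat sat on the mat the end; do
    counts[$word]=$(( ${counts[$word]:-0} + 1 ))   # default to 0 if unseen
done
for w in "${!counts[@]}"; do
    echo "$w ${counts[$w]}"
done
# "the" is printed with count 3; every other word with count 1
```

Here `${counts[$word]:-0}` is a compact alternative to the explicit `${hashmap[$word]+_}` existence test used in the script; both work.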
Then try the following code to answer the questions in the homework. Reference code:
declare -A hashmap
for file in brown/*[0-9]; do
    echo "Reading $file"
    sed 's_\([^ ]*\)/[^ ]*_\1_g' "$file" > t1.txt
    sed "s/'//g" t1.txt > t2.txt
    sed 's/`//g' t2.txt > t3.txt
    sed 's/\[//g' t3.txt > t4.txt
    sed 's/\]//g' t4.txt > t5.txt
    sed 's/\$//g' t5.txt > t6.txt
    while read -r line; do
        if [ ${#line} -gt 0 ]; then
            for word in $line; do
                if [ ${#word} -gt 0 ]; then
                    if [ ${hashmap[$word]+_} ]; then
                        hashmap[$word]=$((hashmap[$word]+1))
                    else
                        hashmap[$word]=1
                    fi
                fi
            done
        fi
    done < t6.txt
    #break
done
numWords=0
topWord=""
topFreq=0
sumFreq=0
for i in "${!hashmap[@]}"; do
    echo "$i ${hashmap[$i]}"
    numWords=$((numWords+1))
    if [ $topFreq -lt ${hashmap[$i]} ]; then
        topWord=$i
        topFreq=${hashmap[$i]}
    fi
    sumFreq=$((sumFreq+hashmap[$i]))
done
avgFreq=$(echo "$sumFreq/$numWords" | bc -l)
echo "What is the total number of words? Answer=$numWords"
echo "What is the most frequent word? Answer=$topWord"
echo "What is the number of hits of the most frequent word? Answer=$topFreq"
echo "Average word frequency=$avgFreq"
echo "Does the memory used grow as your script reads more data, and why? Answer=Yes, because the variable 'hashmap' grows with more data."
2. Process Management:
(1) BASH - Process execution
First, we write a script loop.sh that loops forever. Reference code:
#!/bin/bash
let num=1
while true; do
    let square=$num*$num
    echo $num $square
    let num=$num+1
done
echo "Program terminated ..."   # never reached: the loop above is infinite
Ctrl+C stops a running script.
In the first terminal, run ps aux.
Open a new terminal and run ps aux | grep bash.
Go back to the first terminal and run loop.sh.
Switch to the new terminal, run ps aux | grep bash and ps aux | awk '$8 == "R+"', and compare the results.
Go back to the first terminal and terminate the loop.sh process (Ctrl+C).
Switch to the new terminal and run ps aux | grep bash and ps aux | awk '$8 == "R+"' again.
Compare the results!
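To see what a ps row looks like for a process you control, you can inspect the current shell itself; a small sketch ($$ expands to the shell's own PID):

```shell
echo "this shell's PID is $$"
# Select our own row from "ps aux" output: field 2 is the PID, field 8 the
# state code (e.g. S = sleeping, R = running; "+" marks a foreground process)
ps aux | awk -v pid="$$" '$2 == pid'
```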
The ps aux command is a tool for monitoring the processes running on your Linux system.
A process is associated with every program running on your system, and is used to manage and monitor the program's memory usage, processor time, and I/O resources.
(2) BASH - process termination with the kill command
Use kill to terminate a running process:
Run the infinite-loop script loop.sh in the first terminal.
Switch to a second terminal, find the process running the loop.sh script, and note down its PID.
Now we use kill to terminate that process: in the second terminal, try running kill PID.
Go back to the first terminal: has the script been terminated?
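The find-and-kill sequence can also be scripted end to end. A self-contained sketch that uses a background sleep as a stand-in for loop.sh (pgrep ships in the same procps package as ps):

```shell
#!/bin/bash
sleep 987654 &                        # stand-in for ./loop.sh
target=$!                             # $! holds the PID of the last background job
pid=$(pgrep -f "sleep 987654" | head -n 1)
pid=${pid:-$target}                   # fall back to $! if pgrep is unavailable
kill "$pid"                           # sends SIGTERM (signal 15) by default
wait "$pid" 2>/dev/null
kill -0 "$pid" 2>/dev/null || echo "process $pid terminated"
```

In the exercise itself you would look the PID up with ps aux | grep loop.sh (or pgrep -f loop.sh) and pass it to kill by hand; kill -9 PID (SIGKILL) forces termination if the process ignores SIGTERM.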
3. Homework:
(1) Get familiar with the WordFrequencies.sh script. Based on the execution of the reference code above, try to answer the following questions:
- Does the memory used grow as your script reads more data? (Open the system monitor and watch the memory usage.)
- What is the total number of words?
- What is the most frequent word?
- What is the number of hits of the most frequent word?
- What is the average word frequency?
(2) Run the WordFrequencies.sh script from (1). While it is running, open another terminal to monitor the process, then go back to the terminal running the process and kill it. Paste screenshots of the procedure and of the ps aux | grep bash output into your lab report.