데이터분석 전처리 Pandas 필수 함수(iloc, loc, groupby, drop 등) 뽀개기

2021. 1. 16. 05:00빅데이터 잡아라/ADP

반응형
Data Analysis
  • 필수 라이브러리 import
  • 데이터 불러오기
In [2]:
import pandas as pd
pd.options.display.max_columns = 20
pd.options.display.max_rows = 10

absolute_path = '../Data/FoodAgricultureOrganization/Food_Agriculture_Organization_UN_Full.csv'
raw_data = pd.read_csv(absolute_path, sep=',')


  • 데이터 확인하기
In [3]:
raw_data.head()
Out[3]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200

5 rows × 63 columns

  • 데이터 타입 확인하기
In [4]:
raw_data.dtypes
Out[4]:
Area Abbreviation     object
Area Code              int64
Area                  object
Item Code              int64
Item                  object
                      ...
Y2009                float64
Y2010                float64
Y2011                float64
Y2012                  int64
Y2013                  int64
Length: 63, dtype: object
  • string타입으로 바꾸는 astype()
In [5]:
raw_data['Item Code'] = raw_data['Item Code'].astype(str)
  • 데이터 정보 확인하기
In [6]:
raw_data.describe()
Out[6]:
Area Code Element Code latitude longitude Y1961 Y1962 Y1963 Y1964 Y1965 Y1966 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
count 21477.000000 21477.000000 21477.000000 21477.000000 17938.000000 17938.000000 17938.000000 17938.000000 17938.000000 17938.000000 ... 21128.000000 21128.000000 21373.000000 21373.000000 21373.000000 21373.000000 21373.000000 21373.000000 21477.000000 21477.000000
mean 125.449411 5211.687154 20.450613 15.794445 195.262069 200.782250 205.464600 209.925577 217.556751 225.988962 ... 486.690742 493.153256 496.319328 508.482104 522.844898 524.581996 535.492069 553.399242 560.569214 575.557480
std 72.868149 146.820079 24.628336 66.012104 1864.124336 1884.265591 1861.174739 1862.000116 2014.934333 2100.228354 ... 5001.782008 5100.057036 5134.819373 5298.939807 5496.697513 5545.939303 5721.089425 5883.071604 6047.950804 6218.379479
min 1.000000 5142.000000 -40.900000 -172.100000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -169.000000 -246.000000
25% 63.000000 5142.000000 6.430000 -11.780000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 120.000000 5142.000000 20.590000 19.150000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 6.000000 6.000000 7.000000 7.000000 7.000000 7.000000 7.000000 8.000000 8.000000 8.000000
75% 188.000000 5142.000000 41.150000 46.870000 21.000000 22.000000 23.000000 24.000000 25.000000 26.000000 ... 75.000000 77.000000 78.000000 80.000000 82.000000 83.000000 83.000000 86.000000 88.000000 90.000000
max 276.000000 5521.000000 64.960000 179.410000 112227.000000 109130.000000 106356.000000 104234.000000 119378.000000 118495.000000 ... 360767.000000 373694.000000 388100.000000 402975.000000 425537.000000 434724.000000 451838.000000 462696.000000 479028.000000 489299.000000

8 rows × 57 columns

  • columns이 많을때는 describe().T로 transpose해주기!
In [7]:
raw_data.describe().T
Out[7]:
count mean std min 25% 50% 75% max
Area Code 21477.0 125.449411 72.868149 1.0 63.00 120.00 188.00 276.00
Element Code 21477.0 5211.687154 146.820079 5142.0 5142.00 5142.00 5142.00 5521.00
latitude 21477.0 20.450613 24.628336 -40.9 6.43 20.59 41.15 64.96
longitude 21477.0 15.794445 66.012104 -172.1 -11.78 19.15 46.87 179.41
Y1961 17938.0 195.262069 1864.124336 0.0 0.00 1.00 21.00 112227.00
... ... ... ... ... ... ... ... ...
Y2009 21373.0 524.581996 5545.939303 0.0 0.00 7.00 83.00 434724.00
Y2010 21373.0 535.492069 5721.089425 0.0 0.00 7.00 83.00 451838.00
Y2011 21373.0 553.399242 5883.071604 0.0 0.00 8.00 86.00 462696.00
Y2012 21477.0 560.569214 6047.950804 -169.0 0.00 8.00 88.00 479028.00
Y2013 21477.0 575.557480 6218.379479 -246.0 0.00 8.00 90.00 489299.00

57 rows × 8 columns

  • 데이터 전처리 iloc, loc 함수 뽀개기!
  • iloc[0]은 행을 뽑아 냅니다. 그런데 별로 안예쁘죠?
In [8]:
raw_data.iloc[0]
Out[8]:
Area Abbreviation                    AF
Area Code                             2
Area                        Afghanistan
Item Code                          2511
Item                 Wheat and products
                            ...
Y2009                              4538
Y2010                              4605
Y2011                              4711
Y2012                              4810
Y2013                              4895
Name: 0, Length: 63, dtype: object
  • iloc[[0]] 이렇게 괄호 2개를 사용해서 dataframe형식 그대로 뽑아낼 수 있어요
In [9]:
raw_data.iloc[[1]]
Out[9]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422

1 rows × 63 columns

  • iloc[행,열] 이 표현은 열을 뽑아낼 수 있어요
In [10]:
raw_data.iloc[:,0]
Out[10]:
0        AF
1        AF
2        AF
3        AF
4        AF
         ..
21472    ZW
21473    ZW
21474    ZW
21475    ZW
21476    ZW
Name: Area Abbreviation, Length: 21477, dtype: object
  • iloc[행,열] 이 표현은 열을 뽑아낼 수 있어요, 위와 마찬가지로 dataframe처럼 예쁘게 보일려면 [[]] 괄호2개를 넣어주세요!
In [11]:
raw_data.iloc[:,[0]]
Out[11]:
Area Abbreviation
0 AF
1 AF
2 AF
3 AF
4 AF
... ...
21472 ZW
21473 ZW
21474 ZW
21475 ZW
21476 ZW

21477 rows × 1 columns

  • 복습! 행을 여러개 뽑아보기
In [12]:
raw_data.iloc[0:5]
Out[12]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200

5 rows × 63 columns

  • 복습! 열을 여러개 뽑아보기
In [13]:
raw_data.iloc[:,0:2]
Out[13]:
Area Abbreviation Area Code
0 AF 2
1 AF 2
2 AF 2
3 AF 2
4 AF 2
... ... ...
21472 ZW 181
21473 ZW 181
21474 ZW 181
21475 ZW 181
21476 ZW 181

21477 rows × 2 columns

  • 복습! 행과 열을 여러개 뽑아보기
In [14]:
raw_data.iloc[0:5,0:2]
Out[14]:
Area Abbreviation Area Code
0 AF 2
1 AF 2
2 AF 2
3 AF 2
4 AF 2


  • loc()함수! 위의 iloc()와 어떤 차이점이 있을지 저도 매우 궁금했어요.
  • loc(), iloc() 비교 실습으로 알아봅시다.
  • loc여도 행1개를 추출할 수 있어요
In [15]:
raw_data.loc[[1]]
Out[15]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422

1 rows × 63 columns

  • iloc와 똑같이 뽑을 수 있어요
In [16]:
raw_data.iloc[[1]]
Out[16]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422

1 rows × 63 columns

  • loc함수는 여러개 행을 범위로 이렇게 뽑을 수 없네요
In [17]:
raw_data.loc[[1:3]]
  File "<ipython-input-17-32675df71253>", line 1
    raw_data.loc[[1:3]]
                   ^
SyntaxError: invalid syntax
  • loc함수는 여러개 행을 index를 지정해서 이렇게 뽑을 수 있어요..
In [18]:
raw_data.loc[[1,3]]
Out[18]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89

2 rows × 63 columns

  • iloc함수도 마찬가지로 index로 몇개 추출할 수 있습니다.
In [19]:
raw_data.iloc[[1,3]]
Out[19]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89

2 rows × 63 columns

  • loc함수는 범위로 뽑을려면 [[]] 괄호가 2개가 없어도 됩니다.
  • 그런데, [1:3]은 대부분 1,2 까지 뽑아내지만, loc는 1,2,3 까지 뽑아 냅니다.
In [20]:
raw_data.loc[1:3]
Out[20]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89

3 rows × 63 columns

  • iloc함수로 뽑으면 [1:3]이 1,2로 뽑힐 수 있습니다.
In [21]:
raw_data.iloc[1:3]
Out[21]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360

2 rows × 63 columns

  • 그럼 loc함수의 장점은 무엇일까요? 이제 나옵니다.
  • 열을 뽑을 때 열의 명을 가지고 뽑을 수 있어요. 사실 열의 명으로 열을 뽑는게 더 쉽잖아요! 유용하게 쓸 수 있을 것 같아요
In [22]:
raw_data.loc[[1,3],['Area Code','Area']]
Out[22]:
Area Code Area
1 2 Afghanistan
3 2 Afghanistan
  • iloc함수로는 열의 이름으로 뽑을 수 없습니다.
In [23]:
raw_data.iloc[[1,3],['Area Code','Area']]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-23-5375d4a0863b> in <module>
----> 1 raw_data.iloc[[1,3],['Area Code','Area']]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    871                     # AttributeError for IntervalTree get_value
    872                     pass
--> 873             return self._getitem_tuple(key)
    874         else:
    875             # we by definition only have the 0th axis

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   1441     def _getitem_tuple(self, tup: Tuple):
   1442
-> 1443         self._has_valid_tuple(tup)
   1444         try:
   1445             return self._getitem_lowerdim(tup)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    700                 raise IndexingError("Too many indexers")
    701             try:
--> 702                 self._validate_key(k, i)
    703             except ValueError as err:
    704                 raise ValueError(

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
   1361             # check that the key has a numeric dtype
   1362             if not is_numeric_dtype(arr.dtype):
-> 1363                 raise IndexError(f".iloc requires numeric indexers, got {arr}")
   1364
   1365             # check that the key does not exceed the maximum size of the index

IndexError: .iloc requires numeric indexers, got ['Area Code' 'Area']
  • loc함수는 열의 명을 범위로 모두 뽑을 수 있더라구요!
  • 유용하게 쓸 수 있을 것 같아요
In [24]:
raw_data.loc[[1,3],'Item':'Y2013']
Out[24]:
Item Element Code Element Unit latitude longitude Y1961 Y1962 Y1963 Y1964 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
1 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 183.0 183.0 182.0 220.0 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
3 Barley and products 5142 Food 1000 tonnes 33.94 67.71 237.0 237.0 237.0 238.0 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89

2 rows × 59 columns

  • 내가 원하는 값 뽑아내기, 이것은 굳이 loc함수를 안써도 되어요
In [25]:
raw_data.loc[raw_data['Item']=='Sugar beet']
Out[25]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
10 AF 2 Afghanistan 2537 Sugar beet 5521 Feed 1000 tonnes 33.94 67.71 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
103 AL 3 Albania 2537 Sugar beet 5521 Feed 1000 tonnes 41.15 20.17 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1 1
699 AM 1 Armenia 2537 Sugar beet 5521 Feed 1000 tonnes 40.07 45.04 ... 1.0 2.0 1.0 2.0 1.0 3.0 1.0 1.0 0 1
832 AU 10 Australia 2537 Sugar beet 5521 Feed 1000 tonnes -25.27 133.78 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
1099 AZ 52 Azerbaijan 2537 Sugar beet 5521 Feed 1000 tonnes 40.14 47.58 ... 1.0 1.0 4.0 3.0 4.0 4.0 6.0 6.0 4 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20020 AE 225 United Arab Emirates 2537 Sugar beet 5521 Feed 1000 tonnes 23.42 53.85 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0 0
20676 UZ 235 Uzbekistan 2537 Sugar beet 5521 Feed 1000 tonnes 41.38 64.59 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
20900 VE 236 Venezuela (Bolivarian Republic of) 2537 Sugar beet 5142 Food 1000 tonnes 6.42 -66.59 ... 17.0 20.0 23.0 21.0 21.0 21.0 30.0 35.0 20 22
21135 YE 249 Yemen 2537 Sugar beet 5521 Feed 1000 tonnes 15.55 48.52 ... 0.0 0.0 1.0 13.0 0.0 0.0 0.0 0.0 0 0
21374 ZW 181 Zimbabwe 2537 Sugar beet 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

66 rows × 63 columns

  • 내가 원하는 값 뽑아내기, iloc는 할 수 없어요.
In [26]:
raw_data.iloc[raw_data['Item']=='Sugar beet']
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-26-24cc43d7f2af> in <module>
----> 1 raw_data.iloc[raw_data['Item']=='Sugar beet']

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    877
    878             maybe_callable = com.apply_if_callable(key, self.obj)
--> 879             return self._getitem_axis(maybe_callable, axis=axis)
    880
    881     def _is_scalar_access(self, key: Tuple):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1480
   1481         if com.is_bool_indexer(key):
-> 1482             self._validate_key(key, axis)
   1483             return self._getbool_axis(key, axis=axis)
   1484

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
   1337                 if key.index.inferred_type == "integer":
   1338                     raise NotImplementedError(
-> 1339                         "iLocation based boolean "
   1340                         "indexing on an integer type "
   1341                         "is not available"

NotImplementedError: iLocation based boolean indexing on an integer type is not available
  • 내가 원하는 값 뽑아내기, 이것은 굳이 loc함수를 안써도 되고 이렇게 그냥 list의 특성으로 뽑아 낼 수 있어요
In [27]:
raw_data[raw_data['Item']=='Sugar beet']
Out[27]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
10 AF 2 Afghanistan 2537 Sugar beet 5521 Feed 1000 tonnes 33.94 67.71 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
103 AL 3 Albania 2537 Sugar beet 5521 Feed 1000 tonnes 41.15 20.17 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1 1
699 AM 1 Armenia 2537 Sugar beet 5521 Feed 1000 tonnes 40.07 45.04 ... 1.0 2.0 1.0 2.0 1.0 3.0 1.0 1.0 0 1
832 AU 10 Australia 2537 Sugar beet 5521 Feed 1000 tonnes -25.27 133.78 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
1099 AZ 52 Azerbaijan 2537 Sugar beet 5521 Feed 1000 tonnes 40.14 47.58 ... 1.0 1.0 4.0 3.0 4.0 4.0 6.0 6.0 4 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20020 AE 225 United Arab Emirates 2537 Sugar beet 5521 Feed 1000 tonnes 23.42 53.85 ... 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0 0
20676 UZ 235 Uzbekistan 2537 Sugar beet 5521 Feed 1000 tonnes 41.38 64.59 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
20900 VE 236 Venezuela (Bolivarian Republic of) 2537 Sugar beet 5142 Food 1000 tonnes 6.42 -66.59 ... 17.0 20.0 23.0 21.0 21.0 21.0 30.0 35.0 20 22
21135 YE 249 Yemen 2537 Sugar beet 5521 Feed 1000 tonnes 15.55 48.52 ... 0.0 0.0 1.0 13.0 0.0 0.0 0.0 0.0 0 0
21374 ZW 181 Zimbabwe 2537 Sugar beet 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

66 rows × 63 columns

  • 내가 원하는 값 뽑아내기, 같은 값을 2개 뽑아낼 수도 있어요 loc함수로!
In [28]:
raw_data.loc[raw_data['Item']=='Sugar beet','Area']
Out[28]:
10                              Afghanistan
103                                 Albania
699                                 Armenia
832                               Australia
1099                             Azerbaijan
                        ...
20020                  United Arab Emirates
20676                            Uzbekistan
20900    Venezuela (Bolivarian Republic of)
21135                                 Yemen
21374                              Zimbabwe
Name: Area, Length: 66, dtype: object
  • 저는 loc함수가 이 대목에서 대박이라 생각합니다. 원하는 값을 뽑아내면서, 열 명으로 원하는 열만 추출하기!
In [29]:
raw_data.loc[raw_data['Item']=='Sugar beet',['Area']]
Out[29]:
Area
10 Afghanistan
103 Albania
699 Armenia
832 Australia
1099 Azerbaijan
... ...
20020 United Arab Emirates
20676 Uzbekistan
20900 Venezuela (Bolivarian Republic of)
21135 Yemen
21374 Zimbabwe

66 rows × 1 columns

  • list의 특성으로는 안되요
In [30]:
raw_data[raw_data['Item']=='Sugar beet',['Area']]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-30-d0ec0ffd7749> in <module>
----> 1 raw_data[raw_data['Item']=='Sugar beet',['Area']]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2904             if self.columns.nlevels > 1:
   2905                 return self._getitem_multilevel(key)
-> 2906             indexer = self.columns.get_loc(key)
   2907             if is_integer(indexer):
   2908                 indexer = [indexer]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2896             casted_key = self._maybe_cast_indexer(key)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
   2900                 raise KeyError(key) from err

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(0        False
1        False
2        False
3        False
4        False
         ...
21472    False
21473    False
21474    False
21475    False
21476    False
Name: Item, Length: 21477, dtype: bool, ['Area'])' is an invalid key
  • 저는 loc함수가 이 대목에서 대박이라 생각합니다2. 원하는 값을 뽑아내면서, 열 명으로 원하는 여러 열도 추출하기!
In [31]:
raw_data.loc[raw_data['Item']=='Sugar beet',['Area','Item','latitude']]
Out[31]:
Area Item latitude
10 Afghanistan Sugar beet 33.94
103 Albania Sugar beet 41.15
699 Armenia Sugar beet 40.07
832 Australia Sugar beet -25.27
1099 Azerbaijan Sugar beet 40.14
... ... ... ...
20020 United Arab Emirates Sugar beet 23.42
20676 Uzbekistan Sugar beet 41.38
20900 Venezuela (Bolivarian Republic of) Sugar beet 6.42
21135 Yemen Sugar beet 15.55
21374 Zimbabwe Sugar beet -19.02

66 rows × 3 columns

  • endswith('many')함수는 값뒤에 'many'가 있으면 추출하는 함수입니다.
In [32]:
raw_data.loc[raw_data['Area'].str.endswith('many')]
Out[32]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
7532 DE 79 Germany 2511 Wheat and products 5521 Feed 1000 tonnes 51.17 10.45 ... 8006.0 11084.0 10644.0 8993.0 10669.0 10608.0 8189.0 9242.0 7868 7494
7533 DE 79 Germany 2511 Wheat and products 5142 Food 1000 tonnes 51.17 10.45 ... 6524.0 6931.0 6796.0 6886.0 6868.0 7137.0 7235.0 7204.0 6712 6900
7534 DE 79 Germany 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 51.17 10.45 ... 7.0 9.0 12.0 17.0 8.0 18.0 13.0 12.0 7 7
7535 DE 79 Germany 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 51.17 10.45 ... 206.0 212.0 226.0 244.0 251.0 257.0 243.0 271.0 275 277
7536 DE 79 Germany 2513 Barley and products 5521 Feed 1000 tonnes 51.17 10.45 ... 7571.0 6878.0 7845.0 6940.0 7255.0 7592.0 7543.0 6062.0 6316 6598
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7674 DE 79 Germany 2948 Milk - Excluding Butter 5142 Food 1000 tonnes 51.17 10.45 ... 20252.0 20878.0 21323.0 21661.0 21091.0 22077.0 21431.0 21171.0 21169 21401
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5
7676 DE 79 Germany 2960 Fish, Seafood 5142 Food 1000 tonnes 51.17 10.45 ... 1122.0 1177.0 1238.0 1287.0 1226.0 1187.0 1175.0 1188.0 1128 1039
7677 DE 79 Germany 2961 Aquatic Products, Other 5142 Food 1000 tonnes 51.17 10.45 ... 3.0 3.0 4.0 4.0 4.0 4.0 4.0 5.0 1 0
7678 DE 79 Germany 2928 Miscellaneous 5142 Food 1000 tonnes 51.17 10.45 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

147 rows × 63 columns

  • 저는 isin함수를 더 많이 쓰는데, isin(['Germany'])는 Germany 값과 같은 행을 뽑아 냅니다.
In [33]:
raw_data.loc[raw_data['Area'].isin(['Germany'])]
Out[33]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
7532 DE 79 Germany 2511 Wheat and products 5521 Feed 1000 tonnes 51.17 10.45 ... 8006.0 11084.0 10644.0 8993.0 10669.0 10608.0 8189.0 9242.0 7868 7494
7533 DE 79 Germany 2511 Wheat and products 5142 Food 1000 tonnes 51.17 10.45 ... 6524.0 6931.0 6796.0 6886.0 6868.0 7137.0 7235.0 7204.0 6712 6900
7534 DE 79 Germany 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 51.17 10.45 ... 7.0 9.0 12.0 17.0 8.0 18.0 13.0 12.0 7 7
7535 DE 79 Germany 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 51.17 10.45 ... 206.0 212.0 226.0 244.0 251.0 257.0 243.0 271.0 275 277
7536 DE 79 Germany 2513 Barley and products 5521 Feed 1000 tonnes 51.17 10.45 ... 7571.0 6878.0 7845.0 6940.0 7255.0 7592.0 7543.0 6062.0 6316 6598
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7674 DE 79 Germany 2948 Milk - Excluding Butter 5142 Food 1000 tonnes 51.17 10.45 ... 20252.0 20878.0 21323.0 21661.0 21091.0 22077.0 21431.0 21171.0 21169 21401
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5
7676 DE 79 Germany 2960 Fish, Seafood 5142 Food 1000 tonnes 51.17 10.45 ... 1122.0 1177.0 1238.0 1287.0 1226.0 1187.0 1175.0 1188.0 1128 1039
7677 DE 79 Germany 2961 Aquatic Products, Other 5142 Food 1000 tonnes 51.17 10.45 ... 3.0 3.0 4.0 4.0 4.0 4.0 4.0 5.0 1 0
7678 DE 79 Germany 2928 Miscellaneous 5142 Food 1000 tonnes 51.17 10.45 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

147 rows × 63 columns

In [34]:
raw_data[raw_data['Area'].isin(['Germany'])]
Out[34]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
7532 DE 79 Germany 2511 Wheat and products 5521 Feed 1000 tonnes 51.17 10.45 ... 8006.0 11084.0 10644.0 8993.0 10669.0 10608.0 8189.0 9242.0 7868 7494
7533 DE 79 Germany 2511 Wheat and products 5142 Food 1000 tonnes 51.17 10.45 ... 6524.0 6931.0 6796.0 6886.0 6868.0 7137.0 7235.0 7204.0 6712 6900
7534 DE 79 Germany 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 51.17 10.45 ... 7.0 9.0 12.0 17.0 8.0 18.0 13.0 12.0 7 7
7535 DE 79 Germany 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 51.17 10.45 ... 206.0 212.0 226.0 244.0 251.0 257.0 243.0 271.0 275 277
7536 DE 79 Germany 2513 Barley and products 5521 Feed 1000 tonnes 51.17 10.45 ... 7571.0 6878.0 7845.0 6940.0 7255.0 7592.0 7543.0 6062.0 6316 6598
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7674 DE 79 Germany 2948 Milk - Excluding Butter 5142 Food 1000 tonnes 51.17 10.45 ... 20252.0 20878.0 21323.0 21661.0 21091.0 22077.0 21431.0 21171.0 21169 21401
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5
7676 DE 79 Germany 2960 Fish, Seafood 5142 Food 1000 tonnes 51.17 10.45 ... 1122.0 1177.0 1238.0 1287.0 1226.0 1187.0 1175.0 1188.0 1128 1039
7677 DE 79 Germany 2961 Aquatic Products, Other 5142 Food 1000 tonnes 51.17 10.45 ... 3.0 3.0 4.0 4.0 4.0 4.0 4.0 5.0 1 0
7678 DE 79 Germany 2928 Miscellaneous 5142 Food 1000 tonnes 51.17 10.45 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

147 rows × 63 columns

  • 저는 isin함수를 더 많이 쓰는데, isin([])는 2개이상으로 같은 값을 추출할 수 있어요
In [35]:
raw_data[raw_data['Area'].isin(['Germany','France'])]
Out[35]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
6910 FR 68 France 2511 Wheat and products 5521 Feed 1000 tonnes 46.23 2.21 ... 9867.0 11311.0 10091.0 8483.0 10283.0 8351.0 6262.0 7727.0 7179 7822
6911 FR 68 France 2511 Wheat and products 5142 Food 1000 tonnes 46.23 2.21 ... 6057.0 6140.0 6402.0 6102.0 6677.0 6331.0 6986.0 6765.0 6984 6971
6912 FR 68 France 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 46.23 2.21 ... 80.0 75.0 84.0 88.0 92.0 82.0 81.0 91.0 99 101
6913 FR 68 France 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 46.23 2.21 ... 296.0 325.0 317.0 367.0 350.0 347.0 339.0 361.0 339 314
6914 FR 68 France 2513 Barley and products 5521 Feed 1000 tonnes 46.23 2.21 ... 3537.0 3560.0 4014.0 3446.0 4275.0 4428.0 4332.0 3137.0 3163 2865
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7674 DE 79 Germany 2948 Milk - Excluding Butter 5142 Food 1000 tonnes 51.17 10.45 ... 20252.0 20878.0 21323.0 21661.0 21091.0 22077.0 21431.0 21171.0 21169 21401
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5
7676 DE 79 Germany 2960 Fish, Seafood 5142 Food 1000 tonnes 51.17 10.45 ... 1122.0 1177.0 1238.0 1287.0 1226.0 1187.0 1175.0 1188.0 1128 1039
7677 DE 79 Germany 2961 Aquatic Products, Other 5142 Food 1000 tonnes 51.17 10.45 ... 3.0 3.0 4.0 4.0 4.0 4.0 4.0 5.0 1 0
7678 DE 79 Germany 2928 Miscellaneous 5142 Food 1000 tonnes 51.17 10.45 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

287 rows × 63 columns

In [36]:
raw_data.loc[(raw_data['Area'].str.endswith('many')) & (raw_data['Element']=='Feed')]
Out[36]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
7532 DE 79 Germany 2511 Wheat and products 5521 Feed 1000 tonnes 51.17 10.45 ... 8006.0 11084.0 10644.0 8993.0 10669.0 10608.0 8189.0 9242.0 7868 7494
7534 DE 79 Germany 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 51.17 10.45 ... 7.0 9.0 12.0 17.0 8.0 18.0 13.0 12.0 7 7
7536 DE 79 Germany 2513 Barley and products 5521 Feed 1000 tonnes 51.17 10.45 ... 7571.0 6878.0 7845.0 6940.0 7255.0 7592.0 7543.0 6062.0 6316 6598
7538 DE 79 Germany 2514 Maize and products 5521 Feed 1000 tonnes 51.17 10.45 ... 3356.0 3481.0 2947.0 3752.0 4199.0 3895.0 3572.0 4251.0 5434 6136
7540 DE 79 Germany 2515 Rye and products 5521 Feed 1000 tonnes 51.17 10.45 ... 1400.0 1300.0 1287.0 1391.0 1858.0 2416.0 1625.0 1360.0 2150 3318
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7661 DE 79 Germany 2918 Vegetables 5521 Feed 1000 tonnes 51.17 10.45 ... 89.0 83.0 84.0 84.0 89.0 94.0 89.0 89.0 91 90
7668 DE 79 Germany 2945 Offals 5521 Feed 1000 tonnes 51.17 10.45 ... 2.0 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1 1
7670 DE 79 Germany 2946 Animal fats 5521 Feed 1000 tonnes 51.17 10.45 ... 16.0 18.0 20.0 15.0 16.0 14.0 18.0 18.0 20 17
7673 DE 79 Germany 2948 Milk - Excluding Butter 5521 Feed 1000 tonnes 51.17 10.45 ... 1813.0 1701.0 1595.0 1633.0 2224.0 1503.0 1468.0 1771.0 2088 2064
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5

43 rows × 63 columns

  • list특성으로 값을 추출할 수 있는데, & 기호(and 라는 뜻)를 이용해서 값을 추출 할 수 있습니다.
In [37]:
raw_data[(raw_data['Area'].str.endswith('many')) & (raw_data['Element']=='Feed')]
Out[37]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
7532 DE 79 Germany 2511 Wheat and products 5521 Feed 1000 tonnes 51.17 10.45 ... 8006.0 11084.0 10644.0 8993.0 10669.0 10608.0 8189.0 9242.0 7868 7494
7534 DE 79 Germany 2805 Rice (Milled Equivalent) 5521 Feed 1000 tonnes 51.17 10.45 ... 7.0 9.0 12.0 17.0 8.0 18.0 13.0 12.0 7 7
7536 DE 79 Germany 2513 Barley and products 5521 Feed 1000 tonnes 51.17 10.45 ... 7571.0 6878.0 7845.0 6940.0 7255.0 7592.0 7543.0 6062.0 6316 6598
7538 DE 79 Germany 2514 Maize and products 5521 Feed 1000 tonnes 51.17 10.45 ... 3356.0 3481.0 2947.0 3752.0 4199.0 3895.0 3572.0 4251.0 5434 6136
7540 DE 79 Germany 2515 Rye and products 5521 Feed 1000 tonnes 51.17 10.45 ... 1400.0 1300.0 1287.0 1391.0 1858.0 2416.0 1625.0 1360.0 2150 3318
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7661 DE 79 Germany 2918 Vegetables 5521 Feed 1000 tonnes 51.17 10.45 ... 89.0 83.0 84.0 84.0 89.0 94.0 89.0 89.0 91 90
7668 DE 79 Germany 2945 Offals 5521 Feed 1000 tonnes 51.17 10.45 ... 2.0 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1 1
7670 DE 79 Germany 2946 Animal fats 5521 Feed 1000 tonnes 51.17 10.45 ... 16.0 18.0 20.0 15.0 16.0 14.0 18.0 18.0 20 17
7673 DE 79 Germany 2948 Milk - Excluding Butter 5521 Feed 1000 tonnes 51.17 10.45 ... 1813.0 1701.0 1595.0 1633.0 2224.0 1503.0 1468.0 1771.0 2088 2064
7675 DE 79 Germany 2960 Fish, Seafood 5521 Feed 1000 tonnes 51.17 10.45 ... 11.0 11.0 32.0 51.0 28.0 25.0 29.0 25.0 6 5

43 rows × 63 columns

In [38]:
raw_data[(raw_data['Y2004'] < 1000) & (raw_data['Y2004'] > 990)]
Out[38]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
3754 CL 40 Chile 2531 Potatoes and products 5142 Food 1000 tonnes -35.68 -71.54 ... 998.0 980.0 1018.0 875.0 884.0 850.0 1003.0 1132.0 1096 1089
7630 DE 79 Germany 2744 Eggs 5142 Food 1000 tonnes 51.17 10.45 ... 999.0 974.0 1011.0 997.0 1011.0 1016.0 1032.0 1052.0 990 1010
7672 DE 79 Germany 2949 Eggs 5142 Food 1000 tonnes 51.17 10.45 ... 999.0 974.0 1011.0 997.0 1011.0 1016.0 1032.0 1052.0 990 1010
9038 IN 100 India 2513 Barley and products 5142 Food 1000 tonnes 20.59 78.96 ... 995.0 954.0 944.0 691.0 740.0 1276.0 984.0 1217.0 1152 835
9104 IN 100 India 2641 Pimento 5142 Food 1000 tonnes 20.59 78.96 ... 998.0 844.0 1012.0 1010.0 1019.0 941.0 893.0 954.0 871 1018
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13436 MM 28 Myanmar 2948 Milk - Excluding Butter 5142 Food 1000 tonnes 21.91 95.96 ... 992.0 1069.0 1117.0 1253.0 1298.0 1472.0 1581.0 1660.0 1600 1677
13669 NP 149 Nepal 2919 Fruits - Excluding Wine 5142 Food 1000 tonnes 28.39 84.12 ... 993.0 1050.0 1044.0 1098.0 1195.0 1178.0 1301.0 1428.0 1859 1700
15525 PT 174 Portugal 2511 Wheat and products 5142 Food 1000 tonnes 39.40 -8.22 ... 991.0 982.0 995.0 1002.0 976.0 970.0 978.0 980.0 994 1000
19714 TM 213 Turkmenistan 2905 Cereals - Excluding Beer 5142 Food 1000 tonnes 38.97 59.56 ... 995.0 1009.0 1031.0 1019.0 1017.0 1023.0 1019.0 1039.0 1036 1059
20768 UZ 235 Uzbekistan 2919 Fruits - Excluding Wine 5142 Food 1000 tonnes 41.38 64.59 ... 993.0 1079.0 1448.0 1538.0 1693.0 1875.0 2108.0 2159.0 2422 2638

14 rows × 63 columns

  • loc함수의 가장 큰 장점을 이용하여 뽑을 수 있어요! 장점은 열의 명으로 열을 추출하기!
In [39]:
raw_data.loc[(raw_data['Y2004'] < 1000) & (raw_data['Y2004'] > 990), ['Area', 'Item', 'latitude']]
Out[39]:
Area Item latitude
3754 Chile Potatoes and products -35.68
7630 Germany Eggs 51.17
7672 Germany Eggs 51.17
9038 India Barley and products 20.59
9104 India Pimento 20.59
... ... ... ...
13436 Myanmar Milk - Excluding Butter 21.91
13669 Nepal Fruits - Excluding Wine 28.39
15525 Portugal Wheat and products 39.40
19714 Turkmenistan Cereals - Excluding Beer 38.97
20768 Uzbekistan Fruits - Excluding Wine 41.38

14 rows × 3 columns


  • drop함수를 이용해서 열 삭제하기. 꼭 axis=1를 써야 열이 삭제가 되어요
In [40]:
raw_data.drop('Area',axis=1)
Out[40]:
Area Abbreviation Area Code Item Code Item Element Code Element Unit latitude longitude Y1961 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 1928.0 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 183.0 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 76.0 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 237.0 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 210.0 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 230.0 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 27.0 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 6.0 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 62 columns

  • drop함수 내의 columns=를 지정해주면, axis=1을 안써도 됩니다.
In [41]:
raw_data.drop(columns="Area")
Out[41]:
Area Abbreviation Area Code Item Code Item Element Code Element Unit latitude longitude Y1961 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
0 AF 2 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 1928.0 ... 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0 4711.0 4810 4895
1 AF 2 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 183.0 ... 419.0 445.0 546.0 455.0 490.0 415.0 442.0 476.0 425 422
2 AF 2 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 76.0 ... 58.0 236.0 262.0 263.0 230.0 379.0 315.0 203.0 367 360
3 AF 2 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 237.0 ... 185.0 43.0 44.0 48.0 62.0 55.0 60.0 72.0 78 89
4 AF 2 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 210.0 ... 120.0 208.0 233.0 249.0 247.0 195.0 178.0 191.0 200 200
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 230.0 ... 373.0 357.0 359.0 356.0 341.0 385.0 418.0 457.0 426 451
21473 ZW 181 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 27.0 ... 5.0 4.0 9.0 6.0 9.0 5.0 15.0 15.0 15 15
21474 ZW 181 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 6.0 ... 18.0 14.0 17.0 14.0 15.0 18.0 29.0 40.0 40 40
21475 ZW 181 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
21476 ZW 181 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0

21477 rows × 62 columns

  • 여러 열을 삭제하기
In [42]:
raw_data.drop(['Y2011','Y2012','Y2013'], axis=1)
Out[42]:
Area Abbreviation Area Code Area Item Code Item Element Code Element Unit latitude longitude ... Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010
0 AF 2 Afghanistan 2511 Wheat and products 5142 Food 1000 tonnes 33.94 67.71 ... 2668.0 2776.0 3095.0 3249.0 3486.0 3704.0 4164.0 4252.0 4538.0 4605.0
1 AF 2 Afghanistan 2805 Rice (Milled Equivalent) 5142 Food 1000 tonnes 33.94 67.71 ... 411.0 448.0 460.0 419.0 445.0 546.0 455.0 490.0 415.0 442.0
2 AF 2 Afghanistan 2513 Barley and products 5521 Feed 1000 tonnes 33.94 67.71 ... 29.0 70.0 48.0 58.0 236.0 262.0 263.0 230.0 379.0 315.0
3 AF 2 Afghanistan 2513 Barley and products 5142 Food 1000 tonnes 33.94 67.71 ... 83.0 122.0 144.0 185.0 43.0 44.0 48.0 62.0 55.0 60.0
4 AF 2 Afghanistan 2514 Maize and products 5521 Feed 1000 tonnes 33.94 67.71 ... 48.0 89.0 63.0 120.0 208.0 233.0 249.0 247.0 195.0 178.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
21472 ZW 181 Zimbabwe 2948 Milk - Excluding Butter 5142 Food 1000 tonnes -19.02 29.15 ... 439.0 360.0 386.0 373.0 357.0 359.0 356.0 341.0 385.0 418.0
21473 ZW 181 Zimbabwe 2960 Fish, Seafood 5521 Feed 1000 tonnes -19.02 29.15 ... 5.0 1.0 0.0 5.0 4.0 9.0 6.0 9.0 5.0 15.0
21474 ZW 181 Zimbabwe 2960 Fish, Seafood 5142 Food 1000 tonnes -19.02 29.15 ... 18.0 16.0 14.0 18.0 14.0 17.0 14.0 15.0 18.0 29.0
21475 ZW 181 Zimbabwe 2961 Aquatic Products, Other 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
21476 ZW 181 Zimbabwe 2928 Miscellaneous 5142 Food 1000 tonnes -19.02 29.15 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

21477 rows × 60 columns


  • unique()함수는 행의 값중에서 1개씩만 추출해주는 함수에요
In [43]:
raw_data['Item'].unique()
Out[43]:
array(['Wheat and products', 'Rice (Milled Equivalent)',
       'Barley and products', 'Maize and products', 'Millet and products',
       'Cereals, Other', 'Potatoes and products', 'Sugar cane',
       'Sugar beet', 'Sugar (Raw Equivalent)', 'Sweeteners, Other',
       'Honey', 'Pulses, Other and products', 'Nuts and products',
       'Coconuts - Incl Copra', 'Sesame seed',
       'Olives (including preserved)', 'Soyabean Oil', 'Groundnut Oil',
       'Sunflowerseed Oil', 'Rape and Mustard Oil', 'Cottonseed Oil',
       'Palm Oil', 'Sesameseed Oil', 'Olive Oil', 'Oilcrops Oil, Other',
       'Tomatoes and products', 'Vegetables, Other',
       'Oranges, Mandarines', 'Citrus, Other', 'Bananas',
       'Apples and products', 'Pineapples and products', 'Dates',
       'Grapes and products (excl wine)', 'Fruits, Other',
       'Coffee and products', 'Cocoa Beans and products',
       'Tea (including mate)', 'Pepper', 'Spices, Other', 'Wine', 'Beer',
       'Beverages, Alcoholic', 'Bovine Meat', 'Mutton & Goat Meat',
       'Poultry Meat', 'Meat, Other', 'Offals, Edible', 'Butter, Ghee',
       'Cream', 'Fats, Animals, Raw', 'Eggs', 'Milk - Excluding Butter',
       'Freshwater Fish', 'Infant food', 'Cereals - Excluding Beer',
       'Starchy Roots', 'Sugar Crops', 'Sugar & Sweeteners', 'Pulses',
       'Treenuts', 'Oilcrops', 'Vegetable Oils', 'Vegetables',
       'Fruits - Excluding Wine', 'Stimulants', 'Spices',
       'Alcoholic Beverages', 'Meat', 'Offals', 'Animal fats',
       'Fish, Seafood', 'Miscellaneous', 'Rye and products', 'Oats',
       'Sorghum and products', 'Cassava and products', 'Sweet potatoes',
       'Roots, Other', 'Beans', 'Peas', 'Soyabeans',
       'Groundnuts (Shelled Eq)', 'Rape and Mustardseed',
       'Oilcrops, Other', 'Maize Germ Oil', 'Onions',
       'Lemons, Limes and products', 'Grapefruit and products',
       'Plantains', 'Pimento', 'Beverages, Fermented', 'Pigmeat',
       'Fish, Body Oil', 'Fish, Liver Oil', 'Demersal Fish',
       'Pelagic Fish', 'Marine Fish, Other', 'Crustaceans', 'Cephalopods',
       'Molluscs, Other', 'Aquatic Animals, Others', 'Aquatic Plants',
       'Aquatic Products, Other', 'Palm kernels', 'Coconut Oil', 'Cloves',
       'Palmkernel Oil', 'Yams', 'Sunflower seed', 'Cottonseed',
       'Sugar non-centrifugal', 'Ricebran Oil', 'Meat, Aquatic Mammals'],
      dtype=object)
  • value_counts()는 유일한 값 중에서 몇개 있는지 값을 세어주는 함수에요. 유용합니다.
In [44]:
raw_data['Item'].value_counts()
Out[44]:
Milk - Excluding Butter     558
Eggs                        360
Cereals - Excluding Beer    347
Fish, Seafood               337
Maize and products          333
                           ...
Sugar non-centrifugal        32
Palm kernels                 24
Cottonseed                   21
Ricebran Oil                 18
Meat, Aquatic Mammals         3
Name: Item, Length: 115, dtype: int64
  • sort_values()함수로 오름차순으로 추출해줄 수 있어요
In [74]:
raw_data['Item'].value_counts().sort_values()
Out[74]:
Meat, Aquatic Mammals         3
Ricebran Oil                 18
Cottonseed                   21
Palm kernels                 24
Sugar non-centrifugal        32
                           ...
Maize and products          333
Fish, Seafood               337
Cereals - Excluding Beer    347
Eggs                        360
Milk - Excluding Butter     558
Name: Item, Length: 115, dtype: int64
In [73]:
raw_data['Item'].sort_values()
Out[73]:
16297    Alcoholic Beverages
7390     Alcoholic Beverages
320      Alcoholic Beverages
14592    Alcoholic Beverages
16818    Alcoholic Beverages
                ...
5796                    Yams
9825                    Yams
20284                   Yams
13703                   Yams
15545                   Yams
Name: Item, Length: 21477, dtype: object


  • 정말 유용한 groupby함수를 간단히 맛보기로 보자면,
  • groupby(['Area'])는 Area값별로 각 열의 카운트(count())를 세어주는 거에요
  • groupby는 단독으로 사용할수 없고, count(), sum(), max() 등과 같은 연산함수와 사용해야합니다.
In [52]:
raw_data.groupby(['Area']).count()
Out[52]:
Area Abbreviation Area Code Item Code Item Element Code Element Unit latitude longitude Y1961 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
Area
Afghanistan 83 83 83 83 83 83 83 83 83 83 ... 83 83 83 83 83 83 83 83 83 83
Albania 123 123 123 123 123 123 123 123 123 123 ... 123 123 123 123 123 123 123 123 123 123
Algeria 124 124 124 124 124 124 124 124 124 124 ... 124 124 124 124 124 124 124 124 124 124
Angola 109 109 109 109 109 109 109 109 109 109 ... 109 109 109 109 109 109 109 109 109 109
Antigua and Barbuda 117 117 117 117 117 117 117 117 117 117 ... 117 117 117 117 117 117 117 117 117 117
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Venezuela (Bolivarian Republic of) 130 130 130 130 130 130 130 130 130 130 ... 130 130 130 130 130 130 130 130 130 130
Viet Nam 109 109 109 109 109 109 109 109 109 109 ... 109 109 109 109 109 109 109 109 109 109
Yemen 119 119 119 119 119 119 119 119 119 119 ... 119 119 119 119 119 119 119 119 119 119
Zambia 120 120 120 120 120 120 120 120 120 120 ... 120 120 120 120 120 120 120 120 120 120
Zimbabwe 121 121 121 121 121 121 121 121 121 121 ... 121 121 121 121 121 121 121 121 121 121

174 rows × 62 columns

In [58]:
raw_data.groupby(['Area Code']).sum()
Out[58]:
Element Code latitude longitude Y1961 Y1962 Y1963 Y1964 Y1965 Y1966 Y1967 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
Area Code
1 697151 5329.31 5990.32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 5515.0 5992.0 6228.0 6753.0 6958.0 6824.0 5908.0 6423.0 6978 7175
2 430576 2817.02 5619.93 9481.0 9414.0 9194.0 10170.0 10473.0 10169.0 11289.0 ... 16542.0 17658.0 18317.0 19248.0 19381.0 20661.0 21030.0 21100.0 22706 23007
3 641941 5061.45 2480.91 1706.0 1749.0 1767.0 1889.0 1884.0 1995.0 2046.0 ... 6637.0 6719.0 6911.0 6744.0 7168.0 7316.0 7907.0 8114.0 8221 8271
4 645946 3475.72 205.84 7488.0 7235.0 6861.0 7255.0 7509.0 7536.0 7986.0 ... 48619.0 49562.0 51067.0 49933.0 50916.0 57505.0 60071.0 65852.0 69365 72161
7 566163 -1220.80 1947.83 4834.0 4775.0 5240.0 5286.0 5527.0 5677.0 5833.0 ... 25541.0 26696.0 28247.0 29877.0 32053.0 36985.0 38400.0 40573.0 38064 48639
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
255 712577 6868.00 607.92 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 35273.0 32353.0 32234.0 35016.0 34902.0 34964.0 32708.0 32196.0 31326 33336
256 665162 6327.14 778.51 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1308.0 1290.0 1295.0 1300.0 1377.0 1345.0 1375.0 1432.0 1501 1511
272 664025 5590.54 2668.27 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 25423.0 22473.0 26902.0 25401.0 25350.0 24799.0 16587 25399
273 614715 5039.78 2285.66 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1605.0 1659.0 1860.0 1891.0 1810.0 1887.0 1786 1848
276 538558 1337.44 3142.88 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 42543 43661

174 rows × 56 columns

  • groupby함수를 사용해서, group화 하면서 내가 원하는 행만 추출할 수 있어요
In [61]:
raw_data.groupby(['Area Code'])['Y1961'].count()
Out[61]:
Area Code
1        0
2       83
3      123
4      124
7      109
      ...
255      0
256      0
272      0
273      0
276      0
Name: Y1961, Length: 174, dtype: int64
  • groupby함수를 사용해서, group화 하면서 내가 원하는 행만 추출할 수 있어요. 예쁘게 할려면 [[]] 2개의 괄호를 써야되는거 위에서도 보셧죠?
In [62]:
raw_data.groupby(['Area Code'])[['Y1961']].count()
Out[62]:
Y1961
Area Code
1 0
2 83
3 123
4 124
7 109
... ...
255 0
256 0
272 0
273 0
276 0

174 rows × 1 columns

In [64]:
raw_data.groupby(['Area Code','Y1961']).count()
Out[64]:
Area Abbreviation Area Item Code Item Element Code Element Unit latitude longitude Y1962 ... Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012 Y2013
Area Code Y1961
2 0.0 27 27 27 27 27 27 27 27 27 27 ... 27 27 27 27 27 27 27 27 27 27
1.0 2 2 2 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
2.0 4 4 4 4 4 4 4 4 4 4 ... 4 4 4 4 4 4 4 4 4 4
3.0 3 3 3 3 3 3 3 3 3 3 ... 3 3 3 3 3 3 3 3 3 3
4.0 3 3 3 3 3 3 3 3 3 3 ... 3 3 3 3 3 3 3 3 3 3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
251 165.0 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
358.0 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
373.0 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
499.0 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
578.0 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1

5681 rows × 61 columns

In [69]:
raw_data.groupby(['Item'])['Y2010'].sum()
Out[69]:
Item
Alcoholic Beverages        247414.0
Animal fats                 24267.0
Apples and products         62436.0
Aquatic Animals, Others      1340.0
Aquatic Plants              12767.0
                             ...
Vegetables                 950541.0
Vegetables, Other          745438.0
Wheat and products         549926.0
Wine                        22297.0
Yams                        35898.0
Name: Y2010, Length: 115, dtype: float64
In [72]:
raw_data['Item'].isin(raw_data['Item']).value_counts()
Out[72]:
True    21477
Name: Item, dtype: int64
In [ ]: